Kernel vs. Kernel: Exploring How the Data Structure Affects Neural Collapse

Vignesh Kothapalli
LinkedIn Inc.

Tom Tirer
Bar-Ilan University, Israel
Abstract

Recently, a vast amount of literature has focused on the “Neural Collapse” (NC) phenomenon, which emerges when training neural network (NN) classifiers beyond the zero training error point. The core component of NC is the decrease in the within-class variability of the network’s deepest features, dubbed NC1. The theoretical works that study NC are typically based on simplified unconstrained features models (UFMs) that mask any effect of the data on the extent of collapse. In this paper, we provide a kernel-based analysis that does not suffer from this limitation. First, given a kernel function, we establish expressions for the traces of the within- and between-class covariance matrices of the samples’ features (and consequently an NC1 metric). Then, we turn to kernels associated with shallow NNs. We first consider the NN Gaussian Process kernel (NNGP), associated with the network at initialization, and the complementary Neural Tangent Kernel (NTK), associated with its training in the “lazy regime”. Interestingly, we show that the NTK does not represent more collapsed features than the NNGP for prototypical data models. As NC emerges from training, we then consider an alternative to NTK: the recently proposed adaptive kernel, which generalizes NNGP to model the feature mapping learned from the training data. Contrasting our NC1 analysis for these two kernels yields insights into the effect of data distribution on the extent of collapse, which are empirically aligned with the behavior observed with practical training of NNs.

1 Introduction

Deep Neural Network classifiers are often trained beyond the zero training error point [1, 2]. In this regime, a phenomenon dubbed “Neural Collapse” (NC) emerges [3]. NC is typically described by the following components: (NC1) the networks’ deepest features exhibit a significant decrease in the variability of within-class samples, (NC2) the mean features of different classes approach a certain symmetric structure, and (NC3) the last layer’s weights become more aligned with the penultimate layer features’ means. This behavior has been observed both when using the cross-entropy (CE) loss [3] and the mean squared error (MSE) loss [4].

Recently, a vast amount of literature has been dedicated to exploring NC (as surveyed in [5]), studying the effect of imbalanced data [6, 7], depthwise evolution [8, 9, 10, 11], fine-grained structures [12, 13, 14], and implications [15, 16, 17]. Note that without a sufficient decrease in the features’ within-class variability around the class means, measured by NC1 metrics, one may not gain valuable insights from the structure of the means of different classes. Therefore, oftentimes, NC papers focus specifically on NC1 rather than on other components of NC [12, 13, 14, 16, 11, 18]. Notably, most of the works that attempt to theoretically analyze the NC behavior [19, 6, 15, 4, 20, 8, 21, 7, 22, 12, 23, 10, 24] are based on variants of the unconstrained features model (UFM) [19], which treats the deepest features of the training samples as free optimization variables. Therefore, these analyses cannot predict the effect of the data structure on the extent of collapse.

Since theoretically analyzing the behavior of (deep) NNs is challenging, simplifying approaches that are based on kernel methods [25] have gained massive popularity [26, 27, 28, 29, 30, 31]. Prominent examples include the NN Gaussian Process kernel (NNGP) [26, 27, 32], associated with infinitely wide NNs at initialization, and the Neural Tangent Kernel (NTK) [28], associated with their training in the “lazy regime”, where the learning rate is sufficiently small. These approaches, and in particular NTK, were used to provide mathematical reasoning for deep learning phenomena such as achieving zero training loss [28, 30, 31], faster learning of lower frequencies [33], benefits of ResNets over fully connected networks [34, 35, 36], usefulness of positional encoding in coordinate-based NNs [37], and more. More recently, finite-width variants of the aforementioned kernels have been studied [38, 39, 40], aiming to mitigate the gap between practical NN behavior and infinite-width analyses [41, 42, 43, 44, 45].

In this paper, we provide a kernel-based analysis for NC1, the core component of NC, which does not suffer from the limitations of UFM-based analysis. Thus, it allows us to explore how the data structure affects the collapse. Since kernels provide a fixed feature mapping, we propose a “kernel vs. kernel” analysis, that is, gaining insights by comparing the properties across NN-related kernels.

Our main contributions can be summarized as follows:

  • Given an arbitrary kernel function, we establish expressions for the traces of the within- and between-class covariance matrices of the samples’ features — and consequently, an NC1 metric that depends on the features only through the kernel function.

  • We specialize our kernel-based NC1 to kernels associated with shallow NNs. We analyze it for NNGP (NN at initialization), and NTK (NN trained in the “lazy regime”) and show that, perhaps surprisingly, the NTK does not represent more collapsed features than the NNGP for prototypical data models.

  • As NC emerges from training, we consider an alternative to NTK: the recently proposed adaptive kernel [40], which generalizes NNGP to model the feature mapping learned from the training data. Contrasting our NC1 analysis for these two kernels yields deeper insights into the effect of data distribution on the extent of collapse, which are empirically aligned with the behavior observed with practical training of NNs.

2 Related Work

As mentioned in the previous section, most of the works that attempt to analyze the NC behavior theoretically [19, 6, 15, 4, 20, 8, 21, 7, 22, 23, 10] are based on variants of the UFM [19]. The work in [12] attempts to generalize the UFM by adding a penalty term to the loss that ensures that the features matrix is in the vicinity of a predefined matrix. Yet, the model still lacks an explicit connection between the features and the data. In [46], the authors avoid optimizing the features directly but assume that the model is linear and that the data is nearly orthogonal, which are restrictive assumptions. In [47], the authors claim that having an exact class-wise block structure in the Gram matrix of the empirical NTK on training samples implies NC. Yet, they do not provide reasoning for reaching this collapsed Gram matrix and the analysis is still disconnected from the data. Here, however, we fully depart from the UFM approach and provide analysis that explicitly depends on the data. Furthermore, unlike most works, our analysis is applicable to the less studied case where the data is class imbalanced [6, 17, 23, 7]. Our kernel-based analysis utilizes results on NNGP and NTK from [48, 49, 27, 28, 31]. To simplify the analysis, we theoretically analyze kernels associated with shallow fully connected NNs. Focusing on shallow networks is justified by recent works demonstrating monotonic depthwise evolution of NC1 [8, 9, 10, 11, 12]. Specifically, a data structure that promotes a larger reduction in NC1 for shallow NNs is expected to be more collapsed when using deep NNs. Since NC is related to training, and we show the limitation of NTK to capture it when compared to NNGP, we also utilize the generalization of NNGP that has been proposed in [40]. In this adaptive kernel model, there is an explicit kernel function that depends on the training data. 
Recently, this richer model has been used to study phase transition behaviors, such as grokking [50], that cannot be captured with the data-independent kernels.

3 Problem Setup

In this section, we outline the notations and the setup.

• Data: We consider a dataset $\mathbf{X}\in\mathbb{R}^{d_{0}\times N}$, comprising $N$ data points of dimension $d_{0}$ belonging to $C$ classes. Each class has size $n_{c}, c\in[C]$, where $[C]:=\{1,2,\cdots,C\}$ and $\sum\nolimits_{c}n_{c}=N$. The dataset is represented in an “organized” matrix form as $\mathbf{X}=\begin{bmatrix}\mathbf{x}^{1,1}&\cdots&\mathbf{x}^{1,n_{1}}&\mathbf{x}^{2,1}&\cdots&\mathbf{x}^{C,n_{C}}\end{bmatrix}\in\mathbb{R}^{d_{0}\times N}$, where $\mathbf{x}^{c,i}\in\mathbb{R}^{d_{0}}$ represents the $i^{th}$ data point of the $c^{th}$ class. Specific assumptions on the data distribution will be presented during the paper together with the related theory or experiments.
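As an illustration of the “organized” layout, the following minimal sketch builds $\mathbf{X}$ one class block at a time. The isotropic Gaussian class distributions, the class sizes, and the helper name `make_organized_dataset` are assumptions chosen for illustration only; the setup above does not prescribe them. Labels are zero-indexed in the code, whereas the text indexes classes from 1.

```python
import numpy as np

def make_organized_dataset(class_sizes, class_means, noise_std=0.5, seed=0):
    """Return X (d0 x N) with columns grouped by class, plus labels (N,).

    Hypothetical helper: class c is drawn from an isotropic Gaussian centered
    at class_means[c]; this data model is an illustrative assumption.
    """
    rng = np.random.default_rng(seed)
    blocks, labels = [], []
    for c, (n_c, mu) in enumerate(zip(class_sizes, class_means)):
        mu = np.asarray(mu, dtype=float)
        # Columns x^{c,1}, ..., x^{c,n_c} of class c.
        noise = noise_std * rng.standard_normal((mu.size, n_c))
        blocks.append(mu[:, None] + noise)
        labels.append(np.full(n_c, c))
    return np.concatenate(blocks, axis=1), np.concatenate(labels)

# Imbalanced two-class example in d0 = 2 dimensions (n_1 = 3, n_2 = 5).
X, labels = make_organized_dataset([3, 5], [[-2.0, 0.0], [2.0, 0.0]])
print(X.shape)   # (2, 8)
print(labels)    # [0 0 0 1 1 1 1 1]
```

Keeping the columns grouped by class means that the Gram matrix of any kernel on $\mathbf{X}$ has a block layout indexed by class pairs.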

• Neural Network: Unless stated otherwise, we consider a 2-layer fully connected neural network (2L-FCN) $\psi:\mathbb{R}^{d_{0}}\to\mathbb{R}^{d_{2}}$ with $l^{th}$ layer width $d_{l}, l\in\{1,2\}$, and point-wise activation function $\phi(\cdot):\mathbb{R}\to\mathbb{R}$. Let $\mathbf{W}^{(l)}\in\mathbb{R}^{d_{l}\times d_{l-1}}, \mathbf{b}^{(l)}\in\mathbb{R}^{d_{l}}$ denote the weight and bias parameters of the $l^{th}$ layer. At initialization, the entries $W_{ij}^{(l)}, b_{i}^{(l)}$ are drawn i.i.d. from Gaussian distributions of mean $0$ and variance $\sigma_{w}^{2}/d_{l-1}, \sigma_{b}^{2}$, respectively. For an input $\mathbf{x}\in\mathbb{R}^{d_{0}}$ to the network $\psi(\cdot)$, we denote the $i^{th}$ component of the output vector $\hat{y}_{i}(\mathbf{x})\in\mathbb{R}$ as follows:

$$\hat{y}_{i}(\mathbf{x})=b_{i}^{(2)}+\sum_{j=1}^{d_{1}}W_{ij}^{(2)}\phi\left(z_{j}(\mathbf{x})\right),\qquad z_{j}(\mathbf{x})=b_{j}^{(1)}+\sum_{k=1}^{d_{0}}W_{jk}^{(1)}x_{k}. \tag{1}$$
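The initialization and the forward pass in Eq. (1) can be sketched in a few lines of NumPy. The function names `init_params` and `forward`, the choice $\phi=\tanh$, and the values of $\sigma_{w}$, $\sigma_{b}$, and the widths are assumptions made for this illustration.

```python
import numpy as np

def init_params(d0, d1, d2, sigma_w=1.0, sigma_b=0.1, seed=0):
    """Draw W^(l) ~ N(0, sigma_w^2 / d_{l-1}) and b^(l) ~ N(0, sigma_b^2), i.i.d."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, sigma_w / np.sqrt(d0), size=(d1, d0))
    b1 = rng.normal(0.0, sigma_b, size=d1)
    W2 = rng.normal(0.0, sigma_w / np.sqrt(d1), size=(d2, d1))
    b2 = rng.normal(0.0, sigma_b, size=d2)
    return W1, b1, W2, b2

def forward(x, params, phi=np.tanh):
    """Evaluate Eq. (1): z = b1 + W1 x, then y_hat = b2 + W2 phi(z)."""
    W1, b1, W2, b2 = params
    z = b1 + W1 @ x              # pre-activations z_j(x), shape (d1,)
    y_hat = b2 + W2 @ phi(z)     # outputs y_hat_i(x), shape (d2,)
    return y_hat, z

params = init_params(d0=2, d1=512, d2=2)
y_hat, z = forward(np.array([1.0, -0.5]), params)
print(y_hat.shape, z.shape)      # (2,) (512,)
```

Note that `rng.normal` takes a standard deviation, so the variance $\sigma_{w}^{2}/d_{l-1}$ enters as `sigma_w / np.sqrt(d_prev)`.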

• Task: We train the network $\psi(\cdot)$ to classify the data points $\mathbf{x}^{c,i}, c\in[C], i\in[n_{c}]$ to their respective classes. Let $\hat{\mathbf{Y}},\mathbf{Y}\in\mathbb{R}^{d_{2}\times N}$ denote the prediction and ground truth label matrices respectively:

$$\hat{\mathbf{Y}}=\begin{bmatrix}\psi(\mathbf{x}^{1,1})&\cdots&\psi(\mathbf{x}^{C,n_{C}})\end{bmatrix}=\begin{bmatrix}\hat{\mathbf{y}}^{1,1}&\cdots&\hat{\mathbf{y}}^{C,n_{C}}\end{bmatrix},\qquad \mathbf{Y}=\begin{bmatrix}\mathbf{y}^{1,1}&\cdots&\mathbf{y}^{C,n_{C}}\end{bmatrix} \tag{2}$$

We aim to minimize the Mean Squared Error (MSE) between $\hat{\mathbf{Y}},\mathbf{Y}$ using the following objective:

$$\mathcal{R}(\psi,\mathbf{X},\mathbf{Y})=\frac{1}{N}\left\|\hat{\mathbf{Y}}-\mathbf{Y}\right\|_{F}^{2}+\lambda\sum_{l=1}^{2}\left(\left\|\mathbf{W}^{(l)}\right\|_{F}^{2}+\left\|\mathbf{b}^{(l)}\right\|_{F}^{2}\right). \tag{3}$$

Note that training deep classifiers with MSE loss has been shown to be a useful strategy [51, 4], which is commonly considered in NC analyses [4, 8, 21].
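For concreteness, the regularized objective in Eq. (3) can be evaluated as in the sketch below. The function name `mse_objective`, the one-hot targets, and the toy parameter values are assumptions used only to make the arithmetic easy to follow.

```python
import numpy as np

def mse_objective(Y_hat, Y, weights, biases, lam):
    """Eq. (3): (1/N)||Y_hat - Y||_F^2 + lam * sum_l (||W^(l)||_F^2 + ||b^(l)||^2)."""
    N = Y.shape[1]                                # samples are stored as columns
    fit = np.sum((Y_hat - Y) ** 2) / N
    reg = sum(np.sum(W ** 2) for W in weights) + sum(np.sum(b ** 2) for b in biases)
    return fit + lam * reg

# Toy check with N = 2 samples and d2 = 2 outputs (one-hot targets assumed).
Y = np.array([[1.0, 0.0], [0.0, 1.0]])
Y_hat = np.array([[0.5, 0.0], [0.0, 1.0]])
W1, b1 = np.ones((3, 2)), np.zeros(3)
W2, b2 = np.ones((2, 3)), np.zeros(2)
loss = mse_objective(Y_hat, Y, [W1, W2], [b1, b2], lam=0.01)
print(loss)   # 0.25 / 2 + 0.01 * 12 = 0.245, up to floating-point rounding
```

The data-fit term is averaged over the $N$ samples, while the penalty sums over both layers with a shared coefficient $\lambda$, as in Eq. (3).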

• Pre- and Post-activation Kernels: For any two inputs $\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j}\in\mathbb{R}^{d_{0}}$, we denote their corresponding pre- and post-activation features as $\mathbf{z}^{c,i},\mathbf{z}^{c^{\prime},j}\in\mathbb{R}^{d_{1}}$ and $\phi(\mathbf{z}^{c,i}),\phi(\mathbf{z}^{c^{\prime},j})\in\mathbb{R}^{d_{1}}$, respectively. The pre and post-activation kernels corresponding to layer $l=1$ are given by:

$$\begin{split}K^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})&=\mathbf{z}^{c,i\top}\mathbf{z}^{c^{\prime},j}=\left(\mathbf{b}^{(1)}+\mathbf{W}^{(1)}\mathbf{x}^{c,i}\right)^{\top}\left(\mathbf{b}^{(1)}+\mathbf{W}^{(1)}\mathbf{x}^{c^{\prime},j}\right),\\ Q^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})&=\phi(\mathbf{z}^{c,i})^{\top}\phi(\mathbf{z}^{c^{\prime},j}).\end{split} \tag{4}$$
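The kernels in Eq. (4) can be evaluated for all pairs of samples at once by forming Gram matrices. The sketch below follows Eq. (4) literally, without any $1/d_{1}$ normalization; the function name `layer1_kernels`, the activation $\phi=\tanh$, and the random inputs are illustrative assumptions.

```python
import numpy as np

def layer1_kernels(X, W1, b1, phi=np.tanh):
    """Gram matrices of Eq. (4) for all pairs of columns of X (d0 x N)."""
    Z = b1[:, None] + W1 @ X      # pre-activations, one column per sample
    A = phi(Z)                    # post-activations, shape (d1, N)
    K = Z.T @ Z                   # K^(1)(x, x') = z^T z'
    Q = A.T @ A                   # Q^(1)(x, x') = phi(z)^T phi(z')
    return K, Q

rng = np.random.default_rng(0)
d0, d1, N = 2, 256, 6
X = rng.standard_normal((d0, N))
W1 = rng.normal(0.0, 1.0 / np.sqrt(d0), size=(d1, d0))
b1 = rng.normal(0.0, 0.1, size=d1)
K, Q = layer1_kernels(X, W1, b1)
print(K.shape, Q.shape)                           # (6, 6) (6, 6)
print(np.allclose(K, K.T), np.allclose(Q, Q.T))   # True True
```

Both matrices are symmetric and positive semi-definite by construction, since each is a product of a feature matrix with its own transpose.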

4 Within-Class Variability Metric (NC1) for Kernels

In this section, we derive an NC1 metric that depends on the features only through the kernel function. Let $\mathbf{H}$ be a matrix encapsulating arbitrary feature vectors associated with samples of the $C$ classes. We define the within-class covariance $\boldsymbol{\Sigma}_{W}(\mathbf{H})$ and between-class covariance $\boldsymbol{\Sigma}_{B}(\mathbf{H})$ matrices of the features as follows:

$$\boldsymbol{\Sigma}_{W}(\mathbf{H})=\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}\left(\mathbf{h}^{c,i}-\overline{\mathbf{h}}^{c}\right)\left(\mathbf{h}^{c,i}-\overline{\mathbf{h}}^{c}\right)^{\top};\qquad \boldsymbol{\Sigma}_{B}(\mathbf{H})=\frac{1}{C}\sum_{c=1}^{C}\left(\overline{\mathbf{h}}^{c}-\overline{\mathbf{h}}^{G}\right)\left(\overline{\mathbf{h}}^{c}-\overline{\mathbf{h}}^{G}\right)^{\top}, \tag{5}$$

where $\overline{\mathbf{h}}^{c}=\frac{1}{n_{c}}\sum\nolimits_{i=1}^{n_{c}}\mathbf{h}^{c,i},\forall c\in[C]$ and $\overline{\mathbf{h}}^{G}=\frac{1}{N}\sum\nolimits_{c=1}^{C}\sum\nolimits_{i=1}^{n_{c}}\mathbf{h}^{c,i}$ represent the class mean vectors and the global mean vector, respectively. Additionally, we consider the total covariance $\widetilde{\boldsymbol{\Sigma}}_{T}(\mathbf{H})$ and non-centered between-class covariance $\widetilde{\boldsymbol{\Sigma}}_{B}(\mathbf{H})$ matrices as follows:

$$\widetilde{\boldsymbol{\Sigma}}_{T}(\mathbf{H})=\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}\mathbf{h}^{c,i}\mathbf{h}^{c,i\top},\qquad \widetilde{\boldsymbol{\Sigma}}_{B}(\mathbf{H})=\frac{1}{C}\sum_{c=1}^{C}\overline{\mathbf{h}}^{c}\overline{\mathbf{h}}^{c\top}. \tag{6}$$

Based on these formulations, we define the variability metric $\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$, introduced in [12] and used also in [14, 46, 52], as:

$$\mathcal{N}\mathcal{C}_{1}(\mathbf{H}):=\frac{\mathrm{tr}(\boldsymbol{\Sigma}_{W}(\mathbf{H}))}{\mathrm{tr}(\boldsymbol{\Sigma}_{B}(\mathbf{H}))}. \tag{7}$$
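To make Eqs. (5) and (7) concrete, the following sketch evaluates the NC1 metric directly from a feature matrix, using the fact that the trace of a sum of outer products equals the sum of squared norms. The function name `nc1_from_features`, the integer label vector, and the toy features are illustrative assumptions.

```python
import numpy as np

def nc1_from_features(H, labels):
    """NC1 of Eq. (7) for features H (d x N) with integer class labels (N,)."""
    classes = np.unique(labels)
    N, C = H.shape[1], classes.size
    global_mean = H.mean(axis=1)                  # global mean of all N samples
    tr_w, tr_b = 0.0, 0.0
    for c in classes:
        H_c = H[:, labels == c]
        mean_c = H_c.mean(axis=1)                 # class mean
        # tr((h - m)(h - m)^T) = ||h - m||^2, summed over the class.
        tr_w += np.sum((H_c - mean_c[:, None]) ** 2) / N
        tr_b += np.sum((mean_c - global_mean) ** 2) / C
    return tr_w / tr_b

labels = np.array([0, 0, 1, 1])
H_spread = np.array([[-2.0, -1.0, 1.0, 2.0], [0.0, 0.0, 0.0, 0.0]])
H_tight = np.array([[-1.6, -1.4, 1.4, 1.6], [0.0, 0.0, 0.0, 0.0]])
print(nc1_from_features(H_spread, labels))   # 0.25 / 2.25, about 0.111
print(nc1_from_features(H_tight, labels))    # 0.01 / 2.25, about 0.0044
```

Both toy feature sets share the same class means, so the drop in the metric reflects only the tighter clustering around those means; a smaller value indicates more collapsed features.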
[Figure 1 panels: (a) Balanced classes; (b) Imbalanced classes]
Figure 1: Visualizing the kernel matrix $\mathbf{Q}$ for the limiting NNGP post-activation kernel function $Q_{GP-Erf}:\mathbb{R}^{2}\times\mathbb{R}^{2}\to\mathbb{R}$. The data is sampled from two Gaussian distributions in (a) balanced and (b) imbalanced fashion to illustrate the structure of the sub-matrices $\mathbf{Q}_{c,c^{\prime}}, c,c^{\prime}\in\{1,2\}$.
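A kernel matrix with the block layout described in the caption of Figure 1 can be produced along the following lines. This is a sketch under stated assumptions: it uses the well-known arcsine closed form for the expectation of a product of erf activations under Gaussian pre-activations, with pre-activation covariance $\sigma_{b}^{2}+\sigma_{w}^{2}\mathbf{x}^{\top}\mathbf{x}^{\prime}/d_{0}$. The exact definition and scaling of $Q_{GP-Erf}$ used for the figure are not given in this part of the paper and may differ; the function name `nngp_erf_gram`, the class means, and the class sizes are hypothetical.

```python
import numpy as np

def nngp_erf_gram(X, sigma_w=1.0, sigma_b=0.0):
    """Arcsine closed form of E[erf(z) erf(z')] for Gaussian pre-activations.

    Assumes pre-activation covariance K(x, x') = sigma_b^2 + sigma_w^2 x.x' / d0.
    """
    d0 = X.shape[0]
    K = sigma_b ** 2 + sigma_w ** 2 * (X.T @ X) / d0
    diag = 1.0 + 2.0 * np.diag(K)
    ratio = 2.0 * K / np.sqrt(np.outer(diag, diag))
    # The ratio lies in (-1, 1) in exact arithmetic; clip guards against rounding.
    return (2.0 / np.pi) * np.arcsin(np.clip(ratio, -1.0, 1.0))

rng = np.random.default_rng(0)
n1, n2 = 4, 8                                    # imbalanced class sizes
X = np.concatenate([rng.normal([-2.0, 0.0], 0.5, size=(n1, 2)),
                    rng.normal([2.0, 0.0], 0.5, size=(n2, 2))]).T   # (2, N)
Q = nngp_erf_gram(X)
Q_11, Q_12, Q_22 = Q[:n1, :n1], Q[:n1, n1:], Q[n1:, n1:]
print(Q_11.shape, Q_12.shape, Q_22.shape)        # (4, 4) (4, 8) (8, 8)
```

Because the columns of `X` are grouped by class, the sub-matrices $\mathbf{Q}_{c,c^{\prime}}$ are contiguous slices of `Q`, and their shapes follow the class sizes.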

In the following theorem, we formulate the traces $\mathrm{tr}(\boldsymbol{\Sigma}_{W}(\mathbf{H}))$ and $\mathrm{tr}(\boldsymbol{\Sigma}_{B}(\mathbf{H}))$ using an arbitrary kernel function $Q:\mathbb{R}^{d_{0}}\times\mathbb{R}^{d_{0}}\to\mathbb{R}$ that expresses inner product of data samples in feature space.

Theorem 4.1.

For any two data points $\mathbf{x}^{c,i},\mathbf{x}^{c',j}$, let the inner product of their associated features $\mathbf{h}^{c,i},\mathbf{h}^{c',j}$ be given by a kernel $Q:\mathbb{R}^{d_{0}}\times\mathbb{R}^{d_{0}}\to\mathbb{R}$ as $Q(\mathbf{x}^{c,i},\mathbf{x}^{c',j})=\mathbf{h}^{c,i\top}\mathbf{h}^{c',j}$. The traces of the covariance matrices $\mathrm{tr}(\boldsymbol{\Sigma}_{W}(\mathbf{H}))$ and $\mathrm{tr}(\boldsymbol{\Sigma}_{B}(\mathbf{H}))$ can now be formulated as:

$$
\mathrm{tr}(\boldsymbol{\Sigma}_{W}(\mathbf{H}))=\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c,i})-\frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_{c}^{2}}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c,j}), \tag{8}
$$

$$
\mathrm{tr}(\boldsymbol{\Sigma}_{B}(\mathbf{H}))=\frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_{c}^{2}}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c,j})-\frac{1}{N^{2}}\sum_{c=1}^{C}\sum_{c'=1}^{C}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c'}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c',j}). \tag{9}
$$

The proof (in Appendix A) leverages matrix trace properties and direct expansions of the covariance matrices $\widetilde{\boldsymbol{\Sigma}}_{T}(\mathbf{H}),\widetilde{\boldsymbol{\Sigma}}_{B}(\mathbf{H}),\boldsymbol{\Sigma}_{W}(\mathbf{H})$ in terms of vector outer products to arrive at the results.

Observe that Theorem 4.1 allows us to replace $Q(\cdot,\cdot)$ with any suitable kernel formulation corresponding to the features. In the following sections, we leverage this flexibility to analyze and compare NC1 for kernels that model the behavior of neural networks.
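Since (8) and (9) involve only sums of kernel evaluations, they can be evaluated directly from an $N\times N$ matrix of pairwise kernel values. The following NumPy sketch is an illustration rather than the authors' implementation: the function name `nc1_traces_from_gram`, the toy features, and the class sizes are hypothetical, and the NC1 value is assumed here to be the ratio $\mathrm{tr}(\boldsymbol{\Sigma}_{W})/\mathrm{tr}(\boldsymbol{\Sigma}_{B})$.

```python
import numpy as np


def nc1_traces_from_gram(Q, labels):
    """Evaluate (8) and (9) from an N x N matrix of kernel values and class labels."""
    Q = np.asarray(Q, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    N, C = Q.shape[0], classes.size

    # (1/N) * sum of Q(x, x) over all samples: first term of (8).
    diag_term = np.trace(Q) / N

    # (1/C) * sum_c (1/n_c^2) * sum_{i,j} Q(x^{c,i}, x^{c,j}):
    # second term of (8) and first term of (9).
    block_term = 0.0
    for c in classes:
        idx = np.flatnonzero(labels == c)
        block_term += Q[np.ix_(idx, idx)].sum() / idx.size**2
    block_term /= C

    # (1/N^2) * sum of all kernel values: second term of (9).
    global_term = Q.sum() / N**2

    return diag_term - block_term, block_term - global_term


# Toy check with explicit features, so that Q = H H^T (a linear kernel).
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 50)  # three balanced classes (hypothetical sizes)
H = rng.normal(size=(150, 4)) + 3.0 * labels[:, None]
tr_w, tr_b = nc1_traces_from_gram(H @ H.T, labels)
nc1 = tr_w / tr_b  # assumed NC1 metric: ratio of the two traces
```

In this toy example with equal class sizes, `tr_w` coincides with the mean squared distance of the features to their class means, and `tr_b` with the mean squared distance of the class means to the global mean, which gives a simple sanity check for the kernel-only expressions.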

5 Activation Variability in the Lazy Learning Regime

In the UFM-based analysis of NC, the assumption is that the deepest features $\mathbf{H}$ associated with the training samples $\mathbf{X}$ are free optimization variables, thus losing any ability to analyze the effect of the training data, apart from the balancedness of its labels $\mathbf{Y}$. In this section, we address these shortcomings by analyzing the role of data distributions on $\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ in the infinite width regime, where the NN behavior can be well modeled by NNGP and NTK. In particular, we focus on the case where data is sampled from a mixture of Gaussians and understand the limits of $\mathcal{N}\mathcal{C}_{1}(\mathbf{H})$ reduction.

5.1 Limiting NNGP Kernel

Under the NN model and initialization stated in Section 3, as the hidden layer width $d_{1}\to\infty$, we can characterize the pre-activation kernel $K^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})$ in terms of the GP limit [26] (commonly referred to as the NNGP limit [27]) as follows:

$$
K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})=\sigma_{b}^{2}+\frac{\sigma_{w}^{2}}{d_{0}}\mathbf{x}^{c,i\top}\mathbf{x}^{c',j}. \tag{10}
$$

In this limit, the post-activation kernel $Q^{(1)}_{GP}(\cdot,\cdot)$ can have a closed form representation depending on the choice of activation function $\phi(\cdot)$ [27, 31]. The expression for Erf activation [48] is:

$$
Q_{GP-Erf}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})=\frac{2}{\pi}\arcsin\left(\frac{2K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})}{\sqrt{1+2K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c,i})}\sqrt{1+2K_{GP}^{(1)}(\mathbf{x}^{c',j},\mathbf{x}^{c',j})}}\right). \tag{11}
$$

The formulation for the ReLU-based kernel $Q_{GP-ReLU}^{(1)}(\cdot,\cdot)$ [49] is presented in Appendix B. Given a kernel function $Q(\cdot,\cdot)$ and samples $\mathbf{X}$, we can formulate the kernel Gram matrix $\mathbf{Q}\in\mathbb{R}^{N\times N}$ as:

$$
\mathbf{Q}=\begin{bmatrix}\mathbf{Q}_{1,1}&\cdots&\mathbf{Q}_{1,C}\\ \vdots&\ddots&\vdots\\ \mathbf{Q}_{C,1}&\cdots&\mathbf{Q}_{C,C}\end{bmatrix}_{N\times N},\quad \mathbf{Q}_{c,c'}=\begin{bmatrix}Q(\mathbf{x}^{c,1},\mathbf{x}^{c',1})&\cdots&Q(\mathbf{x}^{c,1},\mathbf{x}^{c',n_{c'}})\\ \vdots&\ddots&\vdots\\ Q(\mathbf{x}^{c,n_{c}},\mathbf{x}^{c',1})&\cdots&Q(\mathbf{x}^{c,n_{c}},\mathbf{x}^{c',n_{c'}})\end{bmatrix}_{n_{c}\times n_{c'}}. \tag{12}
$$

Considering the NNGP kernel in (11), we illustrate an example $\mathbf{Q}$ matrix in Figure 1 to visualize the sub-matrices based on the imbalance/balance of class sizes.
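To make the construction concrete, the Gram matrix of the Erf NNGP kernel can be assembled from (10) and (11) in a few lines. This is a minimal sketch under stated assumptions: the function name `nngp_erf_gram`, the default values $\sigma_{w}=1$, $\sigma_{b}=0$, and the random inputs are illustrative choices, and the clipping step is a numerical safeguard that is not part of the formula.

```python
import numpy as np


def nngp_erf_gram(X, sigma_w=1.0, sigma_b=0.0):
    """Gram matrix of the Erf NNGP kernel (11) for samples X of shape (N, d0)."""
    X = np.asarray(X, dtype=float)
    d0 = X.shape[1]

    # Pre-activation kernel (10) for all pairs of samples.
    K1 = sigma_b**2 + (sigma_w**2 / d0) * (X @ X.T)

    # Normalizers sqrt(1 + 2 K(x, x)) appearing in the denominator of (11).
    norm = np.sqrt(1.0 + 2.0 * np.diag(K1))

    # Clipping guards against round-off pushing the argument outside [-1, 1].
    arg = np.clip(2.0 * K1 / np.outer(norm, norm), -1.0, 1.0)
    return (2.0 / np.pi) * np.arcsin(arg)


rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))  # hypothetical inputs: N = 6 samples, d0 = 3
Q = nngp_erf_gram(X)
```

If the rows of `X` are ordered by class, the resulting matrix has the block layout of (12), so each sub-matrix $\mathbf{Q}_{c,c'}$ can be read off by slicing.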

5.2 Limiting NTK

In the infinite width limit, we also analyze the NTK [28] to understand the effect of optimization on the NN’s features in the “lazy regime”. Specifically, a well-known result is that in the infinite width limit (with initialization as per Section 3), the deepest feature mapping of the NN is fixed during gradient descent optimization with a small enough learning rate, and is characterized by the NTK.

Formally, the recursive relationship between the NTK and NNGP [28, 31] can be given as follows:

$$
\Theta^{(2)}_{NTK}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})=K_{GP}^{(2)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})+K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})\,\dot{Q}^{(1)}_{GP}(\mathbf{x}^{c,i},\mathbf{x}^{c',j}). \tag{13}
$$

Here, $K_{GP}^{(2)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})$ can be defined using the recursive formulation [27, 28]:

$$
K_{GP}^{(2)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})=\sigma_{b}^{2}+\sigma_{w}^{2}Q_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j}). \tag{14}
$$

Similar to the activation-function-specific formulations of $Q^{(1)}_{GP}(\cdot,\cdot)$ in (11), we define the Erf-based derivative kernel $\dot{Q}^{(1)}_{GP-Erf}(\cdot,\cdot)$ as follows:

$$
\dot{Q}_{GP-Erf}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})=\frac{4}{\pi}\det\left(\begin{bmatrix}1+2K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c,i})&2K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})\\ 2K_{GP}^{(1)}(\mathbf{x}^{c',j},\mathbf{x}^{c,i})&1+2K_{GP}^{(1)}(\mathbf{x}^{c',j},\mathbf{x}^{c',j})\end{bmatrix}\right)^{-1/2}. \tag{15}
$$

The formulation for the ReLU-based kernel $\dot{Q}_{GP-ReLU}^{(1)}(\cdot,\cdot)$ is presented in Appendix B.
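The recursion in (13)-(15) can be sketched numerically for the Erf activation as follows. This is an illustrative sketch and not the authors' code: the function name `ntk_erf_gram`, the default variances, and the random inputs are assumptions, and the $2\times 2$ determinant in (15) is expanded by hand as $(1+2K_{ii})(1+2K_{jj})-(2K_{ij})^{2}$.

```python
import numpy as np


def ntk_erf_gram(X, sigma_w=1.0, sigma_b=0.0):
    """Gram matrix of the Erf NTK (13) for samples X of shape (N, d0)."""
    X = np.asarray(X, dtype=float)
    d0 = X.shape[1]

    # Pre-activation kernel (10).
    K1 = sigma_b**2 + (sigma_w**2 / d0) * (X @ X.T)
    diag = 1.0 + 2.0 * np.diag(K1)
    pair = np.outer(diag, diag)  # (1 + 2 K_ii)(1 + 2 K_jj) for every pair

    # Post-activation Erf kernel (11); clipping is a round-off safeguard.
    Q1 = (2.0 / np.pi) * np.arcsin(np.clip(2.0 * K1 / np.sqrt(pair), -1.0, 1.0))

    # Second-layer pre-activation kernel (14).
    K2 = sigma_b**2 + sigma_w**2 * Q1

    # Derivative kernel (15), using the expanded 2x2 determinant.
    det = pair - (2.0 * K1) ** 2
    Q1_dot = (4.0 / np.pi) / np.sqrt(det)

    # NTK recursion (13).
    return K2 + K1 * Q1_dot


rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))  # hypothetical inputs: N = 5 samples, d0 = 2
Theta = ntk_erf_gram(X)
```

Because the pre-activation Gram matrix is positive semi-definite, the Cauchy-Schwarz inequality gives $K_{ij}^{2}\le K_{ii}K_{jj}$, so the determinant is at least 1 in exact arithmetic and the inverse square root is well defined.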

Remark on the limiting NTK. Denote by $w,\phi$ the parameters and the activation function of an $L$-layer NN $\psi(\cdot)$. The limiting NNGP is defined directly on the product of neurons, $\mathbb{E}_{w}\langle\phi(z_{i}(\mathbf{x})),\phi(z_{i}(\tilde{\mathbf{x}}))\rangle$, and provides kernel expressions for the inner product of features at different layers of the network (via recursion similar to (11)). On the other hand, the limiting NTK is defined as the inner product of the output’s gradients $\mathbb{E}_{w}\langle\nabla_{w}\psi(\mathbf{x}),\nabla_{w}\psi(\tilde{\mathbf{x}})\rangle$. The NTK theory shows that this can model the inner product only of the deepest features (i.e., the output of the penultimate layer).¹

¹ Note that we denote the NNGP $(Q_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j}))$ and the NTK $(\Theta^{(2)}_{NTK}(\mathbf{x}^{c,i},\mathbf{x}^{c',j}))$ with different superscripts to be consistent with the literature. Yet both are associated with the output of the single hidden layer.

5.3 1-D Gaussian Data with 2 Classes

Notice that even for shallow NNs, it is challenging to theoretically analyze NNGP and NTK for general data. Therefore, we consider a simplified setting to analyze the NC1 properties of these kernels. Formally, consider a $1$-dimensional Gaussian dataset (i.e., $d_{0}=1$) with $C=2$ classes. The data points $\{x^{1,i}\},\forall i\in[n_{1}]$ belonging to class $c=1$ are independently sampled from $\mathcal{N}(\mu_{1},\sigma^{2}_{1})$ and have the labels $y^{1,i}=-1,\forall i\in[n_{1}]$. Similarly, the data points $\{x^{2,j}\},\forall j\in[n_{2}]$ belonging to class $c=2$ are independently sampled from $\mathcal{N}(\mu_{2},\sigma^{2}_{2})$ and have the labels $y^{2,j}=1,\forall j\in[n_{2}]$.

• Assumption 1: For $\mu_{1}<0,\mu_{2}>0$, let $\sigma_{1},\sigma_{2}>0$ be small enough such that $|\mu_{1}|\gg\sigma_{1}$, $|\mu_{2}|\gg\sigma_{2}$ and $\forall i\in[n_{1}],j\in[n_{2}],\; x^{1,i}x^{2,j}<0$ almost surely.

• Assumption 2: The dataset $\mathbf{X}\in\mathbb{R}^{N\times 1}$ consists of a large enough number of samples per class, $n_{1},n_{2}\gg 1$.

• Assumption 3: The 2L-FCN $\psi(\cdot)$ has output layer dimension $d_{2}=1$ and $\sigma_{b}\to 0$.

These assumptions present a scenario where the samples of the two classes are sufficiently far from the origin in opposite directions. Thus, the simplest prediction rule pertains to the sign of a sample.
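A dataset of this form is easy to simulate. The sketch below draws the two classes and checks the sign-separation condition of Assumption 1 on the drawn sample; the function name `sample_two_class_gaussian` and all numerical values (class sizes, means, standard deviations, seed) are hypothetical choices made only for illustration.

```python
import numpy as np


def sample_two_class_gaussian(n1, n2, mu1, mu2, sigma1, sigma2, seed=0):
    """Draw a 1-D, 2-class Gaussian dataset with labels -1 (class 1) and +1 (class 2)."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(mu1, sigma1, size=n1)
    x2 = rng.normal(mu2, sigma2, size=n2)
    X = np.concatenate([x1, x2])[:, None]  # shape (N, 1), i.e. d0 = 1
    y = np.concatenate([-np.ones(n1), np.ones(n2)])
    return X, y


# Means far from the origin relative to the standard deviations (Assumption 1),
# and many samples per class (Assumption 2).
X, y = sample_two_class_gaussian(500, 300, mu1=-3.0, mu2=2.0, sigma1=0.2, sigma2=0.1)

# Empirical check that every cross-class product x^{1,i} x^{2,j} is negative.
x1, x2 = X[y < 0, 0], X[y > 0, 0]
sign_separated = bool(np.all(np.outer(x1, x2) < 0))
```

With these values the class means lie many standard deviations from the origin, so the sign of a sample identifies its class, matching the prediction rule described above.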

Theorem 5.1 (ReLU Activation).

Under Assumptions 1-3, let $\phi(\cdot)$ be the ReLU activation. Denote by $\mathbf{H}_{GP},\mathbf{H}_{NTK}$ the features associated with the NNGP $Q_{GP}^{(1)}$ and the NTK $\Theta_{NTK}^{(2)}$, respectively. Then:

$$
\mathbb{E}\left[\mathcal{N}\mathcal{C}_{1}(\mathbf{H}_{GP})\right]=\mathbb{E}\left[\mathcal{N}\mathcal{C}_{1}(\mathbf{H}_{NTK})\right]=\frac{\sum_{c=1}^{2}\frac{n_{c}\mu_{c}^{2}+n_{c}\sigma_{c}^{2}}{N}-\frac{\mu_{c}^{2}}{2}}{\left(\sum_{c=1}^{2}\frac{\mu_{c}^{2}}{2}-\frac{n_{c}^{2}\mu_{c}^{2}}{N^{2}}\right)-\frac{2}{N^{2}}\prod_{c=1}^{2}n_{c}\mu_{c}}+\Delta_{h.o.t} \tag{16}
$$

where $\Delta_{h.o.t}$ is a term that vanishes as $\{n_{c}\}$ increase.

Appendix D presents the proof by calculating the expected values of $Q_{GP-ReLU}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})$ and employing Theorem 4.1. For a better understanding of the result, consider the balanced class scenario with $n_{1}=n_{2}=N/2$. This gives us: $\mathbb{E}\left[\mathcal{NC}_{1}(\mathbf{H}_{GP/NTK})\right]=2(\sigma_{1}^{2}+\sigma_{2}^{2})/(\mu_{1}-\mu_{2})^{2}+\Delta_{h.o.t}$, which intuitively captures the sum of the within-class variance of $\mathbf{X}$ in the numerator and the between-class variance in the denominator of the first term.
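The balanced-class reduction can be checked with exact rational arithmetic. The following is a minimal sketch that evaluates only the leading term of (16), ignoring $\Delta_{h.o.t}$; the function name `nc1_leading_term` and the numerical values of $n_c$, $\mu_c$, $\sigma_c^2$ are illustrative assumptions and are not taken from the paper's experiments.

```python
from fractions import Fraction

def nc1_leading_term(n, mu, var):
    """Leading term of (16) for C = 2 classes (Delta_{h.o.t} is ignored).

    n, mu, var: per-class sample counts n_c, means mu_c, variances sigma_c^2.
    """
    mu = [Fraction(m) for m in mu]
    var = [Fraction(v) for v in var]
    N = sum(n)
    # Numerator: sum_c [ (n_c mu_c^2 + n_c sigma_c^2) / N - mu_c^2 / 2 ]
    numerator = sum(
        Fraction(n_c) * (m * m + v) / N - m * m / 2
        for n_c, m, v in zip(n, mu, var)
    )
    # Denominator: sum_c [ mu_c^2 / 2 - n_c^2 mu_c^2 / N^2 ] - (2 / N^2) prod_c n_c mu_c
    denominator = sum(
        m * m / 2 - Fraction(n_c * n_c) * m * m / (N * N)
        for n_c, m in zip(n, mu)
    ) - Fraction(2, N * N) * (n[0] * mu[0]) * (n[1] * mu[1])
    return numerator / denominator

# Illustrative balanced example: mu = (1, 3), sigma^2 = (1/4, 1/2), n_1 = n_2 = 50.
value = nc1_leading_term([50, 50], [1, 3], [Fraction(1, 4), Fraction(1, 2)])
closed_form = 2 * (Fraction(1, 4) + Fraction(1, 2)) / Fraction(1 - 3) ** 2
print(value, closed_form)  # → 3/8 3/8
```

For these inputs the numerator evaluates to $3/8$ and the denominator to $1$, which agrees with $2(\sigma_1^2+\sigma_2^2)/(\mu_1-\mu_2)^2 = 1.5/4$.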

$\bullet$ Erf Activation. We present a similar analysis of the NNGP and NTK with Erf activation in Appendix E (as the terms involved in the formulation are relatively more complex than in the ReLU case). In the case of ReLU with balanced data, observe that the numerator (corresponding to $\boldsymbol{\Sigma}_{W}(\mathbf{H}_{GP/NTK})$) depends solely on $\sigma_{c}^{2}$. For Erf activation, however, our analysis shows a dependence on terms $\propto\sigma_{c}^{2}\mu_{c}^{-6}$, i.e., it also depends on inverses of higher powers of the class means $\mu_{c}$ (see (92) in Appendix E). A similar analysis for $\boldsymbol{\Sigma}_{B}(\mathbf{H}_{GP/NTK})$ with Erf in (100) shows a dependence on terms $\propto\sigma_{c}^{2}\mu_{c}^{-4}$. Importantly, under Assumptions 1-3, we show similar values of the expected NC1 metric for NNGP and NTK even for the Erf activation, and these values are smaller than in the ReLU case.

Remark. These results reveal the effect of the activation function on the NC1 metric when $d_{1}\to\infty$. Especially under Assumptions 1-3, the Erf-based kernels reflect a larger extent of ‘variability collapse’ (NC1) of the hidden layer post-activations of our 2L-FCN (both at initialization via NNGP and during training via NTK). Additionally, they indicate that the expected $\mathcal{NC}_{1}(\mathbf{H})$ of the NTK closely approximates the NNGP counterparts on our 1-D Gaussian dataset. Perhaps surprisingly, this shows that NTK does not represent more collapsed features than NNGP, despite being associated with NN gradient-based optimization. Namely, we have established another result that shows that training in the lazy regime provably deviates from the practical feature learning of NNs [41, 42, 43, 44, 45].

5.4 Experiments with High-Dimensional Gaussian Data

Setup: We conduct experiments on datasets with varying sample sizes and input dimensions to verify our theoretical results and show that insights generalize (e.g., beyond $d_{0}=1$). For $C=2$, a dataset size $N$ chosen from $\{128,256,512,1024\}$, and input dimension $d_{0}$ chosen from $\{1,2,8,32,128\}$, we create the data vector and label pairs as follows:

$$\begin{split}\mathcal{D}_{1}(N,d_{0})&=\left\{\left(\mathbf{x}^{1,i}\sim\mathcal{N}(-2*\mathbf{1}_{d_{0}},0.25*\mathbf{I}_{d_{0}}),\ y^{1,i}=-1\right),\ \forall i\in[N/2]\right\}\\ &\quad\cup\left\{\left(\mathbf{x}^{2,j}\sim\mathcal{N}(2*\mathbf{1}_{d_{0}},0.25*\mathbf{I}_{d_{0}}),\ y^{2,j}=1\right),\ \forall j\in[N/2]\right\}.\end{split} \tag{17}$$

The vectors and labels from the dataset can then be arranged into the matrix form (as described in the setup) for analysis. The sampling procedure is repeated $10$ times for each $(N,d_{0})$. (Footnote 2: The code is available at https://github.com/kvignesh1420/shallow_nc1)
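As an illustration of the sampling in (17), the sketch below draws one realization of $\mathcal{D}_1(N,d_0)$ with NumPy. It is not the authors' released code; the function name `sample_d1`, the seed, and the choice to store samples as columns of $\mathbf{X}\in\mathbb{R}^{d_0\times N}$ (so that $\mathbf{X}^\top\mathbf{X}$ is $N\times N$) are assumptions made here.

```python
import numpy as np

def sample_d1(N, d0, rng):
    """Draw one realization of D_1(N, d0); returns X (d0 x N) and y (N,)."""
    half = N // 2
    std = np.sqrt(0.25)  # covariance 0.25 * I means a per-coordinate std of 0.5
    x_neg = rng.normal(loc=-2.0, scale=std, size=(d0, half))  # class 1, label -1
    x_pos = rng.normal(loc=2.0, scale=std, size=(d0, half))   # class 2, label +1
    X = np.concatenate([x_neg, x_pos], axis=1)
    y = np.concatenate([-np.ones(half), np.ones(half)])
    return X, y

rng = np.random.default_rng(0)  # the seed is an arbitrary choice
X, y = sample_d1(N=128, d0=8, rng=rng)
print(X.shape, y.shape)
```

Repeating the call with fresh draws from the same generator would mimic the repeated sampling per $(N,d_0)$ described above.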

Observations: Figure 2 illustrates the mean and standard deviation (std) of $\log_{10}(\mathcal{NC}_{1}(\mathbf{H}))$ for the post-activation NNGP kernel $Q^{(1)}_{GP}$ and NTK $\Theta^{(2)}_{NTK}$ for Erf and ReLU activations. For the low-dimensional case of $d_{0}=1$ and Erf activation, observe from Figure 2(a) that $\mathcal{NC}_{1}(\mathbf{H})$ has a small value of $\approx 10^{-2.1}$. On the contrary, Figure 2(c) illustrates that for ReLU, $\mathcal{NC}_{1}(\mathbf{H})$ is more than an order of magnitude larger ($\approx 10^{-0.95}$) than the former. Furthermore, Figures 2(b) and 2(d) corresponding to NTK (with Erf and ReLU respectively) do not exhibit significantly different values from the NNGP counterparts. These observations empirically verify our theoretical results.

As $d_{0}$ increases, $\mathcal{NC}_{1}(\mathbf{H})$ increases at similar rates for $Q_{GP-Erf}$ and $\Theta_{NTK-Erf}$. With ReLU activation, $\mathcal{NC}_{1}(\mathbf{H})$ remains almost constant for $Q_{GP-ReLU}$ and exhibits an increasing trend for $\Theta_{NTK-ReLU}$, implying less collapse for NTK. All these observations corroborate our theory on the limitation of analyzing NC with NTK. We present additional experimental results for imbalanced datasets in Appendix H and show that for a given $N$, the trends of $\mathcal{NC}_{1}(\mathbf{H})$ for increasing $d_{0}$ can vary based on the imbalance ratio of classes (i.e., $n_{1}/n_{2}$). Nevertheless, the trends in $\mathcal{NC}_{1}(\mathbf{H})$ for NNGP with increasing $d_{0}$ resemble those of a trained 2L-FCN with a large hidden layer width $d_{1}=2000$ (see Figure 3(b) corresponding to the Erf activation). These observations provide zero-order reasoning for NN behavior where the feature mapping is learned based on data properties such as dimension.
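The qualitative trend for $Q_{GP-Erf}$ can be reproduced with a few lines of NumPy. The sketch below rests on several assumptions that are not confirmed by this section: it uses the arcsin form of the Erf kernel as it appears in (18) with $\sigma_w=\sigma_a=1$, it takes $\mathcal{NC}_1$ to be $\mathrm{tr}(\boldsymbol{\Sigma}_W)/\mathrm{tr}(\boldsymbol{\Sigma}_B)$ with the usual within- and between-class covariances (the paper's exact normalization in Theorem 4.1 may differ, though the usual conventions coincide for balanced classes), and it uses a single draw with $N=128$ rather than repeated sampling. The helper names are hypothetical.

```python
import numpy as np

def erf_nngp_gram(X, sigma_w=1.0, sigma_a=1.0):
    """Erf NNGP Gram matrix for samples stored as columns of X (d0 x N)."""
    d0 = X.shape[0]
    K = (sigma_w**2 / d0) * X.T @ X              # pre-activation kernel
    s = np.sqrt(1.0 + 2.0 * np.diag(K))
    ratio = np.clip(2.0 * K / np.outer(s, s), -1.0, 1.0)
    return sigma_a**2 * (2.0 / np.pi) * np.arcsin(ratio)

def nc1_from_gram(Q, labels):
    """tr(Sigma_W) / tr(Sigma_B) from a Gram matrix, assuming balanced classes."""
    N = Q.shape[0]
    class_mean_sq = []                           # squared norm of each class mean
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        class_mean_sq.append(Q[np.ix_(idx, idx)].mean())
    avg_mean_sq = np.mean(class_mean_sq)
    tr_w = np.trace(Q) / N - avg_mean_sq         # within-class variability
    tr_b = avg_mean_sq - Q.mean()                # between-class variability
    return tr_w / tr_b

rng = np.random.default_rng(0)
N = 128
labels = np.repeat([0, 1], N // 2)
means = np.where(labels == 0, -2.0, 2.0)         # class means -2 and +2 per coordinate
log_nc1 = {}
for d0 in (1, 2, 8, 32, 128):
    X = means + 0.5 * rng.standard_normal((d0, N))   # covariance 0.25 * I
    log_nc1[d0] = np.log10(nc1_from_gram(erf_nngp_gram(X), labels))
    print(d0, round(float(log_nc1[d0]), 2))
```

Working directly with the Gram matrix avoids forming infinite-width features: the squared norm of a class mean is the average of the corresponding kernel block, and the squared norm of the global mean is the average of all kernel entries.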

Figure 2: $\mathcal{NC}_{1}(\mathbf{H})$ of the post-activation NNGP kernel ($Q^{(1)}_{GP}$) and NTK ($\Theta^{(2)}_{NTK}$) for Erf and ReLU activations on dataset $\mathcal{D}_{1}(N,d_{0})$ (as per equation 17). Panels: (a) $Q_{GP-Erf}$, (b) $\Theta_{NTK-Erf}$, (c) $Q_{GP-ReLU}$, (d) $\Theta_{NTK-ReLU}$.

6 Activation Variability in the Feature Learning Regime

The explicit kernel formulations in the infinite width limit $(d_{1}\to\infty)$ have allowed us to go beyond the unconstrained features assumption and preserve a link between the features and the data. Yet, the NC phenomenon relates to NN training, while NNGP relates to NN at initialization and NTK has been found unsuitable for NC analysis. Thus, we wish to contrast NNGP with a kernel that is an alternative to NTK and takes into account both optimization and data. To this end, we transition to a large but finite width ($d_{1}\gg 1$) and large sample ($N\gg 1$) setting and analyze the recently introduced ‘adaptive kernels’ approach by Seroussi et al. [40] for fully connected networks.

6.1 Equations of State (EoS)

A transition from the infinite to the finite width regime can introduce various corrections to the pre- and post-activations of an $L$-layer FCN. In this context, Seroussi et al. [40] have observed the following dominant corrections: (1) the mean and covariance of the pre-activations deviate from those of a random FCN, and (2) the collective effect of activations from the $(l+1)^{th}$ and $(l-1)^{th}$ layers determines the covariance of activations in the $l^{th}$ layer. Based on these observations, they employ a Variational Gaussian Approximation (VGA) approach to propose the following system of equations for the pre- and post-activation kernels $K^{(l)}(\cdot,\cdot),Q^{(l)}(\cdot,\cdot),\ l\in[L]$, respectively. We formally define the EoS for a 2-layer FCN as follows (based on specializing the generic $L$-layer formulation in equation 5 in [40] to $L=2$, as done in equation 95 in their arXiv extended version):

Definition 6.1.

The “Equations of State” (EoS) for pre- and post-activation kernels of a $2$-layer FCN with Erf activation, no bias, and $d_{2}=1$ are given by:

$$\begin{split}\overline{\mathbf{f}}&=\mathbf{Q}^{(1)}[\sigma^{2}\mathbf{I}+\mathbf{Q}^{(1)}]^{-1}\mathbf{y}\\ [\mathbf{Q}^{(1)}]_{ij}&=\sigma_{a}^{2}\frac{2}{\pi}\arcsin\left(2K^{(1)}_{ij}\cdot\left(\sqrt{1+2K^{(1)}_{ii}}\sqrt{1+2K^{(1)}_{jj}}\right)^{-1}\right)\\ [\mathbf{C}^{-1}]_{ij}&=\frac{d_{0}}{\sigma_{w}^{2}}\delta_{ij}+\frac{1}{d_{1}}\mathrm{tr}\left\{\mathbf{A}^{(1)}\partial_{C_{ij}}\mathbf{Q}^{(1)}\right\}\\ \mathbf{A}^{(1)}&=-(\mathbf{y}-\overline{\mathbf{f}})(\mathbf{y}-\overline{\mathbf{f}})^{\top}\sigma^{-4}+[\mathbf{Q}^{(1)}+\sigma^{2}\mathbf{I}]^{-1}\end{split} \tag{18}$$

Here, $\mathbf{C}\in\mathbb{R}^{d_{0}\times d_{0}}$ models the statistical covariance of a row of $\mathbf{W}^{(1)}$, initialized with $(\sigma_{w}^{2}/d_{0})\mathbf{I}$, $\mathbf{K}^{(1)}=\mathbf{X}^{\top}\mathbf{C}\mathbf{X}\in\mathbb{R}^{N\times N}$, $\sigma>0$ is the regularization parameter, and $\overline{\mathbf{f}}\in\mathbb{R}^{N}$ corresponds to the prediction of the 2-layer FCN (governed by the EoS). Additionally, $\mathbf{K}^{(1)},\mathbf{Q}^{(1)}\in\mathbb{R}^{N\times N}$ are the kernel matrices associated with kernel functions $K^{(1)}(\cdot,\cdot),Q^{(1)}(\cdot,\cdot)$.
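To make the first and last lines of (18) concrete, the following sketch evaluates $\overline{\mathbf{f}}$ and $\mathbf{A}^{(1)}$ for a given kernel matrix. It is only an illustration: the function name is hypothetical, the kernel matrix is a toy positive semi-definite matrix rather than one obtained from the EoS, and $\sigma^2=0.1$ is an arbitrary choice.

```python
import numpy as np

def eos_prediction_and_A(Q, y, sigma2):
    """Evaluate f_bar = Q (sigma^2 I + Q)^{-1} y and A^{(1)} from (18)."""
    N = Q.shape[0]
    G = Q + sigma2 * np.eye(N)                   # regularized kernel matrix
    f_bar = Q @ np.linalg.solve(G, y)            # kernel ridge regression prediction
    resid = y - f_bar
    # sigma^{-4} equals 1 / (sigma^2)^2
    A = -np.outer(resid, resid) / sigma2**2 + np.linalg.inv(G)
    return f_bar, A

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 3))
Q = B @ B.T                                      # toy PSD stand-in for Q^{(1)}
y = np.array([-1.0, -1.0, -1.0, 1.0, 1.0, 1.0])
sigma2 = 0.1                                     # illustrative value of sigma^2
f_bar, A = eos_prediction_and_A(Q, y, sigma2)
print(np.round(f_bar, 3))
```

Since $\mathbf{y}-\overline{\mathbf{f}}=\sigma^{2}[\mathbf{Q}^{(1)}+\sigma^{2}\mathbf{I}]^{-1}\mathbf{y}$, the first term of $\mathbf{A}^{(1)}$ equals $-\mathbf{G}^{-1}\mathbf{y}\mathbf{y}^{\top}\mathbf{G}^{-1}$ with $\mathbf{G}=\mathbf{Q}^{(1)}+\sigma^{2}\mathbf{I}$, which makes the symmetry of $\mathbf{A}^{(1)}$ explicit.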

$\bullet$ Relationship with NNGP: At initialization, we set $\mathbf{C}=(\sigma_{w}^{2}/d_{0})\mathbf{I}$, which implies that $\mathbf{K}^{(1)}=(\sigma_{w}^{2}/d_{0})\mathbf{X}^{\top}\mathbf{X}$. The resulting $\mathbf{K}^{(1)}$ exactly matches the kernel matrix for the pre-activation GP kernel (10) $K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})=(\sigma_{w}^{2}/d_{0})\mathbf{x}^{c,i\top}\mathbf{x}^{c',j}$ as $\sigma_{b}\to 0$. Similarly, the matrix $\mathbf{Q}^{(1)}$ corresponds to the $Q_{GP-Erf}(\cdot,\cdot)$ kernel function defined in (11). The EoS provides a mechanism for transitioning from NNGP kernels to finite-width-based kernels that adapt to the data. Intuitively, observe that the predictions $\overline{\mathbf{f}}$ are formulated based on kernel ridge regression with $\mathbf{Q}^{(1)}$ and $\mathbf{y}$. The $\mathbf{Q}^{(1)}$ matrix along with $\mathbf{y}$ and the initial predictions $\overline{\mathbf{f}}$ are then used to update the weight covariance matrix $\mathbf{C}$. Notice that every entry $[\mathbf{C}^{-1}]_{ij}$ involves a trace operation on a matrix product, resulting in a weighted sum across entries of $\partial_{C_{ij}}\mathbf{Q}^{(1)}$ (i.e., the $N^{2}$ pairs of data samples).
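The correspondence between the initial $\mathbf{Q}^{(1)}$ and the Erf NNGP kernel can be sanity-checked by Monte Carlo. The sketch below compares the arcsin expression from (18) at $\mathbf{C}=(\sigma_w^2/d_0)\mathbf{I}$ against the empirical kernel of a wide random hidden layer. It assumes $\sigma_a=1$, uses arbitrary toy inputs instead of $\mathcal{D}_1$, and relies on the standard Gaussian-expectation identity for products of Erf units; all variable names are illustrative.

```python
import math
import numpy as np

erf = np.vectorize(math.erf)                     # element-wise erf without SciPy

rng = np.random.default_rng(0)
d0, N, d1 = 4, 6, 20000                          # toy sizes; d1 is the hidden width
sigma_w = 1.0
X = rng.standard_normal((d0, N))                 # columns are samples (toy inputs)

# Closed form at initialization: C = (sigma_w^2 / d0) I, so K = (sigma_w^2 / d0) X^T X.
K = (sigma_w**2 / d0) * X.T @ X
s = np.sqrt(1.0 + 2.0 * np.diag(K))
Q_closed = (2.0 / np.pi) * np.arcsin(2.0 * K / np.outer(s, s))   # sigma_a = 1

# Monte Carlo: rows of W^{(1)} drawn i.i.d. from N(0, C), features H = erf(W X).
W = rng.normal(scale=sigma_w / np.sqrt(d0), size=(d1, d0))
H = erf(W @ X)                                   # d1 x N post-activations
Q_mc = H.T @ H / d1                              # empirical post-activation kernel

max_gap = np.abs(Q_mc - Q_closed).max()
print(max_gap)
```

The gap shrinks roughly as $1/\sqrt{d_1}$, consistent with the NNGP kernel describing the infinite-width limit at initialization.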

$\bullet$ Finite width corrections of EoS on NC1: We solve the EoS (initialized with $\mathbf{C}=(\sigma_{w}^{2}/d_{0})\mathbf{I}_{d_{0}}$) and obtain the stable state using the Newton-Krylov method with an annealing schedule, as originally proposed by [40] (see Appendix G for details). At initialization, $\mathbf{Q}^{(1)}$ is exactly described by the limiting NNGP kernel matrix, which has been analyzed in the previous section. Now, by solving the EoS with the final annealing factors as $2000$ and $500$ (which correspond to a 2L-FCN with hidden layer widths $d_{1}=2000,500$ respectively), we illustrate the NC1 metrics of $\mathbf{Q}^{(1)}$ in Figure 3 for the running example of a balanced 2-class dataset $\mathcal{D}_{1}(N,d_{0})$ (17). Notice that for $d_{1}=2000$, the $\mathcal{NC}_{1}(\mathbf{H})$ values for $\mathbf{Q}^{(1)}$ in Figure 3(a) closely resemble the plots for the limiting NNGP kernel $Q_{GP-Erf}$ in Figure 2(a) and the NTK in Figure 2(b). However, we observe noticeable changes in the metrics when $d_{1}=500$. In particular, for $d_{0}\geq 8$ and $N\geq 512$, the $\mathcal{NC}_{1}(\mathbf{H})$ values for the EoS in Figure 3(c) exhibit a noticeable reduction compared to $Q_{GP-Erf}$ in Figure 2(a) and $\Theta_{NTK-Erf}$ in Figure 2(b). Based on this ‘kernel vs. kernel’ analysis, the reduction in $\mathcal{NC}_{1}(\mathbf{H})$ values for EoS reflects the departure of $\mathbf{Q}^{(1)}$ from the initial NNGP state to a feature learning state (based on finite widths).

Figure 3: $\mathcal{NC}_{1}(\mathbf{H})$ of the $\mathbf{Q}^{(1)}$ kernel obtained by solving the EoS (a), (c) and 2L-FCN (b), (d) for the same $d_{1}$ on dataset $\mathcal{D}_{1}(N,d_{0})$ (as per equation 17). Panels: (a) EoS $d_{1}=2000$, (b) 2L-FCN $d_{1}=2000$, (c) EoS $d_{1}=500$, (d) 2L-FCN $d_{1}=500$.

6.2 Activation Variability with Adaptive Kernels (EoS) and 2L-FCN

Setup: We train a 2L-FCN with $d_{1}=500,\sigma_{w}=1,\sigma_{b}=0$ and Erf activation using (vanilla) Gradient Descent with a learning rate of $10^{-3}$ and weight-decay $10^{-6}$ for $1000$ epochs on datasets described below. For EoS, we employ the same setup described above, with the final annealing factor of $d_{1}=500$ and $\sigma_{a}^{2}=1/128$ (as per the critical scaling value [40]). Similar to the formulation of $\mathcal{D}_{1}(N,d_{0})$ for $C=2$, we formulate $\mathcal{D}_{2}(N,d_{0})$ for $C=4$ as follows:

$$
\begin{split}
\mathcal{D}_2(N, d_0) &= \left\{ \left(\mathbf{x}^{1,i} \sim \mathcal{N}(-6 * \mathbf{1}_{d_0}, 0.25 * \mathbf{I}_{d_0}),\ y^{1,i} = -3\right), \forall i \in [N/4] \right\} \\
&\quad \cup \left\{ \left(\mathbf{x}^{2,j} \sim \mathcal{N}(-2 * \mathbf{1}_{d_0}, 0.25 * \mathbf{I}_{d_0}),\ y^{2,j} = -1\right), \forall j \in [N/4] \right\} \\
&\quad \cup \left\{ \left(\mathbf{x}^{3,k} \sim \mathcal{N}(2 * \mathbf{1}_{d_0}, 0.25 * \mathbf{I}_{d_0}),\ y^{3,k} = 1\right), \forall k \in [N/4] \right\} \\
&\quad \cup \left\{ \left(\mathbf{x}^{4,l} \sim \mathcal{N}(6 * \mathbf{1}_{d_0}, 0.25 * \mathbf{I}_{d_0}),\ y^{4,l} = 3\right), \forall l \in [N/4] \right\}.
\end{split}
\tag{19}
$$
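The construction in equation (19) can be sketched in a few lines of NumPy. This is only an illustrative sampler, not the authors' code: the function name `sample_d2` and the `seed` argument are hypothetical, and the samples are stacked as rows of `X`.

```python
import numpy as np

def sample_d2(N, d0, seed=0):
    """Draw a balanced 4-class Gaussian dataset following equation (19).

    Returns X with shape (N, d0) and scalar labels y with shape (N,).
    """
    if N % 4 != 0:
        raise ValueError("N must be divisible by 4 for balanced classes")
    rng = np.random.default_rng(seed)
    means = [-6.0, -2.0, 2.0, 6.0]    # class mean is mean * 1_{d0}
    labels = [-3.0, -1.0, 1.0, 3.0]   # scalar targets from equation (19)
    std = np.sqrt(0.25)               # covariance is 0.25 * I_{d0}
    n_c = N // 4
    X = np.concatenate(
        [m + std * rng.standard_normal((n_c, d0)) for m in means]
    )
    y = np.repeat(labels, n_c)        # same class order as the rows of X
    return X, y

X, y = sample_d2(N=1024, d0=8)
```

Each class contributes $N/4$ rows, so the per-class sample means of `X` should sit close to $-6, -2, 2, 6$ in every coordinate.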

• $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ of EoS provides a good approximation of FCN. Let us consider the running example with the balanced 2-class dataset $\mathcal{D}_1(N, d_0)$ for varying $N, d_0$ as per (17). The EoS primarily aims to capture the finite-width corrections (as discussed above) depending on the scaling of $N, d_0$, and $d_1$. To this end, observe from Figure 3(c) that for $d_1 = 500$, the trends of $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ for EoS vary depending on the scale of $N, d_0$. Figure 3(d) illustrates the actual 2L-FCN behavior. For $d_0 = \{1, 2\}$, although $N$ is scaled to larger values, the $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ of EoS takes similar values and resembles the 2L-FCN case. However, as the input dimension increases to $d_0 = \{8, 32\}$, the EoS at the larger value of $N = 1024$ tends to deviate from the 2L-FCN behavior. Similar deviations were observed even after choosing higher/lower learning rates for 2L-FCN (i.e., $5 * 10^{-3}, 2 * 10^{-3}, 5 * 10^{-4}, 10^{-4}$) and weight-decays (e.g., $10^{-5}, 10^{-4}$). Nonetheless, for the higher dimension of $d_0 = 128$, notice that the larger values of $N = \{512, 1024\}$ are required for the EoS to match the 2L-FCN. Additionally, by training deeper FCNs with $L = \{3, 4, 5, 6\}$ layers and hidden layer widths $500$ on $\mathcal{D}_1(N, d_0)$, we observed that the EoS trends (which correspond to 2L-FCN) can also be used to estimate the trends of $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ reduction in the penultimate layers of these networks (see Figure 14 in Appendix H).

The trends in $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ for EoS and 2L-FCN hold even for datasets $\mathcal{D}_2(N, d_0)$ with $C > 2$. Observe from Figure 4 that the $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ values for the kernels $Q_{GP-Erf}, \Theta_{NTK-Erf}$, EoS and for 2L-FCN are consistently lower than in the $C = 2$ case shown in Figures 2(a), 2(b), 3(c), 3(d), even when comparing them at $d_0 = 1$. We believe this is because $\mathcal{D}_2(N, d_0)$ is constructed by adding $2$ new classes to $\mathcal{D}_1(N, d_0)$ whose distance between the means is much larger than the within-class covariances.

• On the effects of class imbalances: One of the key conditions for the EoS to approximate the $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ behavior of a 2L-FCN is the presence of sufficient data samples [40] (depending on the scale of the input dimension $d_0$ as shown above). Thus, extreme class imbalances can lead to biased $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ trends in EoS. Consider a collection of imbalanced datasets based on $\mathcal{D}_1(N, d_0)$ where $N = 2048$ is split into two classes as follows: Case 1: $(768, 1280)$, Case 2: $(512, 1536)$, Case 3: $(256, 1792)$. We observe that for Case 3, where the imbalance ratio is relatively large, the $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ values in EoS are slightly larger than those of the 2L-FCN (Figure 9 in Appendix H). On the other hand, the imbalance ratios in Case 1 and Case 2 lead to good approximations of the 2L-FCN. A similar observation can be made for the dataset $\mathcal{D}_2(N, d_0)$ with $C = 4$ and $N = 1024$ from Figure 10.
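To make the three splits concrete, the short sketch below computes their imbalance ratios. It assumes that "imbalance ratio" means majority-class size divided by minority-class size; the text does not define the term, so this reading is an assumption.

```python
# Class sizes (minority, majority) for the three imbalanced splits of N = 2048.
splits = {
    "Case 1": (768, 1280),
    "Case 2": (512, 1536),
    "Case 3": (256, 1792),
}

ratios = {}
for name, (n_minor, n_major) in splits.items():
    assert n_minor + n_major == 2048      # every split uses all N samples
    ratios[name] = n_major / n_minor      # assumed definition of the ratio
    share = n_minor / 2048                # fraction of samples in the minority class
    print(f"{name}: ratio = {ratios[name]:.2f}, minority share = {share:.3f}")
```

Under this reading the ratios are roughly 1.67, 3, and 7, so Case 3 leaves the minority class with only one eighth of the samples.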

• Implications: Our analysis showcases the dependence of $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ on the activation functions and indicates that an increase in the data complexity by increasing the dimension of the data typically leads to larger $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ values. Furthermore, the $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ values depend heavily on the relative positions of the data points. In particular, when the means $\{\mu_c\}$ are well separated with smaller $\{\sigma_c\}$, even a large number of classes can lead to smaller $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ values. These observations explain the empirical results by Papyan et al. [3] (Figure 6) where extensive experiments on complex datasets (e.g., ImageNet) led to relatively less collapse (i.e., larger $\mathcal{N}\mathcal{C}_1(\mathbf{H})$) than simpler datasets such as MNIST. We present a broader discussion on the limitations of our work and future efforts in Appendix J, and a discussion on formulating NC1 metrics that consider the structure of the data in Appendix F.
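The role of mean separation can be illustrated with a toy computation. The sketch below is a simplification and not the paper's experiment: it assumes the metric takes the form $\mathrm{tr}(\boldsymbol{\Sigma}_W)/\mathrm{tr}(\boldsymbol{\Sigma}_B)$ and applies it directly to raw Gaussian inputs (an identity feature map) rather than to kernel or network features. The helper names `nc1` and `gaussian_classes`, and the two-class means $\pm 0.5$, are hypothetical.

```python
import numpy as np

def nc1(H, y):
    """tr(Sigma_W) / tr(Sigma_B) for features H of shape (N, d) and labels y."""
    classes = np.unique(y)
    class_means = np.stack([H[y == c].mean(axis=0) for c in classes])
    global_mean = class_means.mean(axis=0)
    # Within-class scatter, averaged over all N samples.
    tr_w = sum(
        ((H[y == c] - class_means[k]) ** 2).sum() for k, c in enumerate(classes)
    ) / len(H)
    # Between-class scatter, averaged over the C classes.
    tr_b = ((class_means - global_mean) ** 2).sum() / len(classes)
    return tr_w / tr_b

def gaussian_classes(means, sigma, n_c, d0, rng):
    """Balanced isotropic Gaussian classes with means m * 1_{d0}."""
    X = np.concatenate([m + sigma * rng.standard_normal((n_c, d0)) for m in means])
    y = np.repeat(np.arange(len(means)), n_c)
    return X, y

rng = np.random.default_rng(0)
# Two classes whose separation is comparable to the within-class spread.
X_close, y_close = gaussian_classes([-0.5, 0.5], 0.5, 512, 8, rng)
# Four well-separated classes, with the means and variance of equation (19).
X_far, y_far = gaussian_classes([-6.0, -2.0, 2.0, 6.0], 0.5, 256, 8, rng)

nc1_close = nc1(X_close, y_close)
nc1_far = nc1(X_far, y_far)
print(nc1_close, nc1_far)
```

In this toy setting the four-class dataset yields a much smaller ratio than the two-class one, because the between-class trace grows with the squared distance between the means while the within-class trace stays fixed by $\sigma$.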

Figure 4: $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ of the limiting kernels, adaptive kernel (EoS) with final annealing factor $500$ and 2L-FCN with $d_1 = 500$ and Erf activation on dataset $\mathcal{D}_2(N, d_0)$ (as per equation 19). Panels: (a) $Q_{GP-Erf}$, (b) $\Theta_{NTK-Erf}$, (c) EoS, (d) 2L-FCN.

7 Conclusion

In this paper, we presented a kernel-based approach to understanding the role of data in the emergence of the Neural Collapse (NC) phenomenon. By considering a general kernel function, we first formulated the trace expressions for the variability collapse (NC1) of the features of the data samples. By leveraging these results, we provided theoretical and empirical results to showcase that the NTK does not represent more collapsed features than the NNGP for various Gaussian datasets. Next, to capture the feature-learning aspects of finite-width neural networks, we switched to an "adaptive kernel" approach whose state equations (EoS) facilitate the transition of the post-activation kernel beyond the GP limit. Through this "kernel vs. kernel" approach for limiting NNGP, NTK, and adaptive kernels, we showcased a promising direction to analyze the properties of data for which the NC1 behavior of an actual FCN can be understood, thus addressing the limitations of the unconstrained-features-based analysis of NC and explaining the empirical observations of [3] on datasets of varying complexity. We believe that future work on analyzing the EoS for multi-layer FCN and convolutional networks [40] can provide further insights into the depthwise reduction of NC1 for deeper networks and provide a framework for analyzing datasets with more complex distributions.

Acknowledgments and Disclosure of Funding

The authors would like to thank Zhengdao Chen for helpful discussions during the preparation of this manuscript. The work of Tom Tirer is supported by the ISF grant No. 1940/23.

References

  • Hoffer et al. [2017] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. Advances in neural information processing systems, 30, 2017.
  • Ma et al. [2018] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning. In International Conference on Machine Learning, pages 3325–3334. PMLR, 2018.
  • Papyan et al. [2020] Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020.
  • Han et al. [2022] XY Han, Vardan Papyan, and David L Donoho. Neural collapse under mse loss: Proximity to and dynamics on the central path. In International Conference on Learning Representations, 2022.
  • Kothapalli [2023] Vignesh Kothapalli. Neural collapse: A review on modelling principles and generalization. Transactions on Machine Learning Research, 2023.
  • Fang et al. [2021] Cong Fang, Hangfeng He, Qi Long, and Weijie J Su. Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training. Proceedings of the National Academy of Sciences, 118(43):e2103091118, 2021.
  • Thrampoulidis et al. [2022] Christos Thrampoulidis, Ganesh Ramachandra Kini, Vala Vakilian, and Tina Behnia. Imbalance trouble: Revisiting neural-collapse geometry. Advances in Neural Information Processing Systems, 35:27225–27238, 2022.
  • Tirer and Bruna [2022] Tom Tirer and Joan Bruna. Extended unconstrained features model for exploring deep neural collapse. In International Conference on Machine Learning, pages 21478–21505. PMLR, 2022.
  • Rangamani et al. [2023] Akshay Rangamani, Marius Lindegaard, Tomer Galanti, and Tomaso A Poggio. Feature learning in deep classifiers through intermediate neural collapse. In International Conference on Machine Learning, pages 28729–28745. PMLR, 2023.
  • Súkeník et al. [2023] Peter Súkeník, Marco Mondelli, and Christoph H Lampert. Deep neural collapse is provably optimal for the deep unconstrained features model. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • He and Su [2023] Hangfeng He and Weijie J Su. A law of data separation in deep learning. Proceedings of the National Academy of Sciences, 120(36):e2221704120, 2023.
  • Tirer et al. [2023] Tom Tirer, Haoxiang Huang, and Jonathan Niles-Weed. Perturbation analysis of neural collapse. In International Conference on Machine Learning, pages 34301–34329. PMLR, 2023.
  • Yang et al. [2023] Yongyi Yang, Jacob Steinhardt, and Wei Hu. Are neurons actually collapsed? on the fine-grained structure in neural representations. In International Conference on Machine Learning, pages 39453–39487. PMLR, 2023.
  • Kothapalli et al. [2023] Vignesh Kothapalli, Tom Tirer, and Joan Bruna. A neural collapse perspective on feature evolution in graph neural networks. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • Zhu et al. [2021] Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu. A geometric analysis of neural collapse with unconstrained features. Advances in Neural Information Processing Systems, 34:29820–29834, 2021.
  • Galanti et al. [2022] Tomer Galanti, András György, and Marcus Hutter. On the role of neural collapse in transfer learning. In International Conference on Learning Representations, 2022.
  • Yang et al. [2022] Yibo Yang, Shixiang Chen, Xiangtai Li, Liang Xie, Zhouchen Lin, and Dacheng Tao. Inducing neural collapse in imbalanced learning: Do we really need a learnable classifier at the end of deep neural network? Advances in Neural Information Processing Systems, 35:37991–38002, 2022.
  • Xu and Liu [2023] Jing Xu and Haoxiong Liu. Quantifying the variability collapse of neural networks. In International Conference on Machine Learning, pages 38535–38550. PMLR, 2023.
  • Mixon et al. [2020] Dustin G Mixon, Hans Parshall, and Jianzong Pi. Neural collapse with unconstrained features. arXiv preprint arXiv:2011.11619, 2020.
  • Ji et al. [2022] Wenlong Ji, Yiping Lu, Yiliang Zhang, Zhun Deng, and Weijie J Su. An unconstrained layer-peeled perspective on neural collapse. In International Conference on Learning Representations, 2022.
  • Zhou et al. [2022] Jinxin Zhou, Xiao Li, Tianyu Ding, Chong You, Qing Qu, and Zhihui Zhu. On the optimization landscape of neural collapse under mse loss: Global optimality with unconstrained features. In International Conference on Machine Learning, pages 27179–27202. PMLR, 2022.
  • Yaras et al. [2022] Can Yaras, Peng Wang, Zhihui Zhu, Laura Balzano, and Qing Qu. Neural collapse with normalized features: A geometric analysis over the riemannian manifold. Advances in neural information processing systems, 35:11547–11560, 2022.
  • Dang et al. [2023] Hien Dang, Tan Minh Nguyen, Tho Tran, Hung The Tran, Hung Tran, and Nhat Ho. Neural collapse in deep linear networks: From balanced to imbalanced data. In International Conference on Machine Learning, 2023.
  • Wojtowytsch et al. [2020] Stephan Wojtowytsch et al. On the emergence of simplex symmetry in the final and penultimate layers of neural network classifiers. arXiv preprint arXiv:2012.05420, 2020.
  • Schölkopf et al. [2002] Bernhard Schölkopf, Alexander J Smola, Francis Bach, et al. Learning with kernels: support vector machines, regularization, optimization, and beyond. 2002.
  • Neal [1995] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 1995.
  • Lee et al. [2018] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018.
  • Jacot et al. [2018] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
  • Chizat et al. [2019] Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in neural information processing systems, 32, 2019.
  • Arora et al. [2019] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Russ R Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, pages 8141–8150, 2019.
  • Lee et al. [2019] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in neural information processing systems, 32, 2019.
  • Matthews et al. [2018] Alexander G de G Matthews, Jiri Hron, Mark Rowland, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. In International Conference on Learning Representations, 2018.
  • Ronen et al. [2019] Basri Ronen, David Jacobs, Yoni Kasten, and Shira Kritchman. The convergence rate of neural networks for learned functions of different frequencies. Advances in Neural Information Processing Systems, 32, 2019.
  • Huang et al. [2020] Kaixuan Huang, Yuqing Wang, Molei Tao, and Tuo Zhao. Why do deep residual networks generalize better than deep feedforward networks?—a neural tangent kernel perspective. Advances in neural information processing systems, 33:2698–2709, 2020.
  • Tirer et al. [2022] Tom Tirer, Joan Bruna, and Raja Giryes. Kernel-based smoothness analysis of residual networks. In Mathematical and Scientific Machine Learning, pages 921–954. PMLR, 2022.
  • Barzilai et al. [2022] Daniel Barzilai, Amnon Geifman, Meirav Galun, and Ronen Basri. A kernel perspective of skip connections in convolutional networks. arXiv preprint arXiv:2211.14810, 2022.
  • Tancik et al. [2020] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in neural information processing systems, 33:7537–7547, 2020.
  • Hanin and Nica [2019] Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel. arXiv preprint arXiv:1909.05989, 2019.
  • Lee et al. [2020] Jaehoon Lee, Samuel Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, and Jascha Sohl-Dickstein. Finite versus infinite neural networks: an empirical study. Advances in Neural Information Processing Systems, 33:15156–15172, 2020.
  • Seroussi et al. [2023] Inbar Seroussi, Gadi Naveh, and Zohar Ringel. Separation of scales and a thermodynamic description of feature learning in some cnns. Nature Communications, 14(1):908, 2023.
  • Woodworth et al. [2020] Blake Woodworth, Suriya Gunasekar, Jason D Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. In Conference on Learning Theory, pages 3635–3673. PMLR, 2020.
  • Ghorbani et al. [2020] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. When do neural networks outperform kernel methods? Advances in Neural Information Processing Systems, 33:14820–14830, 2020.
  • Wei et al. [2019] Colin Wei, Jason D Lee, Qiang Liu, and Tengyu Ma. Regularization matters: Generalization and optimization of neural nets vs their induced kernel. Advances in Neural Information Processing Systems, 32, 2019.
  • Yehudai and Shamir [2019] Gilad Yehudai and Ohad Shamir. On the power and limitations of random features for understanding neural networks. Advances in Neural Information Processing Systems, 32, 2019.
  • Li et al. [2020] Yuanzhi Li, Tengyu Ma, and Hongyang R Zhang. Learning over-parametrized two-layer neural networks beyond ntk. In Conference on learning theory, pages 2613–2682. PMLR, 2020.
  • Wang et al. [2023] Peng Wang, Xiao Li, Can Yaras, Zhihui Zhu, Laura Balzano, Wei Hu, and Qing Qu. Understanding deep representation learning via layerwise feature compression and discrimination. arXiv preprint arXiv:2311.02960, 2023.
  • Seleznova et al. [2023] Mariia Seleznova, Dana Weitzner, Raja Giryes, Gitta Kutyniok, and Hung-Hsu Chou. Neural (tangent kernel) collapse. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • Williams [1996] Christopher Williams. Computing with infinite networks. Advances in neural information processing systems, 9, 1996.
  • Cho and Saul [2009] Youngmin Cho and Lawrence Saul. Kernel methods for deep learning. Advances in neural information processing systems, 22, 2009.
  • Rubin et al. [2024] Noa Rubin, Inbar Seroussi, and Zohar Ringel. Grokking as a first order phase transition in two layer networks. In The Twelfth International Conference on Learning Representations, 2024.
  • Hui and Belkin [2021] Like Hui and Mikhail Belkin. Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks. In The Ninth International Conference on Learning Representations (ICLR), 2021.
  • Yaras et al. [2023] Can Yaras, Peng Wang, Wei Hu, Zhihui Zhu, Laura Balzano, and Qing Qu. The law of parsimony in gradient descent for learning deep linear networks. arXiv preprint arXiv:2306.01154, 2023.
  • Seltman [2012] Howard Seltman. Approximations for mean and variance of a ratio. unpublished note, 2012.
  • Vershynin [2012] Roman Vershynin. How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, 25(3):655–686, 2012.

Appendix A Proof of Theorem 4.1

To obtain the NC1 formulation corresponding to an arbitrary feature matrix $\mathbf{H}$, we start with a simple relationship between $\widetilde{\boldsymbol{\Sigma}}_T(\mathbf{H}), \widetilde{\boldsymbol{\Sigma}}_B(\mathbf{H}), \boldsymbol{\Sigma}_W(\mathbf{H})$ as follows:

$$
\begin{split}
\widetilde{\boldsymbol{\Sigma}}_T(\mathbf{H}) &= \boldsymbol{\Sigma}_W(\mathbf{H}) + \widetilde{\boldsymbol{\Sigma}}_B(\mathbf{H}) \\
\implies \mathrm{tr}\left(\boldsymbol{\Sigma}_W(\mathbf{H})\right) &= \mathrm{tr}\left(\widetilde{\boldsymbol{\Sigma}}_T(\mathbf{H})\right) - \mathrm{tr}\left(\widetilde{\boldsymbol{\Sigma}}_B(\mathbf{H})\right).
\end{split}
\tag{20}
$$

Similarly, by considering $\boldsymbol{\Sigma}_G(\mathbf{H}) = \overline{\mathbf{h}}^{G}\overline{\mathbf{h}}^{G\top}$, we get:

$$
\begin{split}
\boldsymbol{\Sigma}_B(\mathbf{H}) &= \widetilde{\boldsymbol{\Sigma}}_B(\mathbf{H}) - \boldsymbol{\Sigma}_G(\mathbf{H}) \\
\implies \mathrm{tr}\left(\boldsymbol{\Sigma}_B(\mathbf{H})\right) &= \mathrm{tr}\left(\widetilde{\boldsymbol{\Sigma}}_B(\mathbf{H})\right) - \mathrm{tr}\left(\boldsymbol{\Sigma}_G(\mathbf{H})\right).
\end{split}
\tag{21}
$$
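The two identities (20) and (21) can be checked numerically. The sketch below assumes balanced classes ($n_c = N/C$) and the usual definitions: $\boldsymbol{\Sigma}_W$ averages centered outer products over all $N$ samples, the between-class matrices average over the $C$ class means, and $\overline{\mathbf{h}}^{G}$ is the mean of the class means. These definitions come from the main text and are not restated here, so treat them as assumptions; the sizes and random features are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
C, n_c, d = 3, 50, 16                  # hypothetical sizes, balanced classes
N = C * n_c
# H[c, i] plays the role of the feature vector h^{c,i}; each class gets its own offset.
H = rng.standard_normal((C, n_c, d)) + rng.standard_normal((C, 1, d))

class_means = H.mean(axis=1)           # \bar{h}^c, shape (C, d)
global_mean = class_means.mean(axis=0) # \bar{h}^G, shape (d,)

flat = H.reshape(N, d)
sigma_T_tilde = flat.T @ flat / N                  # non-centered total covariance
sigma_B_tilde = class_means.T @ class_means / C    # non-centered between-class covariance
centered = (H - class_means[:, None, :]).reshape(N, d)
sigma_W = centered.T @ centered / N                # within-class covariance
sigma_G = np.outer(global_mean, global_mean)       # outer product of the global mean
diffs = class_means - global_mean
sigma_B = diffs.T @ diffs / C                      # centered between-class covariance

assert np.allclose(sigma_T_tilde, sigma_W + sigma_B_tilde)  # equation (20)
assert np.allclose(sigma_B, sigma_B_tilde - sigma_G)        # equation (21)
```

Since the trace is linear, the matrix identities imply the trace identities on the second lines of (20) and (21). With unequal class sizes the $1/C$ weighting in $\widetilde{\boldsymbol{\Sigma}}_B$ would no longer match the $n_c/N$ weights that arise from the total covariance, which is why the balanced assumption is stated up front.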

• Formulating $\mathrm{tr}\left(\widetilde{\boldsymbol{\Sigma}}_T(\mathbf{H})\right)$: Expanding $\widetilde{\boldsymbol{\Sigma}}_T(\mathbf{H})$ into individual outer-products of vectors and leveraging the trace properties leads to the following:

$$
\begin{aligned}
\mathrm{tr}\left(\widetilde{\boldsymbol{\Sigma}}_T(\mathbf{H})\right) &= \mathrm{tr}\left(\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_c}\mathbf{h}^{c,i}\mathbf{h}^{c,i\top}\right) = \frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_c}\mathrm{tr}\left(\mathbf{h}^{c,i}\mathbf{h}^{c,i\top}\right) \\
&= \frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_c}\mathrm{tr}\left(\mathbf{h}^{c,i\top}\mathbf{h}^{c,i}\right) \\
&= \frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_c}Q(\mathbf{x}^{c,i},\mathbf{x}^{c,i})
\end{aligned}
$$

• Formulating $\mathrm{tr}\left(\widetilde{\boldsymbol{\Sigma}}_{B}(\mathbf{H})\right)$: Similar to the above analysis, we can reformulate the trace of the non-centered between-class covariance matrix $\widetilde{\boldsymbol{\Sigma}}_{B}(\mathbf{H})$ as:

$$
\begin{aligned}
\mathrm{tr}(\widetilde{\boldsymbol{\Sigma}}_{B})
&= \mathrm{tr}\left(\frac{1}{C}\sum_{c=1}^{C}\overline{\mathbf{h}}^{c}\overline{\mathbf{h}}^{c\top}\right)
= \frac{1}{C}\sum_{c=1}^{C}\mathrm{tr}\left(\overline{\mathbf{h}}^{c}\overline{\mathbf{h}}^{c\top}\right)
= \frac{1}{C}\sum_{c=1}^{C}\mathrm{tr}\left(\overline{\mathbf{h}}^{c\top}\overline{\mathbf{h}}^{c}\right) \\
&= \frac{1}{C}\sum_{c=1}^{C}\mathrm{tr}\left(\left[\frac{1}{n_{c}}\sum_{i=1}^{n_{c}}\mathbf{h}^{c,i}\right]^{\top}\left[\frac{1}{n_{c}}\sum_{i=1}^{n_{c}}\mathbf{h}^{c,i}\right]\right) \\
&= \frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_{c}^{2}}\mathrm{tr}\left(\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c}}\mathbf{h}^{c,i\top}\mathbf{h}^{c,j}\right)
= \frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_{c}^{2}}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c}}\mathrm{tr}\left(\mathbf{h}^{c,i\top}\mathbf{h}^{c,j}\right) \\
&= \frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_{c}^{2}}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c,j})
\end{aligned}
$$

• Formulating $\mathrm{tr}\left(\boldsymbol{\Sigma}_{G}(\mathbf{H})\right)$: Reformulation of $\mathrm{tr}\left(\boldsymbol{\Sigma}_{G}(\mathbf{H})\right)$ can be approached along the same lines:

$$
\begin{aligned}
\mathrm{tr}\left(\boldsymbol{\Sigma}_{G}(\mathbf{H})\right)
&= \mathrm{tr}\left(\overline{\mathbf{h}}^{G}\overline{\mathbf{h}}^{G\top}\right)
= \mathrm{tr}\left(\overline{\mathbf{h}}^{G\top}\overline{\mathbf{h}}^{G}\right) \\
&= \mathrm{tr}\left(\left[\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}\mathbf{h}^{c,i}\right]^{\top}\left[\frac{1}{N}\sum_{c=1}^{C}\sum_{j=1}^{n_{c}}\mathbf{h}^{c,j}\right]\right) \\
&= \frac{1}{N^{2}}\mathrm{tr}\left(\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}\sum_{c'=1}^{C}\sum_{j=1}^{n_{c'}}\mathbf{h}^{c,i\top}\mathbf{h}^{c',j}\right)
= \frac{1}{N^{2}}\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}\sum_{c'=1}^{C}\sum_{j=1}^{n_{c'}}\mathrm{tr}\left(\mathbf{h}^{c,i\top}\mathbf{h}^{c',j}\right) \\
&= \frac{1}{N^{2}}\sum_{c=1}^{C}\sum_{c'=1}^{C}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c'}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c',j})
\end{aligned}
$$

By using these intermediate results, we can formulate $\mathrm{tr}\left(\boldsymbol{\Sigma}_{W}(\mathbf{H})\right)$ and $\mathrm{tr}\left(\boldsymbol{\Sigma}_{B}(\mathbf{H})\right)$ as:

$$
\begin{aligned}
\mathrm{tr}(\boldsymbol{\Sigma}_{W}(\mathbf{H}))
&= \mathrm{tr}(\widetilde{\boldsymbol{\Sigma}}_{T}(\mathbf{H})) - \mathrm{tr}(\widetilde{\boldsymbol{\Sigma}}_{B}(\mathbf{H})) \\
&= \frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c,i}) - \frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_{c}^{2}}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c,j}) \\
\mathrm{tr}(\boldsymbol{\Sigma}_{B}(\mathbf{H}))
&= \mathrm{tr}(\widetilde{\boldsymbol{\Sigma}}_{B}(\mathbf{H})) - \mathrm{tr}(\boldsymbol{\Sigma}_{G}(\mathbf{H})) \\
&= \frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_{c}^{2}}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c,j}) - \frac{1}{N^{2}}\sum_{c=1}^{C}\sum_{c'=1}^{C}\sum_{i=1}^{n_{c}}\sum_{j=1}^{n_{c'}}Q(\mathbf{x}^{c,i},\mathbf{x}^{c',j}).
\end{aligned}
$$

This proves the theorem.
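As a numerical sanity check of the trace expressions above, the following minimal NumPy sketch evaluates them from a Gram matrix and compares them with traces computed directly from explicit features. It rests on assumptions not stated in this appendix: the function names are illustrative, the kernel is taken to be the linear one $Q(\mathbf{x}_a,\mathbf{x}_b)=\mathbf{h}_a^{\top}\mathbf{h}_b$, the classes are balanced, and $\boldsymbol{\Sigma}_W$, $\boldsymbol{\Sigma}_B$ are taken to be the usual within-class covariance (normalized by $N$) and between-class covariance (normalized by $C$, centered at the global mean).

```python
import numpy as np

def nc1_traces_from_kernel(Q, labels):
    """Return (tr Sigma_W, tr Sigma_B) using only the Gram matrix Q[a, b] = Q(x_a, x_b)."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    N, C = len(labels), len(classes)
    tr_T = np.trace(Q) / N                       # (1/N) sum_c sum_i Q(x^{c,i}, x^{c,i})
    tr_B_tilde = 0.0
    for c in classes:
        idx = np.flatnonzero(labels == c)
        block = Q[np.ix_(idx, idx)]              # within-class block of the Gram matrix
        tr_B_tilde += block.sum() / len(idx) ** 2
    tr_B_tilde /= C                              # (1/C) sum_c (1/n_c^2) sum_{i,j} Q
    tr_G = Q.sum() / N ** 2                      # (1/N^2) sum over all pairs
    return tr_T - tr_B_tilde, tr_B_tilde - tr_G

def nc1_traces_from_features(H, labels):
    """Same traces computed directly from features H of shape (N, d); assumed definitions."""
    labels = np.asarray(labels)
    classes = np.unique(labels)
    N, C = H.shape[0], len(classes)
    h_G = H.mean(axis=0)                         # global mean
    tr_W, tr_B = 0.0, 0.0
    for c in classes:
        Hc = H[labels == c]
        mu = Hc.mean(axis=0)                     # class mean
        tr_W += np.sum((Hc - mu) ** 2)
        tr_B += np.sum((mu - h_G) ** 2)
    return tr_W / N, tr_B / C

rng = np.random.default_rng(0)
C, n, d = 3, 5, 7                                # balanced classes (assumption)
labels = np.repeat(np.arange(C), n)
H = rng.normal(size=(C * n, d)) + labels[:, None]
Q = H @ H.T                                      # linear kernel induced by the features
kernel_traces = nc1_traces_from_kernel(Q, labels)
feature_traces = nc1_traces_from_features(H, labels)
```

Under these assumptions the two pairs of traces agree up to floating-point error. With unbalanced classes the agreement depends on how the covariance matrices are normalized, which is why the sketch restricts itself to the balanced case.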

Appendix B Limiting NNGP and NTK for ReLU

Consider the GP limit characterization of the pre-activation kernel $K^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})$ as follows:

$$
K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j}) = \sigma_{b}^{2} + \frac{\sigma_{w}^{2}}{d_{0}}\mathbf{x}^{c,i\top}\mathbf{x}^{c',j}.
\tag{22}
$$

Observe that $K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})$ is independent of the activation function. Now, the closed-form representation of the post-activation NNGP kernel $Q^{(1)}_{GP}(\cdot,\cdot)$ for the ReLU activation is given by:

$$
\begin{aligned}
Q_{GP-ReLU}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j}) &= \frac{\tau(\mathbf{x}^{c,i},\mathbf{x}^{c',j})}{2\pi}\sqrt{K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c,i})\,K_{GP}^{(1)}(\mathbf{x}^{c',j},\mathbf{x}^{c',j})}, \\
\tau(\mathbf{x}^{c,i},\mathbf{x}^{c',j}) &= \sin\theta_{c,i}^{c',j} + \left(\pi - \theta_{c,i}^{c',j}\right)\cos\theta_{c,i}^{c',j}, \\
\theta_{c,i}^{c',j} &= \arccos\left(\frac{K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})}{\sqrt{K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c,i})\,K_{GP}^{(1)}(\mathbf{x}^{c',j},\mathbf{x}^{c',j})}}\right).
\end{aligned}
\tag{23}
$$
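The closed form in (22) and (23) can be sketched in NumPy as follows. The function names, the toy inputs, and the finite-width Monte Carlo comparison are illustrative assumptions, not part of the paper; the comparison assumes a single hidden layer with i.i.d. Gaussian weights of variance $\sigma_w^2/d_0$, biases of variance $\sigma_b^2$, and features normalized by the square root of the width.

```python
import numpy as np

def pre_activation_kernel(X, sigma_w=1.0, sigma_b=0.0):
    """Eq. (22): K1[a, b] = sigma_b^2 + (sigma_w^2 / d0) * <x_a, x_b>."""
    d0 = X.shape[1]
    return sigma_b**2 + (sigma_w**2 / d0) * (X @ X.T)

def nngp_relu_kernel(K1):
    """Eq. (23): post-activation ReLU NNGP kernel; assumes K1 has a positive diagonal."""
    diag = np.diag(K1)
    norm = np.sqrt(np.outer(diag, diag))
    cos_theta = np.clip(K1 / norm, -1.0, 1.0)    # guard against round-off outside [-1, 1]
    theta = np.arccos(cos_theta)
    tau = np.sin(theta) + (np.pi - theta) * cos_theta
    return tau / (2.0 * np.pi) * norm

def finite_width_estimate(X, width, sigma_w=1.0, sigma_b=0.0, seed=0):
    """Monte Carlo estimate of the same kernel from a random ReLU layer (assumed setup)."""
    rng = np.random.default_rng(seed)
    d0 = X.shape[1]
    W = rng.normal(scale=sigma_w / np.sqrt(d0), size=(d0, width))
    b = rng.normal(scale=sigma_b, size=width)
    H = np.maximum(X @ W + b, 0.0)               # ReLU features, shape (num_points, width)
    return H @ H.T / width

X = np.array([[1.0, 0.0, 2.0], [0.5, -1.0, 1.0], [-1.0, 1.5, 0.0]])
K1 = pre_activation_kernel(X, sigma_w=1.0, sigma_b=0.5)
Q1 = nngp_relu_kernel(K1)
Q1_mc = finite_width_estimate(X, width=200_000, sigma_w=1.0, sigma_b=0.5)
```

On the diagonal the angle is zero, so (23) reduces to $Q^{(1)}_{GP-ReLU}(\mathbf{x},\mathbf{x}) = K^{(1)}_{GP}(\mathbf{x},\mathbf{x})/2$, which gives a quick check of any implementation. The clipping step is a numerical safeguard only and does not alter the formula.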

Next, we define the ReLU-based derivative kernel $\dot{Q}^{(1)}_{GP-ReLU}(\cdot,\cdot)$ as follows:

$$
\dot{Q}_{GP-ReLU}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j}) = \frac{1}{2\pi}\left(\pi - \theta\right),
\tag{24}
$$

where $\theta = \theta_{c,i}^{c',j}$ is the angle defined in (23).

Finally, the NTK can be formulated as follows:

$$
\Theta^{(2)}_{NTK-ReLU}(\mathbf{x}^{c,i},\mathbf{x}^{c',j}) = K_{GP-ReLU}^{(2)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j}) + K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})\,\dot{Q}^{(1)}_{GP-ReLU}(\mathbf{x}^{c,i},\mathbf{x}^{c',j}).
\tag{25}
$$

Here, $K_{GP-ReLU}^{(2)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})$ can be defined using the recursive formulation:

$$
K_{GP-ReLU}^{(2)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j}) = \sigma_{b}^{2} + \sigma_{w}^{2}\,Q_{GP-ReLU}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j}).
\tag{26}
$$
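Equations (22) to (26) chain together into a single computation of the limiting NTK on a set of inputs. The sketch below transcribes them term by term in NumPy. The function name `relu_ntk`, the returned dictionary keys, and the toy inputs are hypothetical choices made for illustration, and the code assumes every input has a strictly positive $K^{(1)}_{GP}(\mathbf{x},\mathbf{x})$.

```python
import numpy as np

def relu_ntk(X, sigma_w=1.0, sigma_b=0.0):
    """Evaluate Eqs. (22)-(26) on the rows of X; returns all intermediate kernels."""
    d0 = X.shape[1]
    K1 = sigma_b**2 + (sigma_w**2 / d0) * (X @ X.T)            # Eq. (22)
    diag = np.diag(K1)
    norm = np.sqrt(np.outer(diag, diag))
    cos_theta = np.clip(K1 / norm, -1.0, 1.0)                  # round-off guard
    theta = np.arccos(cos_theta)
    tau = np.sin(theta) + (np.pi - theta) * cos_theta
    Q1 = tau / (2.0 * np.pi) * norm                            # Eq. (23)
    Q1_dot = (np.pi - theta) / (2.0 * np.pi)                   # Eq. (24)
    K2 = sigma_b**2 + sigma_w**2 * Q1                          # Eq. (26)
    ntk = K2 + K1 * Q1_dot                                     # Eq. (25), elementwise product
    return {"K1": K1, "Q1": Q1, "Q1_dot": Q1_dot, "K2": K2, "ntk": ntk}

sigma_w, sigma_b = 1.5, 0.3
X = np.array([[1.0, 2.0], [-0.5, 1.0], [2.0, -1.0], [0.0, 1.0]])
kernels = relu_ntk(X, sigma_w=sigma_w, sigma_b=sigma_b)

# On the diagonal theta = 0, so Q1 = K1 / 2 and Q1_dot = 1 / 2, which gives
# Theta(x, x) = sigma_b^2 + (sigma_w^2 + 1) * K1(x, x) / 2 under Eqs. (25)-(26).
k1_diag = np.diag(kernels["K1"])
expected_ntk_diag = sigma_b**2 + (sigma_w**2 + 1.0) * k1_diag / 2.0
```

Returning the intermediate kernels makes it easy to compare the NNGP kernel `Q1` and the NTK `ntk` on the same data, which is the comparison this appendix sets up.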

Appendix C General Results for NC1 with Kernels

In this section, we present general results for calculating $\mathbb{E}\left[\mathcal{NC}_1(\mathbf{H})\right]$ for any given kernel function $Q(\cdot,\cdot)$ that is associated with the features $\mathbf{H}$. To begin with, we consider a generic formulation of the three cases for $\mathbb{E}\left[Q(x^{c,i},x^{c',j})\right]$:

$$
\mathbb{E}\left[Q(x^{c,i},x^{c',j})\right] =
\begin{cases}
V^{(1)}(c) & \text{if } c = c',\ i = j \\
V^{(2)}(c) & \text{if } c = c',\ i \neq j \\
V^{(3)}(c,c') & \text{if } c \neq c'
\end{cases}.
\tag{27}
$$

Lemma C.1.

Given the cases for the expected values of a kernel function $Q(\cdot,\cdot)$ as per (27), the expectation $\mathbb{E}\left[\mathrm{tr}(\boldsymbol{\Sigma}_W(\mathbf{H}))\right]$ is given by:

$$
\mathbb{E}\left[\mathrm{tr}(\boldsymbol{\Sigma}_W(\mathbf{H}))\right] = \sum_{c=1}^{2}\left[\frac{n_c}{N}V^{(1)}(c) - \frac{1}{2n_c^2}\left(n_c(n_c-1)V^{(2)}(c) + n_c V^{(1)}(c)\right)\right]
\tag{28}
$$

Proof.

By leveraging Theorem 4.1, we can compute the expected value of $\mathrm{tr}(\boldsymbol{\Sigma}_W(\mathbf{H}))$ as follows:

$$
\begin{aligned}
\mathbb{E}\left[\mathrm{tr}(\boldsymbol{\Sigma}_W(\mathbf{H}))\right]
&= \mathbb{E}\left[\frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_c} Q(x^{c,i},x^{c,i})\right] - \mathbb{E}\left[\frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_c^2}\sum_{i=1}^{n_c}\sum_{j=1}^{n_c} Q(x^{c,i},x^{c,j})\right] \\
&= \frac{1}{N}\sum_{c=1}^{2}\sum_{i=1}^{n_c}\mathbb{E}\left[Q(x^{c,i},x^{c,i})\right] - \frac{1}{2}\sum_{c=1}^{2}\frac{1}{n_c^2}\sum_{i=1}^{n_c}\sum_{j=1}^{n_c}\mathbb{E}\left[Q(x^{c,i},x^{c,j})\right] \\
&= \frac{1}{N}\sum_{c=1}^{2}\sum_{i=1}^{n_c} V^{(1)}(c) - \frac{1}{2}\sum_{c=1}^{2}\frac{1}{n_c^2}\left(n_c(n_c-1)V^{(2)}(c) + n_c V^{(1)}(c)\right) \\
&= \sum_{c=1}^{2}\left[\frac{n_c}{N}V^{(1)}(c) - \frac{1}{2n_c^2}\left(n_c(n_c-1)V^{(2)}(c) + n_c V^{(1)}(c)\right)\right].
\end{aligned}
\tag{29}
$$

Lemma C.2.

Given the cases for the expected values of a kernel function $Q(\cdot,\cdot)$ as per (27), the expectation $\mathbb{E}\left[\mathrm{tr}(\boldsymbol{\Sigma}_B(\mathbf{H}))\right]$ is given by:

$$
\mathbb{E}\left[\mathrm{tr}(\boldsymbol{\Sigma}_B(\mathbf{H}))\right] = \left[\sum_{c=1}^{2}\left(\frac{1}{2n_c^2} - \frac{1}{N^2}\right)\left(n_c(n_c-1)V^{(2)}(c) + n_c V^{(1)}(c)\right)\right] - \frac{2n_1 n_2}{N^2}V^{(3)}(1,2)
\tag{30}
$$

Proof.

The expected value of $\mathrm{tr}(\boldsymbol{\Sigma}_B(\mathbf{H}))$ can be computed using Theorem 4.1 as:

$$
\begin{aligned}
\mathbb{E}\left[\mathrm{tr}(\boldsymbol{\Sigma}_B(\mathbf{H}))\right]
&= \mathbb{E}\left[\frac{1}{C}\sum_{c=1}^{C}\frac{1}{n_c^2}\sum_{i=1}^{n_c}\sum_{j=1}^{n_c} Q(x^{c,i},x^{c,j})\right] - \mathbb{E}\left[\frac{1}{N^2}\sum_{c=1}^{C}\sum_{c'=1}^{C}\sum_{i=1}^{n_c}\sum_{j=1}^{n_{c'}} Q(x^{c,i},x^{c',j})\right] \\
&= \left[\frac{1}{2}\sum_{c=1}^{2}\frac{1}{n_c^2}\left(n_c(n_c-1)V^{(2)}(c) + n_c V^{(1)}(c)\right)\right] \\
&\qquad - \frac{1}{N^2}\left[\sum_{c=1}^{2}\left(n_c(n_c-1)V^{(2)}(c) + n_c V^{(1)}(c)\right)\right] \\
&\qquad - \frac{1}{N^2}\left[2\sum_{i=1}^{n_1}\sum_{j=1}^{n_2} V^{(3)}(c=1,c'=2)\right] \\
&= \left[\sum_{c=1}^{2}\left(\frac{1}{2n_c^2} - \frac{1}{N^2}\right)\left(n_c(n_c-1)V^{(2)}(c) + n_c V^{(1)}(c)\right)\right] - \frac{2n_1 n_2}{N^2}V^{(3)}(1,2).
\end{aligned}
\tag{31}
$$
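
As a sanity check on Lemmas C.1 and C.2, the following minimal sketch evaluates the kernel expressions for $\mathrm{tr}(\boldsymbol{\Sigma}_W)$ and $\mathrm{tr}(\boldsymbol{\Sigma}_B)$ (as they appear in the first lines of (29) and (31)) on a Gram matrix whose entries are set exactly to the expected values in (27), and compares them with the closed forms (28) and (30). The function names `block_gram`, `traces_from_gram`, and `traces_from_lemmas` are hypothetical helpers introduced only for this illustration, and the two-class setting $C=2$ is assumed throughout.

```python
from fractions import Fraction

def block_gram(n, V1, V2, V3):
    """Gram matrix whose entries equal the expected kernel values of (27).

    n = (n_1, n_2); V1, V2 are per-class pairs; V3 is the cross-class value.
    """
    labels = [c for c in range(2) for _ in range(n[c])]
    N = len(labels)
    Q = [[None] * N for _ in range(N)]
    for a in range(N):
        for b in range(N):
            ca, cb = labels[a], labels[b]
            if a == b:
                Q[a][b] = V1[ca]      # same sample
            elif ca == cb:
                Q[a][b] = V2[ca]      # same class, different samples
            else:
                Q[a][b] = V3          # different classes
    return Q, labels

def traces_from_gram(Q, labels):
    """tr(Sigma_W), tr(Sigma_B) from the kernel sums in (29) and (31)."""
    N = len(labels)
    classes = sorted(set(labels))
    diag_term = sum(Q[a][a] for a in range(N)) / N
    class_term = 0
    for c in classes:
        idx = [a for a in range(N) if labels[a] == c]
        class_term += sum(Q[a][b] for a in idx for b in idx) / len(idx) ** 2
    class_term /= len(classes)
    global_term = sum(map(sum, Q)) / N ** 2
    return diag_term - class_term, class_term - global_term

def traces_from_lemmas(n, V1, V2, V3):
    """Closed forms (28) and (30) for two classes."""
    N = n[0] + n[1]
    S = [n[c] * (n[c] - 1) * V2[c] + n[c] * V1[c] for c in range(2)]
    tr_w = sum(V1[c] * n[c] / N - S[c] / (2 * n[c] ** 2) for c in range(2))
    tr_b = sum(S[c] / (2 * n[c] ** 2) - S[c] / N ** 2 for c in range(2))
    tr_b -= 2 * n[0] * n[1] * V3 / N ** 2
    return tr_w, tr_b

# Illustrative (made-up) values, kept as exact fractions.
n = (3, 5)
V1 = (Fraction(2), Fraction(3))
V2 = (Fraction(1, 2), Fraction(1))
V3 = Fraction(1, 4)
Q, labels = block_gram(n, V1, V2, V3)
print(traces_from_gram(Q, labels) == traces_from_lemmas(n, V1, V2, V3))
```

Because every Gram entry is replaced by its expectation, the two computations agree exactly here; with a kernel evaluated on sampled data they would agree only in expectation.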

Lemma C.3.

Given the cases for the expected values of a kernel function $Q(\cdot,\cdot)$ as per (27), the expectation $\mathbb{E}\left[\mathcal{NC}_1(\mathbf{H})\right]$ is given by:

$$
\mathbb{E}\left[\mathcal{NC}_1(\mathbf{H})\right] = \frac{\sum\limits_{c=1}^{2}\left[\frac{n_c V^{(1)}(c)}{N} - \frac{n_c(n_c-1)V^{(2)}(c) + n_c V^{(1)}(c)}{2n_c^2}\right]}{\left[\sum\limits_{c=1}^{2}\left(\frac{1}{2n_c^2} - \frac{1}{N^2}\right)\left(n_c(n_c-1)V^{(2)}(c) + n_c V^{(1)}(c)\right)\right] - \frac{2n_1 n_2 V^{(3)}(1,2)}{N^2}} + \Delta_{h.o.t}
\tag{32}
$$

Proof.

Note that the expectation of the ratio can be written as:

\begin{align}
\mathbb{E}\left[\mathcal{NC}_1(\mathbf{H})\right] &= \frac{\mathbb{E}\left[\mathrm{tr}(\boldsymbol{\Sigma}_W(\mathbf{H}))\right]}{\mathbb{E}\left[\mathrm{tr}(\boldsymbol{\Sigma}_B(\mathbf{H}))\right]} + \Delta_{h.o.t} \tag{33} \\
&= \frac{\sum\limits_{c=1}^{2}\left[\frac{n_c V^{(1)}(c)}{N} - \frac{n_c(n_c-1)V^{(2)}(c) + n_c V^{(1)}(c)}{2n_c^2}\right]}{\left[\sum\limits_{c=1}^{2}\left(\frac{1}{2n_c^2} - \frac{1}{N^2}\right)\left(n_c(n_c-1)V^{(2)}(c) + n_c V^{(1)}(c)\right)\right] - \frac{2n_1 n_2 V^{(3)}(1,2)}{N^2}} + \Delta_{h.o.t} \tag{34}
\end{align}

Here, $\Delta_{h.o.t}$ corresponds to higher-order terms given by [53]:

$$
\Delta_{h.o.t} = \frac{\mathrm{Var}\left(\mathrm{tr}(\boldsymbol{\Sigma}_B(\mathbf{H}))\right)\,\mathbb{E}\left[\mathrm{tr}(\boldsymbol{\Sigma}_W(\mathbf{H}))\right]}{\mathbb{E}\left[\mathrm{tr}(\boldsymbol{\Sigma}_B(\mathbf{H}))\right]^3} - \frac{\mathrm{Cov}\left(\mathrm{tr}(\boldsymbol{\Sigma}_W(\mathbf{H})),\mathrm{tr}(\boldsymbol{\Sigma}_B(\mathbf{H}))\right)}{\mathbb{E}\left[\mathrm{tr}(\boldsymbol{\Sigma}_B(\mathbf{H}))\right]^2},
\tag{35}
$$

where, based on the well-studied concentration of sample covariance matrices around the statistical covariance [54], $\Delta_{h.o.t}$ tends to $0$ for large $n_c$ values.
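
To illustrate how Lemma C.3 is used, the sketch below evaluates only the leading-order ratio in (32), i.e., it drops $\Delta_{h.o.t}$, for two classes. The function name `expected_nc1_leading` and the numerical values are assumptions made for this example and are not taken from the paper's experiments.

```python
from fractions import Fraction

def expected_nc1_leading(n, V1, V2, V3):
    """Leading-order term of (32): E[tr(Sigma_W)] / E[tr(Sigma_B)].

    n = (n_1, n_2); V1, V2 hold the per-class values V^(1)(c), V^(2)(c);
    V3 is V^(3)(1, 2). The higher-order term Delta_{h.o.t} is ignored.
    """
    n1, n2 = n
    N = n1 + n2
    # S_c = n_c (n_c - 1) V^(2)(c) + n_c V^(1)(c), shared by (28) and (30).
    S = [nc * (nc - 1) * v2 + nc * v1 for nc, v1, v2 in zip(n, V1, V2)]
    tr_w = sum(v1 * nc / N - s / (2 * nc ** 2) for nc, v1, s in zip(n, V1, S))
    tr_b = sum(s / (2 * nc ** 2) - s / N ** 2 for nc, s in zip(n, S))
    tr_b -= 2 * n1 * n2 * V3 / N ** 2
    return tr_w / tr_b

# Small balanced example with exact arithmetic.
small = expected_nc1_leading(
    (4, 4), (Fraction(2), Fraction(2)), (Fraction(1), Fraction(1)), Fraction(0)
)
print(small)  # 6/5

# Same kernel values with many samples per class (floating point).
large = expected_nc1_leading((10**6, 10**6), (2.0, 2.0), (1.0, 1.0), 0.0)
print(large)
```

In the balanced case $n_1 = n_2$ with class-independent values $V^{(1)}, V^{(2)}, V^{(3)}$, substituting into (28) and (30) shows that this ratio approaches $2\,(V^{(1)} - V^{(2)}) / (V^{(2)} - V^{(3)})$ as $n_c$ grows, which is why the second call returns a value close to $2$.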

Lemma C.4.

For a random variable $x^{c,i} \sim \mathcal{N}(\mu_c, \sigma_c^2)$, which represents the $i^{th}$ sample of class $c$ (as per the notation in Section 5.3), the expected value $\mathbb{E}\left[\frac{1}{(x^{c,i})^2}\right]$ is given by:

$$
T(c) = \mathbb{E}\left[\frac{1}{(x^{c,i})^2}\right] = \frac{1}{(\mu_c^2 + \sigma_c^2)} + \frac{2\sigma_c^4 + 4\sigma_c^2\mu_c^2}{(\mu_c^2 + \sigma_c^2)^3}
\tag{36}
$$

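
The closed form in (36) can be checked numerically with a short sketch. It assumes the standard Gaussian moments $\mathbb{E}[x^2] = \mu^2 + \sigma^2$ and $\mathbb{E}[x^4] = 3\sigma^4 + 6\sigma^2\mu^2 + \mu^4$, and compares the expression $1/\mathbb{E}[x^2] + \mathrm{Var}(x^2)/\mathbb{E}[x^2]^3$ with the right-hand side of (36). The function names and the test values are hypothetical and chosen only for illustration.

```python
from fractions import Fraction

def fourth_moment(mu, sigma2):
    """E[x^4] for x ~ N(mu, sigma2), where sigma2 is the variance."""
    return 3 * sigma2 ** 2 + 6 * sigma2 * mu ** 2 + mu ** 4

def T_from_moments(mu, sigma2):
    """1 / E[x^2] + Var(x^2) / E[x^2]^3, built from the raw moments."""
    second = mu ** 2 + sigma2                          # E[x^2]
    var_sq = fourth_moment(mu, sigma2) - second ** 2   # Var(x^2)
    return 1 / second + var_sq / second ** 3

def T_closed_form(mu, sigma2):
    """Right-hand side of (36)."""
    s = mu ** 2 + sigma2
    return 1 / s + (2 * sigma2 ** 2 + 4 * sigma2 * mu ** 2) / s ** 3

mu, sigma2 = Fraction(2), Fraction(1)
print(T_from_moments(mu, sigma2), T_closed_form(mu, sigma2))  # 43/125 43/125
```

Exact fractions are used so that the agreement between the two expressions is an identity check and not a floating-point coincidence.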
Proof.

Based on the standard second-order result on the expectation of ratios [53] (omitting higher-order terms), we get:

\begin{align}
\mathbb{E}\left[\frac{1}{(x^{c,i})^2}\right] &= \frac{1}{\mathbb{E}\left[(x^{c,i})^2\right]} + \frac{\mathrm{Var}\left((x^{c,i})^2\right)}{\mathbb{E}\left[(x^{c,i})^2\right]^3} \tag{37} \\
&= \frac{1}{(\mu_c^2 + \sigma_c^2)} + \frac{\mathbb{E}\left[(x^{c,i})^4\right] - (\mu_c^2 + \sigma_c^2)^2}{(\mu_c^2 + \sigma_c^2)^3} \tag{38}
\end{align}

Using the moment-generating function of $x^{c,i}$, we know that:

$$\mathbb{E}[(x^{c,i})^{4}]=3\sigma_{c}^{4}+6\sigma_{c}^{2}\mu_{c}^{2}+\mu_{c}^{4}, \tag{39}$$

which gives us:

\begin{align}
\mathbb{E}\left[\frac{1}{(x^{c,i})^{2}}\right] &= \frac{1}{(\mu_{c}^{2}+\sigma_{c}^{2})}+\frac{3\sigma_{c}^{4}+6\sigma_{c}^{2}\mu_{c}^{2}+\mu_{c}^{4}-(\mu_{c}^{2}+\sigma_{c}^{2})^{2}}{(\mu_{c}^{2}+\sigma_{c}^{2})^{3}} \tag{40}\\
&= \frac{1}{(\mu_{c}^{2}+\sigma_{c}^{2})}+\frac{2\sigma_{c}^{4}+4\sigma_{c}^{2}\mu_{c}^{2}}{(\mu_{c}^{2}+\sigma_{c}^{2})^{3}}. \tag{41}
\end{align}

This proves the lemma.
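The algebra in (37)-(41) can be checked mechanically. The following is a minimal sketch, assuming $x^{c,i}\sim\mathcal{N}(\mu_c,\sigma_c^2)$ (the distribution whose fourth moment matches (39)). The helper names `gaussian_raw_moment`, `t_from_moments`, and `t_closed_form` are illustrative and not part of the paper. Exact rational arithmetic is used so that the comparison is free of rounding error.

```python
from fractions import Fraction


def gaussian_raw_moment(k, mu, sigma2):
    """k-th raw moment of N(mu, sigma2), via the recursion
    m_k = mu * m_{k-1} + (k - 1) * sigma2 * m_{k-2}."""
    prev, curr = Fraction(1), Fraction(mu)  # m_0 and m_1
    if k == 0:
        return prev
    for n in range(2, k + 1):
        prev, curr = curr, mu * curr + (n - 1) * sigma2 * prev
    return curr


def t_from_moments(mu, sigma2):
    """Right-hand side of (38), built from the second and fourth moments."""
    m2 = gaussian_raw_moment(2, mu, sigma2)
    m4 = gaussian_raw_moment(4, mu, sigma2)
    return 1 / m2 + (m4 - m2**2) / m2**3


def t_closed_form(mu, sigma2):
    """Right-hand side of (36) and (41)."""
    s = mu**2 + sigma2
    return 1 / s + (2 * sigma2**2 + 4 * sigma2 * mu**2) / s**3


mu, sigma2 = Fraction(3), Fraction(1, 4)  # arbitrary illustrative values
# Fourth moment agrees with (39).
assert gaussian_raw_moment(4, mu, sigma2) == 3 * sigma2**2 + 6 * sigma2 * mu**2 + mu**4
# Substituting (39) into (38) reproduces (41).
assert t_from_moments(mu, sigma2) == t_closed_form(mu, sigma2)
```

The check only confirms that (38) and (41) agree once (39) is substituted; it does not assess the accuracy of the ratio expansion in (37) itself.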

Appendix D Proof of Theorem 5.1

D.1 NC1 of limiting NNGP with ReLU activation

In the limit $d_{1}\to\infty$, we leverage the kernels in the GP limit as per (10), (23). Observe that for any two data points $x^{c,i},x^{c^{\prime},j}\in\mathbb{R}$, the value of $\theta_{c,i}^{c^{\prime},j}$ can be given as:

\begin{align*}
\theta_{c,i}^{c^{\prime},j} &= \arccos\left(\frac{K_{GP}^{(1)}(x^{c,i},x^{c^{\prime},j})}{\sqrt{K_{GP}^{(1)}(x^{c,i},x^{c,i})K_{GP}^{(1)}(x^{c^{\prime},j},x^{c^{\prime},j})}}\right)\\
&= \arccos\left(\frac{\sigma_{b}^{2}+\frac{\sigma_{w}^{2}}{d_{0}}x^{c,i}x^{c^{\prime},j}}{\sqrt{\left(\sigma_{b}^{2}+\frac{\sigma_{w}^{2}}{d_{0}}x^{c,i}x^{c,i}\right)\left(\sigma_{b}^{2}+\frac{\sigma_{w}^{2}}{d_{0}}x^{c^{\prime},j}x^{c^{\prime},j}\right)}}\right).
\end{align*}

Since $\sigma_{b}\to 0$, the value of $\theta_{c,i}^{c^{\prime},j}$ simplifies to:

$$\theta_{c,i}^{c^{\prime},j}=\begin{cases}0&\text{if }c=c^{\prime}\\ \pi&\text{if }c\neq c^{\prime}\end{cases}, \tag{42}$$

which follows from $\frac{x^{c,i}x^{c^{\prime},j}}{\sqrt{x^{c,i}x^{c,i}}\sqrt{x^{c^{\prime},j}x^{c^{\prime},j}}}=\operatorname{sign}(x^{c,i})\operatorname{sign}(x^{c^{\prime},j})$ and $x^{1,i}<0,\ x^{2,j}>0$ almost surely. This leads to:

\begin{align}
Q_{GP-ReLU}^{(1)}(x^{c,i},x^{c^{\prime},j}) &= \frac{1}{2\pi}\sqrt{\sigma_{w}^{4}(x^{c,i})^{2}(x^{c^{\prime},j})^{2}}\left(\sin\theta_{c,i}^{c^{\prime},j}+\left(\pi-\theta_{c,i}^{c^{\prime},j}\right)\cos\theta_{c,i}^{c^{\prime},j}\right) \tag{43}\\
\implies Q_{GP-ReLU}^{(1)}(x^{c,i},x^{c^{\prime},j}) &= \begin{cases}\frac{\sigma_{w}^{2}}{2}\left|x^{c,i}\right|\left|x^{c^{\prime},j}\right|&\text{if }c=c^{\prime}\\ 0&\text{if }c\neq c^{\prime}\end{cases} \tag{44}
\end{align}
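The two cases in (42) and (44) can be reproduced numerically for scalar inputs. The following is a rough sketch, assuming $d_0=1$, $\sigma_b=0$ (standing in for the limit $\sigma_b\to 0$), and $\sigma_w=1$; the function names `k1`, `theta`, and `q_relu` are illustrative. The cosine is clamped to $[-1,1]$ because floating-point rounding can push the ratio slightly outside the domain of `math.acos`.

```python
import math


def k1(x, y, sigma_w=1.0, sigma_b=0.0, d0=1):
    """First-layer GP kernel for scalar inputs: sigma_b^2 + (sigma_w^2 / d0) * x * y."""
    return sigma_b**2 + (sigma_w**2 / d0) * x * y


def theta(x, y, sigma_w=1.0):
    """Angle between x and y under k1, with sigma_b = 0 and d0 = 1."""
    ratio = k1(x, y, sigma_w) / math.sqrt(k1(x, x, sigma_w) * k1(y, y, sigma_w))
    return math.acos(max(-1.0, min(1.0, ratio)))  # clamp against rounding error


def q_relu(x, y, sigma_w=1.0):
    """Equation (43) evaluated for scalar inputs."""
    t = theta(x, y, sigma_w)
    scale = math.sqrt(sigma_w**4 * x**2 * y**2)
    return scale / (2 * math.pi) * (math.sin(t) + (math.pi - t) * math.cos(t))


# Class 1 samples are negative, class 2 samples are positive.
x_a, x_b = -1.5, -0.5   # same class
x_c = 2.0               # other class

assert math.isclose(theta(x_a, x_b), 0.0, abs_tol=1e-7)
assert math.isclose(theta(x_a, x_c), math.pi, abs_tol=1e-7)
# Same class: sigma_w^2 / 2 * |x| * |x'|, as in (44).
assert math.isclose(q_relu(x_a, x_b), 0.5 * abs(x_a) * abs(x_b), rel_tol=1e-9)
# Different classes: the kernel vanishes.
assert math.isclose(q_relu(x_a, x_c), 0.0, abs_tol=1e-12)
```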

For the $c=c^{\prime}$ case, the value of the kernel boils down to the product of the norms of independent random variables drawn from the same distribution. Since we assume $x^{c,i}x^{c^{\prime},j}>0$ if $c=c^{\prime}$, equation (44) can be rewritten as:

$$Q_{GP-ReLU}^{(1)}(x^{c,i},x^{c^{\prime},j})=\begin{cases}\frac{\sigma_{w}^{2}}{2}x^{c,i}x^{c^{\prime},j}&\text{if }c=c^{\prime}\\ 0&\text{if }c\neq c^{\prime}\end{cases} \tag{45}$$

Additionally, since $x^{c,i}$ are random variables, the expected value of the kernel can be formulated as:

$$\mathbb{E}\left[Q_{GP-ReLU}^{(1)}(x^{c,i},x^{c^{\prime},j})\right]=\begin{cases}\frac{\sigma_{w}^{2}}{2}\left(\sigma_{c}^{2}+\mu_{c}^{2}\right)&\text{if }c=c^{\prime},\ i=j\\ \frac{\sigma_{w}^{2}}{2}\mu_{c}^{2}&\text{if }c=c^{\prime},\ i\neq j\\ 0&\text{if }c\neq c^{\prime}\end{cases} \tag{46}$$

Thus, based on our generic formulation of cases in (27) in Appendix C, we get:

$$V^{(1)}(c)=\frac{\sigma_{w}^{2}}{2}\left(\sigma_{c}^{2}+\mu_{c}^{2}\right);\qquad V^{(2)}(c)=\frac{\sigma_{w}^{2}}{2}\mu_{c}^{2};\qquad V^{(3)}(c,c^{\prime})=0. \tag{47}$$
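The values of $V^{(1)}(c)$ and $V^{(2)}(c)$ in (47) can be sanity-checked by simulation. The following is a minimal Monte Carlo sketch, assuming Gaussian samples $x^{c,i}\sim\mathcal{N}(\mu_c,\sigma_c^2)$ and using the same-class kernel from (45) directly. The names `kernel_same_class` and `estimate_v`, the sample count, and the seed are illustrative choices, not taken from the paper.

```python
import random


def kernel_same_class(x, y, sigma_w=1.0):
    """Same-class branch of (45): sigma_w^2 / 2 * x * y."""
    return 0.5 * sigma_w**2 * x * y


def estimate_v(mu, sigma, sigma_w=1.0, trials=200_000, seed=0):
    """Monte Carlo estimates of V1 (i = j) and V2 (i != j) for one class."""
    rng = random.Random(seed)
    v1 = v2 = 0.0
    for _ in range(trials):
        xi = rng.gauss(mu, sigma)
        xj = rng.gauss(mu, sigma)  # independent sample from the same class
        v1 += kernel_same_class(xi, xi, sigma_w)
        v2 += kernel_same_class(xi, xj, sigma_w)
    return v1 / trials, v2 / trials


mu, sigma = 2.0, 0.5
v1_hat, v2_hat = estimate_v(mu, sigma)
v1_exact = 0.5 * (sigma**2 + mu**2)  # V1 from (47) with sigma_w = 1
v2_exact = 0.5 * mu**2               # V2 from (47) with sigma_w = 1
assert abs(v1_hat - v1_exact) < 0.03
assert abs(v2_hat - v2_exact) < 0.03
```

The tolerance of 0.03 is several standard errors wide for this sample size, so the assertions are a loose consistency check rather than a precise estimate.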

As $N\gg 1$ and $n_{c}\gg 1,\ \forall c\in\{1,2\}$, Lemma C.3 gives us:

$$\begin{split}\mathbb{E}[\mathcal{NC}_{1}(\mathbf{H}_{GP})]&=\frac{\sum_{c=1}^{2}\frac{n_{c}V^{(1)}(c)}{N}-\frac{V^{(2)}(c)}{2}}{\left[\sum_{c=1}^{2}\left(\frac{1}{2n_{c}^{2}}-\frac{1}{N^{2}}\right)\left(n_{c}^{2}V^{(2)}(c)\right)\right]-\frac{2n_{1}n_{2}}{N^{2}}V^{(3)}(1,2)}+\Delta_{h.o.t}\\ \implies\mathbb{E}[\mathcal{NC}_{1}(\mathbf{H}_{GP})]&=\frac{\sum_{c=1}^{2}\frac{n_{c}\mu_{c}^{2}+n_{c}\sigma_{c}^{2}}{N}-\frac{\mu_{c}^{2}}{2}}{\left(\sum_{c=1}^{2}\frac{\mu_{c}^{2}}{2}-\frac{n_{c}^{2}\mu_{c}^{2}}{N^{2}}\right)}+\Delta_{h.o.t}.\end{split} \tag{48}$$
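To get a feel for how the data parameters enter (48), the leading-order term can be evaluated directly. The following is a minimal sketch that drops $\Delta_{h.o.t}$ entirely; the function name `nc1_nngp_leading` and the example values are illustrative assumptions.

```python
import math


def nc1_nngp_leading(n, mu, sigma):
    """Leading-order term of (48) for two classes, ignoring Delta_h.o.t.

    n, mu, sigma are length-2 sequences of class sizes, means, and
    standard deviations.
    """
    total = sum(n)  # N
    numerator = sum(
        nc * (m**2 + s**2) / total - m**2 / 2
        for nc, m, s in zip(n, mu, sigma)
    )
    denominator = sum(
        m**2 / 2 - nc**2 * m**2 / total**2
        for nc, m in zip(n, mu)
    )
    return numerator / denominator


# Balanced classes with means -2 and 2 and unit standard deviation.
value = nc1_nngp_leading([100, 100], [-2.0, 2.0], [1.0, 1.0])
assert math.isclose(value, 0.5)
```

In this balanced, symmetric setting ($n_1=n_2$, $\mu_1=-\mu_2=\mu$, $\sigma_1=\sigma_2=\sigma$), substituting into (48) gives a leading term of $2\sigma^2/\mu^2$, which matches the value 0.5 above.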

D.2 NC1 of limiting NTK with ReLU activation

The recursive relationship between the NTK and the NNGP [31, 35], stated in (13), is given by:

$$\Theta^{(2)}_{NTK-ReLU}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})=K_{GP-ReLU}^{(2)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})+K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})\,\dot{Q}^{(1)}_{GP-ReLU}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j}) \tag{49}$$

Here, $K_{GP-ReLU}^{(2)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})$ can be defined using the following recursive formulation:

$$K_{GP-ReLU}^{(2)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})=\sigma_{b}^{2}+\sigma_{w}^{2}Q_{GP-ReLU}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j}). \tag{50}$$

Based on (24), the derivative $\dot{Q}^{(1)}_{GP-ReLU}$ can be given as follows:

$$\begin{split}\dot{Q}_{GP-ReLU}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})&=\frac{1}{2\pi}\left(\pi-\theta_{c,i}^{c^{\prime},j}\right)\\ \theta_{c,i}^{c^{\prime},j}&=\arccos\left(\frac{K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c^{\prime},j})}{\sqrt{K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c,i})K_{GP}^{(1)}(\mathbf{x}^{c^{\prime},j},\mathbf{x}^{c^{\prime},j})}}\right).\end{split} \tag{51}$$

We build on the results from the NNGP analysis (with $Q_{GP-ReLU}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})$) for computing the variability collapse with the limiting NTK. First, note that:

$$\theta_{c,i}^{c',j}=\begin{cases}0&\text{if }c=c'\\ \pi&\text{if }c\neq c'\end{cases}.\tag{52}$$

When $\sigma_b\to 0$, we get:

$$\Theta^{(2)}_{NTK-ReLU}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})=\sigma_w^2 Q_{GP-ReLU}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})+K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})\dot{Q}^{(1)}_{GP-ReLU}(\mathbf{x}^{c,i},\mathbf{x}^{c',j}).\tag{53}$$

From (45), we know that:

\begin{align}
Q_{GP-ReLU}^{(1)}(x^{c,i},x^{c',j})&=\begin{cases}\frac{\sigma_w^2}{2}x^{c,i}x^{c',j}&\text{if }c=c'\\ 0&\text{if }c\neq c'\end{cases}\tag{54}\\
\implies\Theta_{NTK-ReLU}^{(2)}(x^{c,i},x^{c',j})&=\begin{cases}\frac{\sigma_w^4}{2}x^{c,i}x^{c,j}+\frac{\sigma_w^2}{2}x^{c,i}x^{c,j}&\text{if }c=c'\\ 0&\text{if }c\neq c'\end{cases},\tag{55}\\
&=\begin{cases}\left(\frac{\sigma_w^4}{2}+\frac{\sigma_w^2}{2}\right)x^{c,i}x^{c,j}&\text{if }c=c'\\ 0&\text{if }c\neq c'\end{cases}.\tag{56}
\end{align}

Notice that $\Theta_{NTK-ReLU}^{(2)}(x^{c,i},x^{c',j})$ is a scaled version of $Q_{GP-ReLU}^{(1)}(x^{c,i},x^{c',j})$ (as per (45)). A common positive scaling of the kernel cancels in the NC1 ratio, so we end up with the same result as (48):

$$\mathbb{E}\left[\mathcal{NC}_1(\mathbf{H}_{NTK})\right]=\frac{\sum_{c=1}^{2}\frac{n_c\mu_c^2+n_c\sigma_c^2}{N}-\frac{\mu_c^2}{2}}{\left(\sum_{c=1}^{2}\frac{\mu_c^2}{2}-\frac{n_c^2\mu_c^2}{N^2}\right)}+\Delta_{h.o.t}.\tag{57}$$
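The piecewise forms in (52)-(56) can be checked numerically for scalar inputs with $\sigma_b=0$. The Python sketch below is only an illustration: it assumes the standard order-1 arc-cosine expression for $Q_{GP-ReLU}^{(1)}$ (chosen because it reproduces (54), not copied from (45)), and the function names `k_gp`, `q_relu`, `q_dot_relu` and `ntk_relu` are hypothetical.

```python
import math


def k_gp(x, y, sigma_w=1.0, sigma_b=0.0):
    """First-layer kernel K_GP^(1) for scalar inputs (d_0 = 1)."""
    return sigma_b**2 + sigma_w**2 * x * y


def _angle_and_norm(x, y, sigma_w, sigma_b):
    """Return theta from (51) and sqrt(K(x,x) K(y,y))."""
    norm = math.sqrt(k_gp(x, x, sigma_w, sigma_b) * k_gp(y, y, sigma_w, sigma_b))
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    cos_theta = max(-1.0, min(1.0, k_gp(x, y, sigma_w, sigma_b) / norm))
    return math.acos(cos_theta), norm


def q_relu(x, y, sigma_w=1.0, sigma_b=0.0):
    """Assumed arc-cosine (order 1) form of Q_GP-ReLU^(1)."""
    theta, norm = _angle_and_norm(x, y, sigma_w, sigma_b)
    return norm * (math.sin(theta) + (math.pi - theta) * math.cos(theta)) / (2 * math.pi)


def q_dot_relu(x, y, sigma_w=1.0, sigma_b=0.0):
    """Derivative kernel from (51)."""
    theta, _ = _angle_and_norm(x, y, sigma_w, sigma_b)
    return (math.pi - theta) / (2 * math.pi)


def ntk_relu(x, y, sigma_w=1.0, sigma_b=0.0):
    """Two-layer NTK from (53)."""
    return (sigma_w**2 * q_relu(x, y, sigma_w, sigma_b)
            + k_gp(x, y, sigma_w, sigma_b) * q_dot_relu(x, y, sigma_w, sigma_b))
```

For same-sign inputs the ratio `ntk_relu(x, y, sigma_w) / q_relu(x, y, sigma_w)` equals $\sigma_w^2+1$ under these assumptions, while for opposite-sign inputs both kernels vanish up to floating-point error.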

Appendix E Results for NC1 with Erf activation

E.1 NC1 of Limiting NNGP with Erf activation

Under the Assumptions described in Section 5.3 with $d_0=1, d_1\to\infty$, observe that (11) gives us:

\begin{align}
Q_{GP-Erf}^{(1)}(x^{c,i},x^{c',j})&=\frac{2}{\pi}\arcsin\left(\frac{2K_{GP}^{(1)}(x^{c,i},x^{c',j})}{\sqrt{1+2K_{GP}^{(1)}(x^{c,i},x^{c,i})}\sqrt{1+2K_{GP}^{(1)}(x^{c',j},x^{c',j})}}\right)\tag{58}\\
&=\frac{2}{\pi}\arcsin\left(\frac{2\sigma_b^2+2\sigma_w^2x^{c,i}x^{c',j}}{\sqrt{1+2\sigma_b^2+2\sigma_w^2(x^{c,i})^2}\sqrt{1+2\sigma_b^2+2\sigma_w^2(x^{c',j})^2}}\right).\tag{59}
\end{align}

Considering $\sigma_b\to 0$:

\begin{align}
Q_{GP-Erf}^{(1)}(x^{c,i},x^{c',j})&=\frac{2}{\pi}\arcsin\left(\frac{2\sigma_w^2x^{c,i}x^{c',j}}{\sqrt{1+2\sigma_w^2(x^{c,i})^2}\sqrt{1+2\sigma_w^2(x^{c',j})^2}}\right)\tag{60}\\
&=\frac{2}{\pi}\arcsin\left(\frac{\operatorname{sign}(x^{c,i})\operatorname{sign}(x^{c',j})}{\sqrt{1+\frac{1}{2\sigma_w^2(x^{c,i})^2}}\sqrt{1+\frac{1}{2\sigma_w^2(x^{c',j})^2}}}\right),\tag{61}
\end{align}

where the last equality comes from:

$$\frac{x^{c,i}x^{c',j}}{|x^{c,i}|\sqrt{1+\frac{1}{2\sigma_w^2(x^{c,i})^2}}\cdot|x^{c',j}|\sqrt{1+\frac{1}{2\sigma_w^2(x^{c',j})^2}}}=\frac{\operatorname{sign}(x^{c,i})\operatorname{sign}(x^{c',j})}{\sqrt{1+\frac{1}{2\sigma_w^2(x^{c,i})^2}}\sqrt{1+\frac{1}{2\sigma_w^2(x^{c',j})^2}}}.\tag{62}$$

For notational simplicity, consider:

$$\rho(x^{c,i},x^{c',j})=\sqrt{1+\frac{1}{2\sigma_w^2(x^{c,i})^2}}\sqrt{1+\frac{1}{2\sigma_w^2(x^{c',j})^2}},\tag{63}$$

and represent $Q_{GP-Erf}^{(1)}(x^{c,i},x^{c',j})$ as:

$$Q_{GP-Erf}^{(1)}(x^{c,i},x^{c',j})=\frac{2}{\pi}\arcsin\left(\frac{\operatorname{sign}(x^{c,i})\operatorname{sign}(x^{c',j})}{\rho(x^{c,i},x^{c',j})}\right).\tag{64}$$

Based on Assumption 1, we know that $x^{1,i}<0, x^{2,j}>0$ almost surely. This leads to:

$$Q_{GP-Erf}^{(1)}(x^{c,i},x^{c',j})=\begin{cases}\frac{2}{\pi}\arcsin\left(\frac{1}{\rho(x^{c,i},x^{c,j})}\right)&\text{if }c=c'\\ -\frac{2}{\pi}\arcsin\left(\frac{1}{\rho(x^{c,i},x^{c',j})}\right)&\text{if }c\neq c'\end{cases}.\tag{65}$$
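The equivalence between the direct form (60) and the sign form (64) can be verified numerically. The following Python sketch does so for scalar inputs with $\sigma_b=0$; the function names `q_erf_direct`, `rho` and `q_erf_sign` are hypothetical, and the inputs are assumed to be nonzero so that (63) is well defined.

```python
import math


def q_erf_direct(x, y, sigma_w=1.0):
    """Erf NNGP kernel as written in (60), with sigma_b = 0."""
    num = 2 * sigma_w**2 * x * y
    den = math.sqrt(1 + 2 * sigma_w**2 * x**2) * math.sqrt(1 + 2 * sigma_w**2 * y**2)
    return (2 / math.pi) * math.asin(num / den)


def rho(x, y, sigma_w=1.0):
    """The quantity rho from (63); requires x != 0 and y != 0."""
    return (math.sqrt(1 + 1 / (2 * sigma_w**2 * x**2))
            * math.sqrt(1 + 1 / (2 * sigma_w**2 * y**2)))


def q_erf_sign(x, y, sigma_w=1.0):
    """Sign form of the kernel as written in (64)."""
    sign_product = math.copysign(1.0, x) * math.copysign(1.0, y)
    return (2 / math.pi) * math.asin(sign_product / rho(x, y, sigma_w))
```

With class 1 on the negative axis and class 2 on the positive axis, `q_erf_sign` is positive for same-class pairs and negative for cross-class pairs, matching the two cases in (65).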

E.1.1 Calculating $\mathbb{E}\left[Q_{GP-Erf}^{(1)}(x^{c,i},x^{c',j})\right]$

For $|u|\leq 1$, we consider the expansion $\arcsin(u)=u+\frac{u^3}{6}+\cdots$ to obtain:

$$\mathbb{E}\left[\arcsin\left(\frac{1}{\rho(x^{c,i},x^{c',j})}\right)\right]=\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c',j})}\right]+\mathbb{E}\left[\frac{1}{6\rho(x^{c,i},x^{c',j})^3}\right]+\cdots.\tag{66}$$

Based on Assumption 1 of large enough $|\mu_1|,|\mu_2|$, we approximate the expectation with only the first term and let $\xi_{h.o.t}$ capture the effects of the higher-order terms. Notice that since $\rho(x^{c,i},x^{c',j})>1$ for finite $(x^{c,i},x^{c',j})$, the effects of $\xi_{h.o.t}$ are finite but decay rapidly compared to the first term. We thus get:

$$\mathbb{E}\left[\arcsin\left(\frac{1}{\rho(x^{c,i},x^{c',j})}\right)\right]=\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c',j})}\right]+\xi_{h.o.t}.\tag{67}$$
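To gauge the size of the higher-order terms for a given value of $\rho$, one can compare partial sums of the Maclaurin series of $\arcsin$ with the exact value. The Python sketch below is an illustration under our own naming (`arcsin_partial_sum` and `remainder_after_first_term` are hypothetical helpers); it works pointwise in $u=1/\rho$ and does not compute the expectation $\xi_{h.o.t}$ itself.

```python
import math


def arcsin_partial_sum(u, n_terms):
    """Sum of the first n_terms of the Maclaurin series of arcsin(u), |u| <= 1.

    The n-th term is C(2n, n) / (4^n (2n + 1)) * u^(2n + 1),
    so the first two terms are u and u^3 / 6.
    """
    total = 0.0
    for n in range(n_terms):
        coeff = math.comb(2 * n, n) / (4**n * (2 * n + 1))
        total += coeff * u ** (2 * n + 1)
    return total


def remainder_after_first_term(rho_value):
    """Pointwise gap arcsin(1/rho) - 1/rho left after keeping only the first term."""
    u = 1.0 / rho_value
    return math.asin(u) - u
```

The gap returned by `remainder_after_first_term` is positive for every $\rho>1$, and its magnitude depends on how far $\rho$ lies above 1, so it is worth evaluating for the specific $\sigma_w$ and class means of interest.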

Calculating the expectation 𝔼[1ρ(xc,i,xc,j)]𝔼delimited-[]1𝜌superscript𝑥𝑐𝑖superscript𝑥superscript𝑐𝑗\displaystyle\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c^{\prime},j})}\right]blackboard_E [ divide start_ARG 1 end_ARG start_ARG italic_ρ ( italic_x start_POSTSUPERSCRIPT italic_c , italic_i end_POSTSUPERSCRIPT , italic_x start_POSTSUPERSCRIPT italic_c start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT , italic_j end_POSTSUPERSCRIPT ) end_ARG ] can now be split based on c,c𝑐superscript𝑐\displaystyle c,c^{\prime}italic_c , italic_c start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT.

$\bullet$ Case $c=c',\ i=j$:

\begin{align}
\rho(x^{c,i},x^{c,i}) &= 1+\frac{1}{2\sigma_w^2(x^{c,i})^2} \tag{68}\\
\implies \mathbb{E}\left[\rho(x^{c,i},x^{c,i})\right] &= 1+\frac{1}{2\sigma_w^2}\mathbb{E}\left[\frac{1}{(x^{c,i})^2}\right] \tag{69}\\
&= 1+\frac{T(c)}{2\sigma_w^2}. \tag{70}
\end{align}

The last equality is based on Lemma C.4, which gives the expanded version of $T(c)$.

Finally, the value of $\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c,i})}\right]$ can be given as:

\begin{align}
\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c,i})}\right] &= \frac{1}{\mathbb{E}\left[\rho(x^{c,i},x^{c,i})\right]}+\frac{Var(\rho(x^{c,i},x^{c,i}))}{\mathbb{E}\left[\rho(x^{c,i},x^{c,i})\right]^{3}} \tag{71}\\
&= \frac{1}{1+\frac{T(c)}{2\sigma_w^2}}+\delta_{h.o.t}(\rho(x^{c,i},x^{c,i})) \tag{72}
\end{align}

Notice that even in this simple case, the expressions are non-trivial to fully expand. Nonetheless, along with Assumption 1, we consider large enough $|\mu_1|,|\mu_2|$ such that:

\begin{equation}
\frac{T(c)}{2\sigma_w^2}=\frac{1}{2\sigma_w^2}\left[\frac{1}{(\mu_c^2+\sigma_c^2)}+\frac{2\sigma_c^4+4\sigma_c^2\mu_c^2}{(\mu_c^2+\sigma_c^2)^3}\right]<1.
\tag{73}
\end{equation}

Thus, based on the expansion $(1+u)^{-1}=1-u+u^2-u^3+\cdots$ (valid for $|u|<1$, which (73) ensures for $u=\frac{T(c)}{2\sigma_w^2}$), we obtain the following cleaner approximation:

\begin{equation}
\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c,i})}\right]=1-\frac{T(c)}{2\sigma_w^2}+\Delta_{h.o.t}^{(1)}(c).
\tag{74}
\end{equation}

Here $\Delta_{h.o.t}^{(1)}(c)$ captures all the higher order terms corresponding to $\left(\frac{T(c)}{2\sigma_w^2}\right)^2-\left(\frac{T(c)}{2\sigma_w^2}\right)^3+\cdots$ and $\delta_{h.o.t}(\rho(x^{c,i},x^{c,i}))$ as denoted above.
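As a rough numerical sanity check of this case, the following sketch compares the first-order value in (74) against a Monte Carlo estimate of $\mathbb{E}[1/\rho(x^{c,i},x^{c,i})]$ built from (68). It assumes scalar samples $x^{c,i}\sim\mathcal{N}(\mu_c,\sigma_c^2)$, which is how we read the form of $T(c)$ in (73); the function names and the parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def T(mu: float, sigma: float) -> float:
    """Bracketed expression in (73): second-order approximation of E[1/x^2].

    Assumes x ~ N(mu, sigma^2); this data model is an assumption of the sketch.
    """
    m = mu**2 + sigma**2                       # E[x^2]
    v = 2 * sigma**4 + 4 * sigma**2 * mu**2    # Var(x^2)
    return 1.0 / m + v / m**3


def case1_first_order(mu: float, sigma: float, sigma_w: float) -> float:
    """Right-hand side of (74) with the higher order terms dropped."""
    return 1.0 - T(mu, sigma) / (2 * sigma_w**2)


def case1_monte_carlo(mu, sigma, sigma_w, n=200_000, seed=0) -> float:
    """Monte Carlo estimate of E[1 / rho(x, x)] using rho from (68)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(mu, sigma, size=n)
    rho = 1.0 + 1.0 / (2 * sigma_w**2 * x**2)
    return float(np.mean(1.0 / rho))


# Illustrative values only: a mean that is large relative to the spread.
mu_c, sigma_c, sigma_w = 5.0, 0.5, 1.0
print("first-order (74):", case1_first_order(mu_c, sigma_c, sigma_w))
print("Monte Carlo     :", case1_monte_carlo(mu_c, sigma_c, sigma_w))
```

Any gap between the two printed numbers corresponds to what $\Delta_{h.o.t}^{(1)}(c)$ collects in (74).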

$\bullet$ Case $c=c',\ i\neq j$:

In the case of $c=c',\ i\neq j$, the expectations on the square roots do not have a particular closed form. To this end, we leverage Assumption 1 to obtain the following approximation:

\begin{align}
\rho(x^{c,i},x^{c,j}) &= \sqrt{1+\frac{1}{2\sigma_w^2(x^{c,i})^2}}\sqrt{1+\frac{1}{2\sigma_w^2(x^{c,j})^2}} \tag{75}\\
&= \left(1+\frac{1}{4\sigma_w^2(x^{c,i})^2}+h.o.t\right)\left(1+\frac{1}{4\sigma_w^2(x^{c,j})^2}+h.o.t\right) \tag{76}\\
\implies \mathbb{E}\left[\rho(x^{c,i},x^{c,j})\right] &= \mathbb{E}\left[1+\frac{1}{4\sigma_w^2(x^{c,i})^2}+h.o.t\right]\mathbb{E}\left[1+\frac{1}{4\sigma_w^2(x^{c,j})^2}+h.o.t\right] \tag{77}
\end{align}

Observe that the inner terms in the expectations are scaled versions of the above case. To this end, we approximate $\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c,j})}\right]$ as:

\begin{align}
\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c,j})}\right] &\approx \frac{1}{\left(1+\frac{T(c)}{4\sigma_w^2}\right)^2}+\delta_{h.o.t}(\rho(x^{c,i},x^{c,j})) \tag{78}\\
&= \frac{1}{1+\frac{T(c)}{2\sigma_w^2}+\frac{T(c)^2}{16\sigma_w^4}}+\delta_{h.o.t}(\rho(x^{c,i},x^{c,j})) \tag{79}
\end{align}

Similar to the assumption that led to (74), we get:

\begin{equation}
\begin{split}
\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c,j})}\right] &\approx 1-\frac{T(c)}{2\sigma_w^2}-\frac{T(c)^2}{16\sigma_w^4}+\Delta_{h.o.t}^{(2)}(c).
\end{split}
\tag{80}
\end{equation}

$\bullet$ Case $c\neq c'$:

A similar analysis as above applies in this case:

\begin{align}
\rho(x^{c,i},x^{c',j}) &= \sqrt{1+\frac{1}{2\sigma_w^2(x^{c,i})^2}}\sqrt{1+\frac{1}{2\sigma_w^2(x^{c',j})^2}} \tag{81}\\
&= \left(1+\frac{1}{4\sigma_w^2(x^{c,i})^2}+h.o.t\right)\left(1+\frac{1}{4\sigma_w^2(x^{c',j})^2}+h.o.t\right) \tag{82}\\
\implies \mathbb{E}\left[\rho(x^{c,i},x^{c',j})\right] &= \mathbb{E}\left[1+\frac{1}{4\sigma_w^2(x^{c,i})^2}+h.o.t\right]\mathbb{E}\left[1+\frac{1}{4\sigma_w^2(x^{c',j})^2}+h.o.t\right] \tag{83}
\end{align}

Observe that the inner terms in the expectations are similar to the above case. To this end, we approximate $\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c',j})}\right]$ as:

\begin{align}
\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c',j})}\right] &\approx \frac{1}{\left(1+\frac{T(c)}{4\sigma_w^2}\right)\left(1+\frac{T(c')}{4\sigma_w^2}\right)}+\delta_{h.o.t}(\rho(x^{c,i},x^{c',j})) \tag{84}\\
&= \frac{1}{1+\frac{T(c)+T(c')}{4\sigma_w^2}+\frac{T(c)T(c')}{16\sigma_w^4}}+\delta_{h.o.t}(\rho(x^{c,i},x^{c',j})) \tag{85}
\end{align}

Similar to the assumption that led to (74), we get:

\begin{equation}
\begin{split}
\mathbb{E}\left[\frac{1}{\rho(x^{c,i},x^{c',j})}\right] &\approx 1-\frac{T(c)+T(c')}{4\sigma_w^2}-\frac{T(c)T(c')}{16\sigma_w^4}+\Delta^{(3)}_{h.o.t}(c,c').
\end{split}
\tag{86}
\end{equation}
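The two cases with distinct samples can be checked numerically in the same spirit. The sketch below evaluates the first-order expression in (86), which reduces to (80) when both classes coincide, and compares it with a Monte Carlo estimate based on the product form in (81). It assumes independent scalar samples $x^{c,i}\sim\mathcal{N}(\mu_c,\sigma_c^2)$; the helper names and the numeric settings are hypothetical.

```python
import numpy as np


def T(mu: float, sigma: float) -> float:
    """Bracketed expression in (73), assuming x ~ N(mu, sigma^2)."""
    m = mu**2 + sigma**2
    v = 2 * sigma**4 + 4 * sigma**2 * mu**2
    return 1.0 / m + v / m**3


def cross_first_order(T_c: float, T_cp: float, sigma_w: float) -> float:
    """Right-hand side of (86) without the h.o.t.; T_cp == T_c gives (80)."""
    return (1.0
            - (T_c + T_cp) / (4 * sigma_w**2)
            - T_c * T_cp / (16 * sigma_w**4))


def cross_monte_carlo(mu_c, sigma_c, mu_cp, sigma_cp, sigma_w,
                      n=200_000, seed=0) -> float:
    """Monte Carlo estimate of E[1 / rho(x_i, x_j)] for independent x_i, x_j."""
    rng = np.random.default_rng(seed)
    x_i = rng.normal(mu_c, sigma_c, size=n)
    x_j = rng.normal(mu_cp, sigma_cp, size=n)
    rho = (np.sqrt(1.0 + 1.0 / (2 * sigma_w**2 * x_i**2))
           * np.sqrt(1.0 + 1.0 / (2 * sigma_w**2 * x_j**2)))
    return float(np.mean(1.0 / rho))


# Illustrative two-class setting (values are assumptions of this sketch).
sigma_w = 1.0
class_1 = (5.0, 0.5)     # (mu_1, sigma_1)
class_2 = (-4.0, 0.5)    # (mu_2, sigma_2)

same = cross_first_order(T(*class_1), T(*class_1), sigma_w)
diff = cross_first_order(T(*class_1), T(*class_2), sigma_w)
print("c = c', i != j :", same, cross_monte_carlo(*class_1, *class_1, sigma_w))
print("c != c'        :", diff, cross_monte_carlo(*class_1, *class_2, sigma_w))
```

The residual between each pair of numbers is the part attributed to $\Delta_{h.o.t}^{(2)}(c)$ and $\Delta_{h.o.t}^{(3)}(c,c')$.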

Finally, based on (74), (80), and (86), we obtain the following result for $\mathbb{E}[Q_{GP-Erf}^{(1)}(x^{c,i},x^{c',j})]$:

\begin{equation}
\begin{split}
&\mathbb{E}\left[Q_{GP-Erf}^{(1)}(x^{c,i},x^{c',j})\right]\\
&\qquad\approx\begin{cases}
1-\frac{T(c)}{2\sigma_w^2}+\Delta_{h.o.t}^{(1)}(c) & \text{if } c=c',\ i=j\\
1-\frac{T(c)}{2\sigma_w^2}-\frac{T(c)^2}{16\sigma_w^4}+\Delta_{h.o.t}^{(2)}(c) & \text{if } c=c',\ i\neq j\\
1-\frac{T(c)+T(c')}{4\sigma_w^2}-\frac{T(c)T(c')}{16\sigma_w^4}+\Delta_{h.o.t}^{(3)}(c,c') & \text{if } c\neq c'
\end{cases}.
\end{split}
\tag{87}
\end{equation}

Here Δh.o.t(1)(c),Δh.o.t(2)(c),Δh.o.t(3)(c,c)superscriptsubscriptΔformulae-sequence𝑜𝑡1𝑐superscriptsubscriptΔformulae-sequence𝑜𝑡2𝑐superscriptsubscriptΔformulae-sequence𝑜𝑡3𝑐superscript𝑐\displaystyle\Delta_{h.o.t}^{(1)}(c),\Delta_{h.o.t}^{(2)}(c),\Delta_{h.o.t}^{(% 3)}(c,c^{\prime})roman_Δ start_POSTSUBSCRIPT italic_h . italic_o . italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ( italic_c ) , roman_Δ start_POSTSUBSCRIPT italic_h . italic_o . italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT ( italic_c ) , roman_Δ start_POSTSUBSCRIPT italic_h . italic_o . italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( 3 ) end_POSTSUPERSCRIPT ( italic_c , italic_c start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ) are the collective higher order terms that tend to 00\displaystyle 0 as |μc|subscript𝜇𝑐\displaystyle|\mu_{c}|| italic_μ start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT | increases relative to smaller values of σcsubscript𝜎𝑐\displaystyle\sigma_{c}italic_σ start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT. These cases can now be plugged into our generic formulation of expected values of a kernel function (i.e V(1)(c),V(2)(c),V(3)(c,c)superscript𝑉1𝑐superscript𝑉2𝑐superscript𝑉3𝑐superscript𝑐\displaystyle V^{(1)}(c),V^{(2)}(c),V^{(3)}(c,c^{\prime})italic_V start_POSTSUPERSCRIPT ( 1 ) end_POSTSUPERSCRIPT ( italic_c ) , italic_V start_POSTSUPERSCRIPT ( 2 ) end_POSTSUPERSCRIPT ( italic_c ) , italic_V start_POSTSUPERSCRIPT ( 3 ) end_POSTSUPERSCRIPT ( italic_c , italic_c start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT )) as per (27) in Appendix C. Thus, based on Lemma C.3 for sufficiently large {nc}subscript𝑛𝑐\displaystyle\{n_{c}\}{ italic_n start_POSTSUBSCRIPT italic_c end_POSTSUBSCRIPT } we get :

$$
\mathbb{E}[\mathcal{NC}_1(\mathbf{H})] = \frac{\sum_{c=1}^{2} \frac{n_c V^{(1)}(c)}{N} - \frac{V^{(2)}(c)}{2}}{\left[\sum_{c=1}^{2} \left(\frac{1}{2n_c^2} - \frac{1}{N^2}\right)\left(n_c^2 V^{(2)}(c)\right)\right] - \frac{2 n_1 n_2}{N^2} V^{(3)}(1,2)} + \Delta_{h.o.t} \tag{88}
$$
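As a rough illustration of how the leading-order ratio in (88) can be evaluated, the following sketch computes it for two classes while ignoring $\Delta_{h.o.t}$. The function name `nc1_leading_order` and the numerical values of $V^{(1)}, V^{(2)}, V^{(3)}$ are placeholders chosen for illustration only; they are not values from the paper.

```python
import math


def nc1_leading_order(n, v1, v2, v3_12):
    """Leading-order ratio of Eq. (88) for two classes (Delta_h.o.t dropped).

    n     : (n1, n2) class sizes
    v1    : (V1(1), V1(2)) expected kernel values, same-sample case
    v2    : (V2(1), V2(2)) expected kernel values, same-class case
    v3_12 : V3(1, 2) expected kernel value across the two classes
    """
    n1, n2 = n
    N = n1 + n2
    # Numerator: sum over classes of n_c V1(c) / N - V2(c) / 2.
    numerator = sum(n[c] * v1[c] / N - v2[c] / 2 for c in range(2))
    # Denominator: weighted V2 terms minus the cross-class V3 term.
    denominator = sum(
        (1 / (2 * n[c] ** 2) - 1 / N ** 2) * n[c] ** 2 * v2[c] for c in range(2)
    ) - 2 * n1 * n2 / N ** 2 * v3_12
    return numerator / denominator


# Placeholder inputs (illustrative assumptions only).
v1, v2, v3_12 = (1.0, 0.9), (0.6, 0.5), 0.2
ratio = nc1_leading_order((50, 50), v1, v2, v3_12)

# With n1 = n2 = N/2 the class weights become 1/2 (numerator) and 1/4 (denominator).
balanced = (sum(v1[c] - v2[c] for c in range(2)) / 2) / (
    (v2[0] + v2[1] - 2 * v3_12) / 4
)
print(ratio, balanced)
```

With equal class sizes, the general expression and the simplified balanced form agree up to floating-point error.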

$\bullet$ Numerator in the balanced class setting.

To better understand the result, let's consider the balanced class scenario with $n_1 = n_2 = N/2$, for which the numerator simplifies to:

\begin{align}
\sum_{c=1}^{2} \frac{n_c V^{(1)}(c)}{N} - \frac{V^{(2)}(c)}{2} &= \sum_{c=1}^{2} \frac{V^{(1)}(c) - V^{(2)}(c)}{2} \tag{89} \\
&= \sum_{c=1}^{2} \frac{\frac{T(c)^2}{16\sigma_w^4} + \Delta^{(1)}_{h.o.t}(c) - \Delta^{(2)}_{h.o.t}(c)}{2}. \tag{90}
\end{align}

Ignoring the effects of the higher order terms, observe that the numerator depends primarily on $T(c)^2$, which by Lemma C.4 is given as:

$$
T(c)^2 = \left[\frac{1}{(\mu_c^2 + \sigma_c^2)} + \frac{2\sigma_c^4 + 4\sigma_c^2\mu_c^2}{(\mu_c^2 + \sigma_c^2)^3}\right]^2 \tag{91}
$$

This showcases the role of $\mu_c, \sigma_c$ in determining the extent of collapse. For sufficiently large $|\mu_c| \gg \sigma_c$, we can approximate this value as:

$$
T(c)^2 \approx \left[\frac{1}{\mu_c^2} + \frac{4\sigma_c^2}{\mu_c^4}\right]^2 = \frac{1}{\mu_c^4}\left[1 + \frac{4\sigma_c^2}{\mu_c^2}\right]^2 = \frac{1}{\mu_c^4}\left[1 + \frac{8\sigma_c^2}{\mu_c^2} + \frac{16\sigma_c^4}{\mu_c^4}\right] \tag{92}
$$
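A quick numerical sanity check of (91) against (92) can be sketched as follows. The helper names `t_exact`, `t_approx`, and `rel_gap`, as well as the chosen values of $\mu_c$ and $\sigma_c$, are assumptions made for illustration.

```python
def t_exact(mu, sigma):
    """T(c) as it appears inside the square brackets of Eq. (91)."""
    s = mu ** 2 + sigma ** 2
    return 1 / s + (2 * sigma ** 4 + 4 * sigma ** 2 * mu ** 2) / s ** 3


def t_approx(mu, sigma):
    """Large-|mu|/sigma form of T(c) used inside Eq. (92)."""
    return 1 / mu ** 2 + 4 * sigma ** 2 / mu ** 4


def rel_gap(mu, sigma):
    """Relative gap between the approximate and exact values of T(c)^2."""
    exact = t_exact(mu, sigma) ** 2
    return abs(t_approx(mu, sigma) ** 2 - exact) / exact


# Fix sigma and increase |mu|: the gap should shrink as |mu| / sigma grows.
sigma = 0.5
gaps = [rel_gap(mu, sigma) for mu in (1.0, 2.5, 10.0)]
print(gaps)
```

In this sketch the approximation is poor when $|\mu_c|$ is only a small multiple of $\sigma_c$ and becomes tight once the ratio is large, which matches the regime in which (92) is stated.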

$\bullet$ Denominator in the balanced class setting.

Similar to the numerator analysis, observe that when $n_1 = n_2 = N/2$, the denominator can be given as:

\begin{align}
&\left[\sum_{c=1}^{2} \left(\frac{1}{2n_c^2} - \frac{1}{N^2}\right)\left(n_c^2 V^{(2)}(c)\right)\right] - \frac{2 n_1 n_2}{N^2} V^{(3)}(1,2) \tag{93} \\
&\quad= \frac{V^{(2)}(1) + V^{(2)}(2) - 2V^{(3)}(1,2)}{4} \tag{94} \\
&\quad= \frac{-\frac{T(1)}{2\sigma_w^2} - \frac{T(1)^2}{16\sigma_w^4} + \Delta_{h.o.t}^{(2)}(1) - \frac{T(2)}{2\sigma_w^2} - \frac{T(2)^2}{16\sigma_w^4} + \Delta_{h.o.t}^{(2)}(2)}{4} \tag{95} \\
&\qquad+ \frac{2\frac{T(1) + T(2)}{4\sigma_w^2} + 2\frac{T(1)T(2)}{16\sigma_w^4} - 2\Delta_{h.o.t}^{(3)}(1,2)}{4} \tag{96} \\
&\quad= \frac{-\frac{T(1)^2}{16\sigma_w^4} - \frac{T(2)^2}{16\sigma_w^4} + 2\frac{T(1)T(2)}{16\sigma_w^4} + \Delta_{h.o.t}^{(2)}(1) + \Delta_{h.o.t}^{(2)}(2) - 2\Delta_{h.o.t}^{(3)}(1,2)}{4} \tag{97} \\
&\quad= \frac{-\left(\frac{T(1) - T(2)}{4\sigma_w^2}\right)^2 + \Delta_{h.o.t}^{(2)}(1) + \Delta_{h.o.t}^{(2)}(2) - 2\Delta_{h.o.t}^{(3)}(1,2)}{4}. \tag{98}
\end{align}
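The cancellation of the terms linear in $T(c)$ between (95) and (96) can be checked with exact rational arithmetic. The sketch below assumes the expansions of $V^{(2)}(c)$ and $V^{(3)}(1,2)$ as they can be read off from (95) and (96); the function names and the test values (including the stand-ins for the $\Delta_{h.o.t}$ terms) are hypothetical.

```python
from fractions import Fraction as F


def v2(T, sw2, d2):
    """V2(c) as read off from Eq. (95); sw2 = sigma_w^2, d2 = Delta^(2)(c)."""
    return -T / (2 * sw2) - T ** 2 / (16 * sw2 ** 2) + d2


def v3(T1, T2, sw2, d3):
    """V3(1, 2) as read off from Eq. (96); d3 = Delta^(3)(1, 2)."""
    return -(T1 + T2) / (4 * sw2) - T1 * T2 / (16 * sw2 ** 2) + d3


def denominator_eq93(n1, n2, v2_1, v2_2, v3_12):
    """Denominator of the NC1 ratio in its general form, Eq. (93)."""
    N = n1 + n2

    def weight(n):
        return (F(1, 2 * n ** 2) - F(1, N ** 2)) * n ** 2

    return weight(n1) * v2_1 + weight(n2) * v2_2 - F(2 * n1 * n2, N ** 2) * v3_12


def denominator_eq98(T1, T2, sw2, d2_1, d2_2, d3):
    """Simplified balanced-class denominator, Eq. (98)."""
    return (-((T1 - T2) / (4 * sw2)) ** 2 + d2_1 + d2_2 - 2 * d3) / 4


# Arbitrary rational test values (illustrative assumptions only).
T1, T2, sw2 = F(3, 7), F(1, 5), F(2)
d2_1, d2_2, d3 = F(1, 100), F(1, 200), F(1, 300)

lhs = denominator_eq93(10, 10, v2(T1, sw2, d2_1), v2(T2, sw2, d2_2), v3(T1, T2, sw2, d3))
rhs = denominator_eq98(T1, T2, sw2, d2_1, d2_2, d3)
print(lhs == rhs)
```

Using `fractions.Fraction` avoids floating-point round-off, so the two sides can be compared with exact equality.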

Observe that the term $T(1) - T(2)$ expands to:

$$
T(1) - T(2) = \left[\frac{1}{(\mu_1^2 + \sigma_1^2)} + \frac{2\sigma_1^4 + 4\sigma_1^2\mu_1^2}{(\mu_1^2 + \sigma_1^2)^3}\right] - \left[\frac{1}{(\mu_2^2 + \sigma_2^2)} + \frac{2\sigma_2^4 + 4\sigma_2^2\mu_2^2}{(\mu_2^2 + \sigma_2^2)^3}\right] \tag{99}
$$

and, for sufficiently large $|\mu_c| \gg \sigma_c$, is approximately:

$$
T(1) - T(2) \approx \frac{1}{\mu_1^2} + \frac{4\sigma_1^2}{\mu_1^4} - \frac{1}{\mu_2^2} - \frac{4\sigma_2^2}{\mu_2^4}. \tag{100}
$$

E.2 NC1 of Limiting NTK with Erf activation

Recall from (13) that the recursive relationship between the NTK and NNGP can be given as follows:

$$
\Theta_{NTK-Erf}^{(2)}(x^{c,i}, x^{c',j}) = K_{GP-Erf}^{(2)}(x^{c,i}, x^{c',j}) + K_{GP}^{(1)}(x^{c,i}, x^{c',j})\, \dot{Q}^{(1)}_{GP-Erf}(x^{c,i}, x^{c',j}), \tag{101}
$$

where:

\begin{align}
K_{GP-Erf}^{(2)}(x^{c,i}, x^{c',j}) &= \sigma_b^2 + \sigma_w^2 Q_{GP-Erf}^{(1)}(x^{c,i}, x^{c',j}) \tag{102} \\
Q_{GP-Erf}^{(1)}(x^{c,i}, x^{c',j}) &= \frac{2}{\pi} \arcsin\left(\frac{2K_{GP}^{(1)}(x^{c,i}, x^{c',j})}{\sqrt{1 + 2K_{GP}^{(1)}(x^{c,i}, x^{c,i})}\sqrt{1 + 2K_{GP}^{(1)}(x^{c',j}, x^{c',j})}}\right) \tag{103} \\
\dot{Q}_{GP-Erf}^{(1)}(x^{c,i}, x^{c',j}) &= \frac{4}{\pi}\left[\left(1 + 2K_{GP}^{(1)}(x^{c,i}, x^{c,i})\right)\left(1 + 2K_{GP}^{(1)}(x^{c',j}, x^{c',j})\right) - \left(2K_{GP}^{(1)}(x^{c,i}, x^{c',j})\right)^2\right]^{-1/2} \tag{104}
\end{align}

Considering $\sigma_b \to 0$, $d_0 = 1$ (as per the setting and assumptions), we get:

\begin{align}
K_{GP}^{(1)}(\mathbf{x}^{c,i}, \mathbf{x}^{c',j}) &= \sigma_w^2 x^{c,i} x^{c',j} \tag{105} \\
Q_{GP-Erf}^{(1)}(x^{c,i}, x^{c',j}) &= \frac{2}{\pi} \arcsin\left(\frac{2\sigma_w^2 x^{c,i} x^{c',j}}{\sqrt{1 + 2\sigma_w^2 (x^{c,i})^2}\sqrt{1 + 2\sigma_w^2 (x^{c',j})^2}}\right). \tag{106} \\
\dot{Q}_{GP-Erf}^{(1)}(x^{c,i}, x^{c',j}) &= \frac{4}{\pi}\left(\left(1 + 2\sigma_w^2 x^{c,i} x^{c,i}\right)\left(1 + 2\sigma_w^2 x^{c',j} x^{c',j}\right) - \left(2\sigma_w^2 x^{c,i} x^{c',j}\right)^2\right)^{-1/2} \notag \\
&= \frac{4}{\pi\sqrt{1 + 2\sigma_w^2 \cdot (x^{c,i})^2 + 2\sigma_w^2 \cdot (x^{c',j})^2}}. \tag{107}
\end{align}
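The scalar-input kernels in (105)-(107), and their assembly into the NTK via (101)-(102) with $\sigma_b \to 0$, can be sketched in a few lines. This is a minimal illustration under the stated setting ($d_0 = 1$); the function names and the sample inputs are assumptions, not part of the paper.

```python
import math


def k_gp1(x, y, sigma_w):
    """First-layer NNGP kernel for scalar inputs, Eq. (105)."""
    return sigma_w ** 2 * x * y


def q_erf(x, y, sigma_w):
    """Erf arcsine term, Eq. (103) evaluated with Eq. (105), i.e. Eq. (106)."""
    kxy = k_gp1(x, y, sigma_w)
    kxx = k_gp1(x, x, sigma_w)
    kyy = k_gp1(y, y, sigma_w)
    return (2 / math.pi) * math.asin(
        2 * kxy / (math.sqrt(1 + 2 * kxx) * math.sqrt(1 + 2 * kyy))
    )


def q_dot_general(x, y, sigma_w):
    """Derivative term in its general form, Eq. (104)."""
    kxy = k_gp1(x, y, sigma_w)
    kxx = k_gp1(x, x, sigma_w)
    kyy = k_gp1(y, y, sigma_w)
    return (4 / math.pi) * ((1 + 2 * kxx) * (1 + 2 * kyy) - (2 * kxy) ** 2) ** -0.5


def q_dot_closed(x, y, sigma_w):
    """Simplified derivative term after the cross terms cancel, Eq. (107)."""
    return 4 / (math.pi * math.sqrt(1 + 2 * sigma_w ** 2 * (x ** 2 + y ** 2)))


def ntk_erf(x, y, sigma_w):
    """Two-layer Erf NTK from Eqs. (101)-(102), taking sigma_b -> 0."""
    return sigma_w ** 2 * q_erf(x, y, sigma_w) + k_gp1(x, y, sigma_w) * q_dot_closed(
        x, y, sigma_w
    )


# Sample scalar inputs (illustrative assumptions only).
pairs = [(0.3, -1.2), (2.0, 2.5), (-4.0, 0.7)]
for x, y in pairs:
    print(x, y, q_dot_general(x, y, 1.5), q_dot_closed(x, y, 1.5), ntk_erf(x, y, 1.5))
```

The general and closed forms of $\dot{Q}$ should agree to floating-point precision, since the quartic terms $4\sigma_w^4 (x^{c,i})^2 (x^{c',j})^2$ cancel exactly in (107).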

This gives us:

\begin{align}
K_{GP}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})\,\dot{Q}_{GP-Erf}^{(1)}(x^{c,i},x^{c',j}) &= \frac{4\sigma_w^2 x^{c,i}x^{c',j}}{\pi\sqrt{1+2\sigma_w^2\cdot(x^{c,i})^2+2\sigma_w^2\cdot(x^{c',j})^2}} \tag{108}\\
&= \frac{4\sigma_w^2 x^{c,i}x^{c',j}}{\pi\sigma_w|x^{c,i}||x^{c',j}|\sqrt{\frac{1}{\sigma_w^2(x^{c,i})^2(x^{c',j})^2}+\frac{2}{(x^{c',j})^2}+\frac{2}{(x^{c,i})^2}}} \tag{109}\\
&= \frac{4\sigma_w\operatorname{sign}(x^{c,i})\operatorname{sign}(x^{c',j})}{\pi\sqrt{\frac{1}{\sigma_w^2(x^{c,i})^2(x^{c',j})^2}+\frac{2}{(x^{c',j})^2}+\frac{2}{(x^{c,i})^2}}} \tag{110}
\end{align}

For notational simplicity, consider:

\begin{align}
\kappa(x^{c,i},x^{c',j}) = \sqrt{\frac{1}{\sigma_w^2(x^{c,i})^2(x^{c',j})^2}+\frac{2}{(x^{c',j})^2}+\frac{2}{(x^{c,i})^2}} \tag{111}
\end{align}

which simplifies the kernel formulation to:

\begin{align}
\Theta_{NTK-Erf}^{(2)}(x^{c,i},x^{c',j}) = \sigma_w^2 Q_{GP-Erf}^{(1)}(x^{c,i},x^{c',j}) + \frac{4\sigma_w\operatorname{sign}(x^{c,i})\operatorname{sign}(x^{c',j})}{\pi\,\kappa(x^{c,i},x^{c',j})} \tag{112}
\end{align}
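The rewriting from Eq. (108) to the $\kappa$ form in Eq. (112) is pure algebra, so it can be checked numerically. The following is a minimal sketch for scalar inputs; the function names, the grid of test points, and the value of `sigma_w` are illustrative assumptions and do not come from the paper.

```python
import math

def product_term_direct(a, b, sigma_w):
    # Right-hand side of Eq. (108) for scalar inputs a = x^{c,i}, b = x^{c',j}.
    denom = math.pi * math.sqrt(1 + 2 * sigma_w**2 * a**2 + 2 * sigma_w**2 * b**2)
    return 4 * sigma_w**2 * a * b / denom

def kappa(a, b, sigma_w):
    # kappa from Eq. (111); only defined for a != 0 and b != 0.
    return math.sqrt(1 / (sigma_w**2 * a**2 * b**2) + 2 / b**2 + 2 / a**2)

def product_term_via_kappa(a, b, sigma_w):
    # Second term of Eq. (112): 4 * sigma_w * sign(a) * sign(b) / (pi * kappa).
    sign = math.copysign(1.0, a) * math.copysign(1.0, b)
    return 4 * sigma_w * sign / (math.pi * kappa(a, b, sigma_w))

# Hypothetical non-zero test points and weight scale.
points = [-3.0, -0.5, 0.7, 2.0]
sigma_w = 1.3

max_abs_diff = max(
    abs(product_term_direct(a, b, sigma_w) - product_term_via_kappa(a, b, sigma_w))
    for a in points
    for b in points
)
print(max_abs_diff)
```

The two forms agree up to floating-point rounding on this grid. The $\kappa$ form requires both inputs to be non-zero, whereas Eq. (108) is also defined at zero.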

E.2.1 Calculating $\mathbb{E}\left[\Theta_{NTK-Erf}^{(2)}(x^{c,i},x^{c',j})\right]$

Similar to the NNGP analysis, we break down the calculation of $\mathbb{E}\left[\kappa(x^{c,i},x^{c',j})\right]$ into three cases.

• Case $c = c', i = j$:

\begin{align}
\kappa(x^{c,i},x^{c,i}) &= \sqrt{\frac{1}{\sigma_w^2(x^{c,i})^4}+\frac{4}{(x^{c,i})^2}} = \sqrt{1-\left(1-\frac{1}{\sigma_w^2(x^{c,i})^4}-\frac{4}{(x^{c,i})^2}\right)} \tag{113}\\
&= 1-\frac{1}{2}\left(1-\frac{1}{\sigma_w^2(x^{c,i})^4}-\frac{4}{(x^{c,i})^2}\right)+\xi_{h.o.t} \tag{114}\\
&= \frac{1}{2}+\frac{1}{2\sigma_w^2(x^{c,i})^4}+\frac{2}{(x^{c,i})^2}+\xi_{h.o.t} \tag{115}
\end{align}

This gives us:

\begin{equation}
\begin{split}
\mathbb{E}\left[\kappa(x^{c,i},x^{c,i})\right] &= \frac{1}{2}+\mathbb{E}\left[\frac{1}{2\sigma_w^2(x^{c,i})^4}\right]+\mathbb{E}\left[\frac{2}{(x^{c,i})^2}\right]+\mathbb{E}[\xi_{h.o.t}]\\
&= \frac{1}{2}+\frac{1}{2\sigma_w^2}\left[\frac{1}{\mathbb{E}\left[(x^{c,i})^4\right]}+\frac{Var((x^{c,i})^4)}{\mathbb{E}\left[(x^{c,i})^4\right]^3}\right]+2\left[\frac{1}{\mathbb{E}\left[(x^{c,i})^2\right]}+\frac{Var((x^{c,i})^2)}{\mathbb{E}\left[(x^{c,i})^2\right]^3}\right]\\
&\quad+\mathbb{E}[\xi_{h.o.t}]
\end{split}
\tag{116}
\end{equation}

Based on the results from the moment-generating function, we know that:

\begin{align}
\mathbb{E}[(x^{c,i})^4] = 3\sigma_c^4+6\sigma_c^2\mu_c^2+\mu_c^4, \tag{117}
\end{align}

which can be used along with Lemma C.4 to obtain:

\begin{align}
\mathbb{E}\left[\kappa(x^{c,i},x^{c,i})\right] = \frac{1}{2}+2T(c)+\mathbb{E}[\xi_{h.o.t}] \tag{118}
\end{align}
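The fourth moment in Eq. (117) can be cross-checked without sampling. The sketch below assumes $x^{c,i} \sim \mathcal{N}(\mu_c, \sigma_c^2)$, which is what the form of Eq. (117) suggests. It writes $x = \mu_c + \sigma_c z$ with $z \sim \mathcal{N}(0,1)$ and expands $(\mu_c + \sigma_c z)^4$ with the binomial theorem. The helper names and the numeric values are hypothetical.

```python
import math

def standard_normal_moment(k):
    # E[z^k] for z ~ N(0, 1): zero for odd k, (k - 1)!! for even k.
    if k % 2 == 1:
        return 0
    result = 1
    for m in range(k - 1, 0, -2):
        result *= m
    return result

def gaussian_raw_moment(order, mu, sigma):
    # E[(mu + sigma * z)^order] via the binomial expansion.
    return sum(
        math.comb(order, k) * mu ** (order - k) * sigma**k * standard_normal_moment(k)
        for k in range(order + 1)
    )

def fourth_moment_eq_117(mu, sigma):
    # Closed form stated in Eq. (117).
    return 3 * sigma**4 + 6 * sigma**2 * mu**2 + mu**4

# Hypothetical class statistics.
mu_c, sigma_c = 2.0, 0.5
lhs = gaussian_raw_moment(4, mu_c, sigma_c)
rhs = fourth_moment_eq_117(mu_c, sigma_c)
print(lhs, rhs)
```

Only the $k = 0, 2, 4$ terms survive because odd moments of $z$ vanish, which is where the coefficients $1$, $6$, and $3$ in Eq. (117) come from.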

For notational simplicity, we define a helper function as follows:

\begin{align}
S(\mu_c,\sigma_c) = -\frac{1}{2}+2T(c)+\mathbb{E}[\xi_{h.o.t}], \tag{119}
\end{align}

which gives us:

\begin{align}
\mathbb{E}\left[\kappa(x^{c,i},x^{c,i})\right] = 1+S(\mu_c,\sigma_c) \tag{120}
\end{align}

Finally, the value of $\mathbb{E}\left[\frac{1}{\kappa(x^{c,i},x^{c,i})}\right]$ can be given as:

\begin{align}
\mathbb{E}\left[\frac{1}{\kappa(x^{c,i},x^{c,i})}\right] &= \frac{1}{\mathbb{E}\left[\kappa(x^{c,i},x^{c,i})\right]}+\frac{Var(\kappa(x^{c,i},x^{c,i}))}{\mathbb{E}\left[\kappa(x^{c,i},x^{c,i})\right]^3} \tag{121}\\
&= \frac{1}{1+S(\mu_c,\sigma_c)}+\delta_{h.o.t}(\kappa(x^{c,i},x^{c,i})) \tag{122}
\end{align}
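Eq. (121) has the form of the standard second-order Taylor approximation $\mathbb{E}[1/X] \approx 1/\mathbb{E}[X] + Var(X)/\mathbb{E}[X]^3$. To show how this correction behaves, the sketch below applies it to a stand-in variable, $X \sim \mathrm{Uniform}(4, 6)$, for which $\mathbb{E}[1/X]$ has a closed form. The uniform variable is an assumption chosen purely for illustration; it is not the distribution of $\kappa$ in the paper.

```python
import math

def inverse_mean_second_order(mean, var):
    # Second-order approximation: E[1/X] ~ 1/E[X] + Var(X)/E[X]^3.
    return 1 / mean + var / mean**3

# Hypothetical stand-in: X ~ Uniform(lo, hi) with 0 < lo < hi.
lo, hi = 4.0, 6.0
mean = (lo + hi) / 2
var = (hi - lo) ** 2 / 12

# Closed form of E[1/X] for the uniform distribution.
exact = math.log(hi / lo) / (hi - lo)
approx = inverse_mean_second_order(mean, var)

first_order_error = abs(exact - 1 / mean)
second_order_error = abs(exact - approx)
print(first_order_error, second_order_error)
```

In this example the variance term removes most of the gap left by the plug-in estimate $1/\mathbb{E}[X]$. How well the approximation works for $\kappa$ itself depends on its distribution, and this sketch does not assess that.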

Notice that even in this simple case, the expressions are non-trivial to expand fully. Nonetheless, along with Assumption 1, we consider large enough $|\mu_1|, |\mu_2|$ such that:

\begin{align}
S(\mu_c,\sigma_c) < 1. \tag{123}
\end{align}

Thus, based on the expansion $(1+u)^{-1} = 1-u+u^2-u^3+\cdots$ (valid for $|u| < 1$), we obtain the following cleaner approximation:

\begin{align}
\mathbb{E}\left[\frac{1}{\kappa(x^{c,i},x^{c,i})}\right] = 1-S(\mu_c,\sigma_c)+\widetilde{\delta}_{h.o.t}(\kappa(x^{c,i},x^{c,i})) \tag{124}
\end{align}
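The step from $\frac{1}{1+S}$ to $1 - S$ keeps only the first two terms of the geometric series. The terms it drops sum to $\frac{S^2}{1+S}$, which the higher-order term $\widetilde{\delta}_{h.o.t}$ has to absorb. The sketch below evaluates this remainder for a few values of $S$; the function names and the chosen values are hypothetical.

```python
def inverse_exact(s):
    # Exact value of 1 / (1 + s); requires s != -1.
    return 1.0 / (1.0 + s)

def inverse_first_order(s):
    # First-order truncation of the geometric series, as used in Eq. (124).
    return 1.0 - s

def truncation_remainder(s):
    # Everything the truncation drops: 1/(1+s) - (1 - s) = s^2 / (1 + s).
    return inverse_exact(s) - inverse_first_order(s)

# Hypothetical values of S inside the convergence region |S| < 1.
s_values = [-0.5, -0.1, 0.1, 0.5]
remainders = {s: truncation_remainder(s) for s in s_values}
for s, r in remainders.items():
    print(f"S = {s:+.1f}: remainder = {r:.6f}")
```

The remainder shrinks quadratically as $|S| \to 0$ and grows as $S$ approaches $-1$.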

• Case $c = c', i \neq j$:

\begin{align}
\kappa(x^{c,i},x^{c,j}) &= \sqrt{\frac{1}{\sigma_w^2(x^{c,i})^2(x^{c,j})^2}+\frac{2}{(x^{c,j})^2}+\frac{2}{(x^{c,i})^2}} \tag{125}\\
&= \sqrt{1-\left(1-\frac{1}{\sigma_w^2(x^{c,i})^2(x^{c,j})^2}-\frac{2}{(x^{c,j})^2}-\frac{2}{(x^{c,i})^2}\right)} \tag{126}\\
&= 1-\frac{1}{2}\left(1-\frac{1}{\sigma_w^2(x^{c,i})^2(x^{c,j})^2}-\frac{2}{(x^{c,j})^2}-\frac{2}{(x^{c,i})^2}\right)+\xi'_{h.o.t} \tag{127}\\
&= \frac{1}{2}+\frac{1}{2\sigma_w^2(x^{c,i})^2(x^{c,j})^2}+\frac{1}{(x^{c,j})^2}+\frac{1}{(x^{c,i})^2}+\xi'_{h.o.t} \tag{128}
\end{align}

Thus, based on Lemma C.4, we get:

\begin{align}
\mathbb{E}\left[\kappa(x^{c,i},x^{c,j})\right] &= \frac{1}{2}+\mathbb{E}\left[\frac{1}{2\sigma_w^2(x^{c,i})^2(x^{c,j})^2}\right]+\mathbb{E}\left[\frac{1}{(x^{c,j})^2}\right]+\mathbb{E}\left[\frac{1}{(x^{c,i})^2}\right]+\mathbb{E}[\xi'_{h.o.t}] \tag{129}\\
&= \frac{1}{2}+\frac{1}{2\sigma_w^2}\mathbb{E}\left[\frac{1}{(x^{c,i})^2}\right]\mathbb{E}\left[\frac{1}{(x^{c,j})^2}\right]+\mathbb{E}\left[\frac{1}{(x^{c,j})^2}\right]+\mathbb{E}\left[\frac{1}{(x^{c,i})^2}\right]+\mathbb{E}[\xi'_{h.o.t}] \tag{130}\\
&= \frac{1}{2}+\frac{T(c)^2}{2\sigma_w^2}+2T(c)+\mathbb{E}[\xi'_{h.o.t}] \tag{131}
\end{align}

This leads to:

\begin{align}
\mathbb{E}\left[\frac{1}{\kappa(x^{c,i},x^{c,j})}\right] &= \mathbb{E}\left[\frac{1}{1+\left(-\frac{1}{2}+\frac{T(c)^{2}}{2\sigma_{w}^{2}}+2T(c)+\mathbb{E}[\xi^{\prime}_{h.o.t}]\right)}\right] \tag{132}\\
&= 1-\left(-\frac{1}{2}+\frac{T(c)^{2}}{2\sigma_{w}^{2}}+2T(c)+\mathbb{E}[\xi^{\prime}_{h.o.t}]\right)+\delta_{h.o.t}^{\prime}(\kappa(x^{c,i},x^{c,j})) \tag{133}\\
&= \frac{3}{2}-\frac{T(c)^{2}}{2\sigma_{w}^{2}}-2T(c)+\widetilde{\delta}_{h.o.t}(\kappa(x^{c,i},x^{c,j})) \tag{134}
\end{align}

$\bullet$ Case $c\neq c^{\prime}$:

\begin{align}
\kappa(x^{c,i},x^{c^{\prime},j}) &= \sqrt{\frac{1}{\sigma_{w}^{2}(x^{c,i})^{2}(x^{c^{\prime},j})^{2}}+\frac{2}{(x^{c^{\prime},j})^{2}}+\frac{2}{(x^{c,i})^{2}}} \tag{135}\\
&= \frac{1}{2}+\frac{1}{2\sigma_{w}^{2}(x^{c,i})^{2}(x^{c^{\prime},j})^{2}}+\frac{1}{(x^{c^{\prime},j})^{2}}+\frac{1}{(x^{c,i})^{2}}+\xi^{\prime\prime}_{h.o.t} \tag{136}
\end{align}

Thus, based on Lemma C.4, we get:

\begin{align}
\mathbb{E}\left[\kappa(x^{c,i},x^{c^{\prime},j})\right] &= \frac{1}{2}+\mathbb{E}\left[\frac{1}{2\sigma_{w}^{2}(x^{c,i})^{2}(x^{c^{\prime},j})^{2}}\right]+\mathbb{E}\left[\frac{1}{(x^{c^{\prime},j})^{2}}\right]+\mathbb{E}\left[\frac{1}{(x^{c,i})^{2}}\right]+\mathbb{E}[\xi^{\prime\prime}_{h.o.t}] \tag{137}\\
&= \frac{1}{2}+\frac{1}{2\sigma_{w}^{2}}\mathbb{E}\left[\frac{1}{(x^{c,i})^{2}}\right]\mathbb{E}\left[\frac{1}{(x^{c^{\prime},j})^{2}}\right]+\mathbb{E}\left[\frac{1}{(x^{c^{\prime},j})^{2}}\right]+\mathbb{E}\left[\frac{1}{(x^{c,i})^{2}}\right]+\mathbb{E}[\xi^{\prime\prime}_{h.o.t}] \tag{138}\\
&= \frac{1}{2}+\frac{T(c)T(c^{\prime})}{2\sigma_{w}^{2}}+T(c^{\prime})+T(c)+\mathbb{E}[\xi^{\prime\prime}_{h.o.t}]. \tag{139}
\end{align}

This gives us:

\begin{align}
\mathbb{E}\left[\frac{1}{\kappa(x^{c,i},x^{c^{\prime},j})}\right] &= \mathbb{E}\left[\frac{1}{1+\left(-\frac{1}{2}+\frac{T(c)T(c^{\prime})}{2\sigma_{w}^{2}}+T(c^{\prime})+T(c)+\mathbb{E}[\xi^{\prime\prime}_{h.o.t}]\right)}\right] \tag{140}\\
&= 1-\left(-\frac{1}{2}+\frac{T(c)T(c^{\prime})}{2\sigma_{w}^{2}}+T(c^{\prime})+T(c)+\mathbb{E}[\xi^{\prime\prime}_{h.o.t}]\right)+\delta_{h.o.t}^{\prime}(\kappa(x^{c,i},x^{c^{\prime},j})) \tag{141}\\
&= \frac{3}{2}-\frac{T(c)T(c^{\prime})}{2\sigma_{w}^{2}}-T(c)-T(c^{\prime})+\widetilde{\delta}_{h.o.t}(\kappa(x^{c,i},x^{c^{\prime},j})) \tag{142}
\end{align}

Finally, the cases for the expected value of the kernel can be given as:

\begin{equation}
\mathbb{E}\left[\Theta_{NTK-Erf}^{(2)}(x^{c,i},x^{c^{\prime},j})\right] = \begin{cases}
\mathbb{E}\left[\sigma_{w}^{2}Q_{GP-Erf}^{(1)}(x^{c,i},x^{c,j})\right]+\mathbb{E}\left[\frac{4\sigma_{w}}{\pi\kappa(x^{c,i},x^{c,j})}\right] & c=c^{\prime}\\
\mathbb{E}\left[\sigma_{w}^{2}Q_{GP-Erf}^{(1)}(x^{c,i},x^{c^{\prime},j})\right]-\mathbb{E}\left[\frac{4\sigma_{w}}{\pi\kappa(x^{c,i},x^{c^{\prime},j})}\right] & c\neq c^{\prime}
\end{cases}, \tag{143}
\end{equation}

From (87), we know that:

\begin{equation}
\begin{split}
&\mathbb{E}\left[Q_{GP-Erf}^{(1)}(x^{c,i},x^{c^{\prime},j})\right]\\
&\hspace{40pt}\approx\begin{cases}
1-\frac{T(c)}{2\sigma_{w}^{2}}+\Delta_{h.o.t}^{(1)}(c) & \text{if } c=c^{\prime},\ i=j\\
1-\frac{T(c)}{2\sigma_{w}^{2}}-\frac{T(c)^{2}}{16\sigma_{w}^{4}}+\Delta_{h.o.t}^{(2)}(c) & \text{if } c=c^{\prime},\ i\neq j\\
1-\frac{T(c)+T(c^{\prime})}{4\sigma_{w}^{2}}-\frac{T(c)T(c^{\prime})}{16\sigma_{w}^{4}}+\Delta_{h.o.t}^{(3)}(c,c^{\prime}) & \text{if } c\neq c^{\prime}
\end{cases}.
\end{split}
\tag{144}
\end{equation}

To simplify the presentation, we can ignore the higher-order terms and obtain:

\begin{align}
&\mathbb{E}\left[\Theta_{NTK-Erf}^{(2)}(x^{c,i},x^{c^{\prime},j})\right] \tag{145}\\
&\hspace{10pt}\approx\begin{cases}
\sigma_{w}^{2}\left(1-\frac{T(c)}{2\sigma_{w}^{2}}\right)+\frac{4\sigma_{w}}{\pi}\left(\frac{3}{2}-2T(c)\right) & c=c^{\prime},\ i=j\\
\sigma_{w}^{2}\left(1-\frac{T(c)}{2\sigma_{w}^{2}}-\frac{T(c)^{2}}{16\sigma_{w}^{4}}\right)+\frac{4\sigma_{w}}{\pi}\left(\frac{3}{2}-\frac{T(c)^{2}}{2\sigma_{w}^{2}}-2T(c)\right) & c=c^{\prime},\ i\neq j\\
\sigma_{w}^{2}\left(1-\frac{T(c)+T(c^{\prime})}{4\sigma_{w}^{2}}-\frac{T(c)T(c^{\prime})}{16\sigma_{w}^{4}}\right)-\frac{4\sigma_{w}}{\pi}\left(\frac{3}{2}-\frac{T(c)T(c^{\prime})}{2\sigma_{w}^{2}}-T(c)-T(c^{\prime})\right) & c\neq c^{\prime}
\end{cases} \tag{146}
\end{align}

Observe that the order of the $T(c)$ terms involved here resembles that of the NNGP scenario in (87). Thus, we can draw similar conclusions regarding the role of the order of $\mu_{c},\sigma_{c}$ in determining the value of $\mathbb{E}\left[\mathcal{NC}_{1}(\mathbf{H})\right]$.
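As a quick numerical aid, the three leading-order cases in (146) can be evaluated directly. The sketch below is illustrative only: the function name and argument names are hypothetical, the higher-order terms are dropped exactly as in (146), and $T(c)$ is treated as a given scalar (the quantity $\mathbb{E}[1/(x^{c,i})^{2}]$ appearing in (130)-(131)).

```python
import math


def expected_ntk_erf(t_c, t_cp, sigma_w, same_class, same_sample=False):
    """Leading-order value of E[Theta_NTK-Erf^(2)] following (146).

    t_c, t_cp : values of T(c) and T(c') (assumed given as floats)
    sigma_w   : weight standard deviation sigma_w
    Higher-order terms are ignored, as in (146).
    """
    s2 = sigma_w ** 2
    scale = 4.0 * sigma_w / math.pi

    if same_class and same_sample:
        # Case c = c', i = j
        gp_term = 1.0 - t_c / (2.0 * s2)
        inv_kappa = 1.5 - 2.0 * t_c
        return s2 * gp_term + scale * inv_kappa

    if same_class:
        # Case c = c', i != j
        gp_term = 1.0 - t_c / (2.0 * s2) - t_c ** 2 / (16.0 * s2 ** 2)
        inv_kappa = 1.5 - t_c ** 2 / (2.0 * s2) - 2.0 * t_c
        return s2 * gp_term + scale * inv_kappa

    # Case c != c': the 1/kappa contribution enters with a minus sign
    gp_term = 1.0 - (t_c + t_cp) / (4.0 * s2) - t_c * t_cp / (16.0 * s2 ** 2)
    inv_kappa = 1.5 - t_c * t_cp / (2.0 * s2) - t_c - t_cp
    return s2 * gp_term - scale * inv_kappa
```

For instance, `expected_ntk_erf(0.01, 0.02, 1.0, same_class=False)` evaluates the cross-class case for small, arbitrarily chosen $T(c)$, $T(c^{\prime})$; the values are placeholders, not taken from the experiments.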

Appendix F Activation Variability Relative to Data

In this section, we introduce a relative measure of activation variability collapse with respect to the data. We begin by defining the within-class and between-class covariance matrices $\boldsymbol{\Sigma}_{W}(\mathbf{X}),\boldsymbol{\Sigma}_{B}(\mathbf{X})\in\mathbb{R}^{d_{0}\times d_{0}}$ of the data samples as:

\begin{equation}
\boldsymbol{\Sigma}_{W}(\mathbf{X}) = \frac{1}{N}\sum_{c=1}^{C}\sum_{i=1}^{n_{c}}\left(\mathbf{x}^{c,i}-\overline{\mathbf{x}}^{c}\right)\left(\mathbf{x}^{c,i}-\overline{\mathbf{x}}^{c}\right)^{\top};\hspace{10pt}\boldsymbol{\Sigma}_{B}(\mathbf{X}) = \frac{1}{C}\sum_{c=1}^{C}\left(\overline{\mathbf{x}}^{c}-\overline{\mathbf{x}}^{G}\right)\left(\overline{\mathbf{x}}^{c}-\overline{\mathbf{x}}^{G}\right)^{\top}, \tag{147}
\end{equation}

where $\overline{\mathbf{x}}^{c}=\frac{1}{n_{c}}\sum\nolimits_{i=1}^{n_{c}}\mathbf{x}^{c,i},\ \forall c\in[C]$ and $\overline{\mathbf{x}}^{G}=\frac{1}{N}\sum\nolimits_{c=1}^{C}\sum\nolimits_{i=1}^{n_{c}}\mathbf{x}^{c,i}$ represent the data class mean vectors and the data global mean vector, respectively.

Definition F.1.

Set a small $\tau>0$. The variability collapse relative to the data is given by:

\begin{equation}
\mathcal{NC}_{1}(\mathbf{H}|\mathbf{X}) := \frac{\mathcal{NC}_{1}(\mathbf{H})}{\mathcal{NC}_{1}(\mathbf{X})+\tau},\hspace{10pt}\textit{where } \mathcal{NC}_{1}(\mathbf{X}) := \frac{\mathrm{tr}(\boldsymbol{\Sigma}_{W}(\mathbf{X}))}{\mathrm{tr}(\boldsymbol{\Sigma}_{B}(\mathbf{X}))} \tag{148}
\end{equation}

The constant $\tau$ prevents numerical instabilities. Through this approach, we capture the extent of variability collapse of the activation features relative to the variability collapse of the data samples themselves.
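As a rough illustration of Definition F.1, the sketch below computes $\mathcal{NC}_{1}$ from a matrix of row-wise feature vectors and then forms the relative metric of equation 148. It assumes the normalizations $\boldsymbol{\Sigma}_{W}=\frac{1}{N}\sum_{c}\sum_{i}(\cdot)(\cdot)^{\top}$ and $\boldsymbol{\Sigma}_{B}=\frac{1}{C}\sum_{c}(\cdot)(\cdot)^{\top}$; the function names `nc1` and `relative_nc1` and the default value of `tau` are illustrative choices, not taken from the paper.

```python
import numpy as np


def nc1(features, labels):
    """Return tr(Sigma_W) / tr(Sigma_B) for row-wise feature vectors.

    Assumed normalization: Sigma_W is averaged over all N samples and
    Sigma_B over the C classes.
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    global_mean = features.mean(axis=0)

    within = 0.0   # sum of squared deviations from the class means
    between = 0.0  # sum of squared deviations of class means from the global mean
    for c in classes:
        class_feats = features[labels == c]
        class_mean = class_feats.mean(axis=0)
        within += np.sum((class_feats - class_mean) ** 2)
        between += np.sum((class_mean - global_mean) ** 2)

    tr_sigma_w = within / features.shape[0]
    tr_sigma_b = between / len(classes)
    return tr_sigma_w / tr_sigma_b


def relative_nc1(h, x, labels, tau=1e-8):
    """NC1(H | X) = NC1(H) / (NC1(X) + tau), following equation 148."""
    return nc1(h, labels) / (nc1(x, labels) + tau)
```

Because only traces are needed, the sketch sums squared deviations directly instead of building the covariance matrices, which keeps the cost linear in the feature dimension.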

Corollary F.2.

Under Assumptions 1-3 (as per Section 5.3), let $\phi(\cdot)$ be the ReLU activation, and let the limiting NNGP kernel be $Q_{GP-ReLU}^{(1)}(\mathbf{x}^{c,i},\mathbf{x}^{c',j})=\mathbf{h}^{c,i\top}\mathbf{h}^{c',j}$. Then:

\begin{equation}
\frac{\mathbb{E}\left[\mathcal{NC}_{1}(\mathbf{H})\right]}{\mathbb{E}\left[\mathcal{NC}_{1}(\mathbf{X})\right]} \approx 1-\frac{\frac{2}{N^{2}}\prod_{c=1}^{2}n_{c}\mu_{c}}{\left(\sum_{c=1}^{2}\frac{\mu_{c}^{2}}{2}-\frac{n_{c}^{2}\mu_{c}^{2}}{N^{2}}\right)}
\tag{149}
\end{equation}
Proof.

To keep the derivation similar to those for the kernel formulation in equation 45, we consider a simplified kernel on $\mathbf{X}$ (identity feature map):

\begin{equation}
K_{data}(x^{c,i},x^{c',j})=x^{c,i}x^{c',j}.
\tag{150}
\end{equation}

Additionally, since $\mathbf{x}^{c,i}$ are 1-d random variables, the expected value of the kernel is given by:

\begin{equation}
\mathbb{E}\left[K_{data}(x^{c,i},x^{c',j})\right]=\begin{cases}\sigma_{c}^{2}+\mu_{c}^{2} & \text{if } c=c',\ i=j\\ \mu_{c}^{2} & \text{if } c=c',\ i\neq j\\ \mu_{c}\mu_{c'} & \text{if } c\neq c'\end{cases}
\tag{151}
\end{equation}
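The three cases of the expected data kernel can be checked numerically. The following is a minimal Monte Carlo sketch, assuming independent 1-d Gaussian samples with class mean $\mu_{c}$ and standard deviation $\sigma_{c}$; the specific values of `mu`, `sigma`, and `num_draws` are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = {1: -2.0, 2: 3.0}     # illustrative class means
sigma = {1: 0.5, 2: 1.5}   # illustrative class standard deviations
num_draws = 200_000

# Independent draws play the roles of x^{1,i}, x^{1,j} (i != j) and x^{2,j}.
x1_i = rng.normal(mu[1], sigma[1], num_draws)
x1_j = rng.normal(mu[1], sigma[1], num_draws)
x2_j = rng.normal(mu[2], sigma[2], num_draws)

same_sample = np.mean(x1_i * x1_i)   # case c = c', i = j
same_class = np.mean(x1_i * x1_j)    # case c = c', i != j
cross_class = np.mean(x1_i * x2_j)   # case c != c'

# Each line prints the empirical average next to the closed-form value.
print(same_sample, sigma[1] ** 2 + mu[1] ** 2)
print(same_class, mu[1] ** 2)
print(cross_class, mu[1] * mu[2])
```

The empirical averages should agree with the closed-form values up to Monte Carlo error, which shrinks as `num_draws` grows.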

We use Lemma C.3 with cases $V^{(1)}(c)=\sigma_{c}^{2}+\mu_{c}^{2}$, $V^{(2)}(c)=\mu_{c}^{2}$ and $V^{(3)}(c,c')=\mu_{c}\mu_{c'}$ to obtain:

\begin{equation}
\mathbb{E}\left[\mathcal{NC}_{1}(\mathbf{X})\right]=\frac{\mathbb{E}\left[\mathrm{tr}(\Sigma_{W}(\mathbf{X}))\right]}{\mathbb{E}\left[\mathrm{tr}(\Sigma_{B}(\mathbf{X}))\right]}=\frac{\sum_{c=1}^{2}\frac{n_{c}\mu_{c}^{2}+n_{c}\sigma_{c}^{2}}{N}-\frac{n_{c}^{2}\mu_{c}^{2}+n_{c}\sigma_{c}^{2}}{2n_{c}^{2}}}{\left(\sum_{c=1}^{2}\frac{n_{c}^{2}\mu_{c}^{2}+n_{c}\sigma_{c}^{2}}{2n_{c}^{2}}-\frac{n_{c}^{2}\mu_{c}^{2}+n_{c}\sigma_{c}^{2}}{N^{2}}\right)-\frac{2}{N^{2}}\prod_{c=1}^{2}n_{c}\mu_{c}}+\Delta^{X}_{h.o.t}
\tag{152}
\end{equation}

Finally, the ratio $\frac{\mathbb{E}\left[\mathcal{NC}_{1}(\mathbf{H})\right]}{\mathbb{E}\left[\mathcal{NC}_{1}(\mathbf{X})\right]}$ for ReLU (Theorem 5.1) with large enough $n_{c}\gg 1$ is given by:

\begin{align}
\frac{\mathbb{E}\left[\mathcal{NC}_{1}(\mathbf{H})\right]}{\mathbb{E}\left[\mathcal{NC}_{1}(\mathbf{X})\right]} &= \frac{\sum_{c=1}^{2}\frac{n_{c}\mu_{c}^{2}+n_{c}\sigma_{c}^{2}}{N}-\frac{\mu_{c}^{2}}{2}}{\left(\sum_{c=1}^{2}\frac{\mu_{c}^{2}}{2}-\frac{n_{c}^{2}\mu_{c}^{2}}{N^{2}}\right)}\cdot\frac{\left(\sum_{c=1}^{2}\frac{\mu_{c}^{2}}{2}-\frac{n_{c}^{2}\mu_{c}^{2}}{N^{2}}\right)-\frac{2}{N^{2}}\prod_{c=1}^{2}n_{c}\mu_{c}}{\sum_{c=1}^{2}\frac{n_{c}\mu_{c}^{2}+n_{c}\sigma_{c}^{2}}{N}-\frac{\mu_{c}^{2}}{2}}+\Delta'_{h.o.t} \tag{153}\\
&= \frac{\left(\sum_{c=1}^{2}\frac{\mu_{c}^{2}}{2}-\frac{n_{c}^{2}\mu_{c}^{2}}{N^{2}}\right)-\frac{2}{N^{2}}\prod_{c=1}^{2}n_{c}\mu_{c}}{\left(\sum_{c=1}^{2}\frac{\mu_{c}^{2}}{2}-\frac{n_{c}^{2}\mu_{c}^{2}}{N^{2}}\right)}+\Delta'_{h.o.t} \tag{154}\\
&= 1-\frac{\frac{2}{N^{2}}\prod_{c=1}^{2}n_{c}\mu_{c}}{\left(\sum_{c=1}^{2}\frac{\mu_{c}^{2}}{2}-\frac{n_{c}^{2}\mu_{c}^{2}}{N^{2}}\right)}+\Delta'_{h.o.t}. \tag{155}
\end{align}

To better understand the result, consider the balanced class scenario where $n_{1}=n_{2}=n=N/2$. The ratio then reduces to $\approx 1-(2\mu_{1}\mu_{2})/(\mu_{1}^{2}+\mu_{2}^{2})$. Furthermore, if the class means have equal magnitude and opposite signs ($|\mu_{1}|=|\mu_{2}|$ with $\mu_{1}=-\mu_{2}$), then the ratio is $\approx 2$. This highlights how class balance and the expected class means jointly affect the relative variability collapse.
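The balanced-class simplification can be reproduced by evaluating the leading-order expression of equation 149 directly. The sketch below drops the higher-order term $\Delta'_{h.o.t}$; the function names `expected_nc1_ratio` and `balanced_ratio` are hypothetical helpers introduced only for this illustration.

```python
def expected_nc1_ratio(n, mu):
    """Evaluate the leading-order ratio of equation 149 for two classes.

    n  : (n_1, n_2) class sizes
    mu : (mu_1, mu_2) class means
    """
    total = sum(n)  # N
    cross = (2.0 / total**2) * (n[0] * mu[0]) * (n[1] * mu[1])
    denom = sum(m**2 / 2.0 - (k**2 * m**2) / total**2 for k, m in zip(n, mu))
    return 1.0 - cross / denom


def balanced_ratio(mu):
    """Closed form for n_1 = n_2: 1 - 2 mu_1 mu_2 / (mu_1^2 + mu_2^2)."""
    return 1.0 - 2.0 * mu[0] * mu[1] / (mu[0] ** 2 + mu[1] ** 2)


# General expression versus the balanced closed form.
print(expected_nc1_ratio((64, 64), (1.0, 2.0)), balanced_ratio((1.0, 2.0)))
# Equal-magnitude, opposite-sign means.
print(expected_nc1_ratio((64, 64), (-10.0, 10.0)))
```

For balanced classes the two functions agree, and for equal and opposite means the value is approximately 2, matching the discussion above.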

$\bullet$ Addressing misleading $\mathcal{NC}_{1}(\mathbf{H})$ values: Consider the case where $\sigma_{1},\sigma_{2}\to 0$. Then Theorem 5.1 for $Q_{GP-ReLU}$ indicates that $\mathbb{E}\left[\mathcal{NC}_{1}(\mathbf{H})\right]\to 0$ (up to smaller fluctuations from $\Delta_{h.o.t}$) in the balanced class setting. Such an observation can be misleading if one ignores $\mathcal{NC}_{1}(\mathbf{X})$. For instance, when training deep neural networks, a small $\mathcal{NC}_{1}(\mathbf{H})$ alone cannot distinguish a network that learned meaningful features on a complex dataset from one that simply leveraged already collapsed data vectors. This applies to the Erf activation as well. We justify this argument with the following experiment.
For a sample size $N$ chosen from $\{128,256,512,1024\}$, and input dimension $d_{0}$ chosen from $\{1,2,8,32,128\}$, we sample the vectors $\mathbf{x}^{1,i}\sim\mathcal{N}(-10*\mathbf{1}_{d_{0}},\mathbf{I}_{d_{0}}),\ i\in[N/2]$ for class $1$ and $\mathbf{x}^{2,j}\sim\mathcal{N}(10*\mathbf{1}_{d_{0}},\mathbf{I}_{d_{0}}),\ j\in[N/2]$ for class $2$ as our dataset.
From Figures 5(a) and 5(b), observe that the $\mathcal{NC}_{1}(\mathbf{H}|\mathbf{X})$ values for $Q_{GP-Erf}$ can be orders of magnitude larger than $\mathcal{NC}_{1}(\mathbf{H})$, and that $\mathcal{NC}_{1}(\mathbf{H}|\mathbf{X})>1$ in high dimensions. Essentially, the raw data is 'more' collapsed than the activations in these settings. Similar observations can be made for the NTK $\Theta_{NTK-Erf}$ in Figures 5(c) and 5(d).
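The data model of this experiment is easy to reproduce. The sketch below samples the two Gaussian classes and reports $\mathcal{NC}_{1}(\mathbf{X})$, $\mathcal{NC}_{1}(\mathbf{H})$ and their ratio for $N=1024$. Note the assumptions: the features $\mathbf{H}$ here come from a finite-width random Erf layer (`d1 = 500`, weights drawn with variance $\sigma_{w}^{2}/d_{0}$, no bias), which is only a stand-in for the limiting kernels used in Figure 5; the covariance normalizations ($1/N$ within, $1/C$ between), the helper names, and `tau` are likewise illustrative.

```python
import numpy as np
from scipy.special import erf


def nc1(features, labels):
    """tr(Sigma_W) / tr(Sigma_B) with assumed 1/N and 1/C normalizations."""
    classes = np.unique(labels)
    global_mean = features.mean(axis=0)
    within, between = 0.0, 0.0
    for c in classes:
        class_feats = features[labels == c]
        class_mean = class_feats.mean(axis=0)
        within += np.sum((class_feats - class_mean) ** 2)
        between += np.sum((class_mean - global_mean) ** 2)
    return (within / features.shape[0]) / (between / len(classes))


def sample_two_class(rng, num_samples, d0, mean=10.0):
    """Draw N/2 points per class from N(-mean * 1, I) and N(+mean * 1, I)."""
    half = num_samples // 2
    x1 = rng.normal(-mean, 1.0, size=(half, d0))
    x2 = rng.normal(mean, 1.0, size=(half, d0))
    labels = np.concatenate([np.full(half, -1), np.full(half, 1)])
    return np.vstack([x1, x2]), labels


def random_erf_features(rng, x, d1=500, sigma_w=1.0):
    """Finite-width random Erf layer (an assumption, not the limiting kernel)."""
    d0 = x.shape[1]
    w = rng.normal(0.0, sigma_w / np.sqrt(d0), size=(d0, d1))
    return erf(x @ w)


rng = np.random.default_rng(0)
tau = 1e-8
results = {}
for d0 in (1, 2, 8, 32, 128):
    x, labels = sample_two_class(rng, 1024, d0)
    h = random_erf_features(rng, x)
    nc1_x, nc1_h = nc1(x, labels), nc1(h, labels)
    results[d0] = (nc1_x, nc1_h, nc1_h / (nc1_x + tau))
    print(d0, results[d0])
```

For this data model, $\mathcal{NC}_{1}(\mathbf{X})$ stays near $1/100$ for every $d_{0}$, since the within-class variance is about 1 per coordinate while each class mean sits at a squared distance of about 100 per coordinate from the global mean. The raw data is therefore already strongly collapsed before any feature map is applied.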

[Figure 5 panels: (a) $Q_{GP-Erf}$, $\mathcal{NC}_{1}(\mathbf{H})$; (b) $Q_{GP-Erf}$, $\mathcal{NC}_{1}(\mathbf{H}|\mathbf{X})$; (c) $\Theta_{NTK-Erf}$, $\mathcal{NC}_{1}(\mathbf{H})$; (d) $\Theta_{NTK-Erf}$, $\mathcal{NC}_{1}(\mathbf{H}|\mathbf{X})$]
Figure 5: $\mathcal{NC}_{1}(\mathbf{H})$, $\mathcal{NC}_{1}(\mathbf{H}|\mathbf{X})$ of $Q^{(1)}_{GP-Erf}$ and $\Theta^{(2)}_{NTK-Erf}$. The dimension $d_{0}$ on the x-axis is chosen from $\{1,2,8,32,128\}$. For a particular $N$, we sample the vectors $\mathbf{x}^{1,i}\sim\mathcal{N}(-10*\mathbf{1}_{d_{0}},\mathbf{I}_{d_{0}}),\ y^{1,i}=-1,\ i\in[N/2]$ for class $1$ and $\mathbf{x}^{2,j}\sim\mathcal{N}(10*\mathbf{1}_{d_{0}},\mathbf{I}_{d_{0}}),\ y^{2,j}=1,\ j\in[N/2]$ for class $2$.
[Figure 6 panels: (a) EoS $\mathcal{NC}_{1}(\mathbf{H})$; (b) EoS $\mathcal{NC}_{1}(\mathbf{H}|\mathbf{X})$; (c) 2L-FCN $\mathcal{NC}_{1}(\mathbf{H})$; (d) 2L-FCN $\mathcal{NC}_{1}(\mathbf{H}|\mathbf{X})$]
Figure 6: $\mathcal{NC}_{1}(\mathbf{H})$, $\mathcal{NC}_{1}(\mathbf{H}|\mathbf{X})$ of the adaptive kernel (EoS) with final annealing factor $d_{1}=500$ and 2L-FCN with $d_{1}=500$ and Erf activation. The dimension $d_{0}$ on the x-axis is chosen from $\{1,2,8,32,128\}$. For a tuple $(n_{1},n_{2})$ such that $n_{1}+n_{2}=N=1024$, we sample the vectors $\mathbf{x}^{1,i}\sim\mathcal{N}(-2*\mathbf{1}_{d_{0}},0.25*\mathbf{I}_{d_{0}}),\ y^{1,i}=-1,\ i\in[n_{1}]$ for class $1$ and $\mathbf{x}^{2,j}\sim\mathcal{N}(2*\mathbf{1}_{d_{0}},0.25*\mathbf{I}_{d_{0}}),\ y^{2,j}=1,\ j\in[n_{2}]$ for class $2$.

Appendix G Numerical solutions of EoS

We solve the EoS using the Newton-Krylov method with an annealing schedule (as originally proposed by [40]) via the scipy.optimize.newton_krylov Python API. We initialize $\mathbf{C}$ with the GP limit value of $(\sigma_w^2/d_0)\mathbf{I}_{d_0}$ and choose a large annealing factor (e.g., $10^5$) as the value for $d_1$. The result of optimizing with newton_krylov is a new $\mathbf{C}$, which, together with a lower annealing factor, is used as the input for the next newton_krylov function call. This loop is repeated until the end of the annealing schedule. For instance, to analyze the EoS corresponding to $d_1=500$, we choose the following list of step-wise annealing factors:

$$\texttt{factors}=[\underbrace{10^{5},9*10^{4},\cdots,2*10^{4}}_{\texttt{step}=-10^{4}},\underbrace{10^{4},9*10^{3},\cdots,2*10^{3}}_{\texttt{step}=-10^{3}},\underbrace{10^{3},\cdots,500}_{\texttt{step}=-10^{2}}]. \tag{156}$$

Similarly, for a choice of $d_1=2000$, we select the slice of the above list up to $2000$. Selecting the schedule is a manual operation and can be treated as a hyper-parameter. In our experiments, we observed that this schedule is sufficient to obtain insights on the NC1 metrics of $\mathbf{Q}^{(1)}$. Thus, we leave the exploration of various annealing strategies as future work.
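The annealing loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used for the experiments: `annealing_factors`, `toy_residual`, and `solve_with_annealing` are hypothetical names, and `toy_residual` is a stand-in nonlinear residual in $\mathbf{C}$ (the actual EoS residual is defined in the main text and is not reproduced here). Only the schedule of Eq. (156), the GP-limit initialization, and the warm-started calls to scipy.optimize.newton_krylov mirror the procedure in this appendix.

```python
import numpy as np
from scipy.optimize import newton_krylov

# Step-wise annealing factors from Eq. (156):
# 10^5 -> 2*10^4 (step -10^4), 10^4 -> 2*10^3 (step -10^3), 10^3 -> 500 (step -10^2).
FACTORS = (
    list(range(10**5, 10**4, -10**4))
    + list(range(10**4, 10**3, -10**3))
    + list(range(10**3, 500 - 1, -10**2))
)


def annealing_factors(final_d1):
    """Slice of the schedule that stops at `final_d1` (hypothetical helper)."""
    return [f for f in FACTORS if f >= final_d1]


def toy_residual(C, S, C0, d1):
    """Placeholder residual, NOT the EoS: C - C0 + (C S C) / d1.

    It only mimics a data-dependent correction to the GP-limit value C0
    that becomes stronger as the annealing factor d1 decreases.
    """
    return C - C0 + (C @ S @ C) / d1


def solve_with_annealing(S, d0, final_d1, sigma_w=1.0):
    # Initialize C with the GP-limit value (sigma_w^2 / d0) * I.
    C0 = (sigma_w**2 / d0) * np.eye(d0)
    C = C0.copy()
    for d1 in annealing_factors(final_d1):
        # Warm start: the previous solution seeds the solve at the next, lower factor.
        C = newton_krylov(lambda M: toy_residual(M, S, C0, d1), C, f_tol=1e-8)
    return C, C0


rng = np.random.default_rng(0)
d0, N = 4, 256
# Two-class Gaussian data as in the running example (means -2 and +2, variance 0.25).
X = np.concatenate([
    rng.normal(-2.0, 0.5, size=(N // 2, d0)),
    rng.normal(2.0, 0.5, size=(N // 2, d0)),
])
S = X.T @ X / N  # second-moment matrix consumed by the toy residual
C_final, C_init = solve_with_annealing(S, d0, final_d1=500)
```

Calling `annealing_factors(2000)` returns the prefix of the schedule that ends at $2000$, which corresponds to the slicing described for $d_1=2000$.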

• Comparing the spectrum of weight covariance matrices: Since $\mathbf{C}$ is subject to change while obtaining the stable state of the EoS, we analyze its initial and final (normalized) spectra for two different datasets of dimension $d_0=32$ and $N=1024$. Dataset 1: $\mathbf{x}^{1,i}\sim\mathcal{N}(-2*\mathbf{1}_{d_0},0.25*\mathbf{I}_{d_0}), i\in[N/2]$, $\mathbf{x}^{2,j}\sim\mathcal{N}(2*\mathbf{1}_{d_0},0.25*\mathbf{I}_{d_0}), j\in[N/2]$. Dataset 2: $\mathbf{x}^{c,i}\sim\mathcal{N}(\mathbf{0}_{d_0},4*\mathbf{I}_{d_0}), i\in[N/2], c\in[2]$. The first dataset is our running example, and the second is pure random noise data. Surprisingly, we observed that the EoS solution captures correlations in the data for both datasets, which is reflected in its final spectrum. In particular, the singular values shift from being constant at initialization to exhibiting a decay in their values (Figure 7). Such a shift does not exactly match the case of the 2L-FCN because (1) the dynamics of GD differ from those of Newton-Krylov with annealing, and (2) we start with a GP-based initial value for $\mathbf{C}$ in the EoS. A rigorous analysis of the EoS dynamics is an open research direction (as also highlighted by Seroussi et al. [40]). Nonetheless, the EoS offers a richer, data-dependent setup than the UFM for analyzing the activations and weights.
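The normalized spectra compared here can be computed as in the following sketch. The helper name `normalized_spectrum` is hypothetical, and the Gaussian draw of the first-layer weights (variance $\sigma_w^2/d_0$) is an assumed initialization used only to produce a matrix of the right shape; trained weights or an EoS solution would be substituted in practice.

```python
import numpy as np


def normalized_spectrum(M):
    """Singular values of M in descending order, divided by the largest one."""
    s = np.linalg.svd(M, compute_uv=False)  # NumPy returns them sorted descending
    return s / s[0]


d0, d1, sigma_w = 32, 500, 1.0
rng = np.random.default_rng(0)

# EoS-style initialization: C = (sigma_w^2 / d0) * I has a flat (constant) spectrum.
C_init = (sigma_w**2 / d0) * np.eye(d0)
spec_C = normalized_spectrum(C_init)

# 2L-FCN-style quantity: the d0 x d0 matrix W^T W for a first-layer weight
# matrix W of shape (d1, d0). The Gaussian draw below is an assumption.
W = rng.normal(0.0, sigma_w / np.sqrt(d0), size=(d1, d0))
spec_W = normalized_spectrum(W.T @ W)
```

A flat `spec_C` corresponds to the constant singular values at initialization, while any decay in the spectrum after solving or training indicates that some directions of the input space are weighted more strongly than others.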

[Figure 7 panels: (a) EoS: Dataset 1; (b) 2L-FCN: Dataset 1; (c) EoS: Dataset 2; (d) 2L-FCN: Dataset 2]
Figure 7: Normalized singular values sorted in descending order $\lambda_i/\lambda_{max}, \forall i\in[32]$ for $\mathbf{C}\in\mathbb{R}^{d_0\times d_0}$ in case of EoS and for $\mathbf{W}^{(1)\top}\mathbf{W}^{(1)}\in\mathbb{R}^{d_0\times d_0}$ in case of 2L-FCN. Here init represents the initialized state of an EoS and 2L-FCN in their respective plots. The final state of EoS is obtained by solving it using Newton-Krylov. The 2L-FCN with $d_1=500$ is trained for $10{,}000$ epochs using Gradient Descent with a learning rate $10^{-3}$, weight decay $10^{-6}$, $\sigma_w=1$ and $\sigma_b=0$. The EoS follows an annealing schedule with a final factor $500$.

Appendix H Additional Experiments

Compute Resources: All the experiments in this paper were executed on a machine with $16$ GB of host memory and $8$ CPU cores. Experiments with the EoS on datasets of varying dimensions and sample sizes took the longest, at $\approx 1$ hour to finish.

[Figure 8 panels: (a) $Q_{GP-Erf}$; (b) $\Theta_{NTK-Erf}$; (c) $Q_{GP-ReLU}$; (d) $\Theta_{NTK-ReLU}$]
Figure 8: $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ of the post-activation NNGP kernel ($Q^{(1)}_{GP}$) and NTK ($\Theta^{(2)}$) corresponding to Erf, ReLU activations. The dimension $d_0$ on the x-axis is chosen from $\{1,2,8,32,128\}$. For $(n_1,n_2)$ such that $n_1+n_2=N=1024$, we sample the vectors $\mathbf{x}^{1,i}\sim\mathcal{N}(-2*\mathbf{1}_{d_0},0.25*\mathbf{I}_{d_0}), y^{1,i}=-1, i\in[n_1]$ for class $1$ and $\mathbf{x}^{2,j}\sim\mathcal{N}(2*\mathbf{1}_{d_0},0.25*\mathbf{I}_{d_0}), y^{2,j}=1, j\in[n_2]$ for class $2$ as our dataset.
[Figure 9 panels: (a) $Q_{GP-Erf}$; (b) $\Theta_{NTK-Erf}$; (c) EoS; (d) 2L-FCN]
Figure 9: $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ of the limiting kernels, adaptive kernel (EoS) with final annealing factor $d_1=500$ and 2L-FCN with $d_1=500$ and Erf activation. The dimension $d_0$ on the x-axis is chosen from $\{1,2,8,32,128\}$. For a tuple $n_c=(n_1,n_2)$ such that $n_1+n_2=N=2048$, we sample the vectors $\mathbf{x}^{1,i}\sim\mathcal{N}(-2*\mathbf{1}_{d_0},0.25*\mathbf{I}_{d_0}), y^{1,i}=-1, i\in[n_1]$ for class $1$ and $\mathbf{x}^{2,j}\sim\mathcal{N}(2*\mathbf{1}_{d_0},0.25*\mathbf{I}_{d_0}), y^{2,j}=1, j\in[n_2]$ for class $2$.
[Figure 10 panels: (a) $Q_{GP-Erf}$; (b) $\Theta_{NTK-Erf}$; (c) EoS; (d) 2L-FCN]
Figure 10: $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ of the limiting kernels, adaptive kernel (EoS) with final annealing factor $d_1=500$ and 2L-FCN with $d_1=500$ and Erf activation. The dimension $d_0$ on the x-axis is chosen from $\{1,2,8,32,128\}$. For a tuple $n_c=(n_1,n_2,n_3,n_4)$ such that $n_1+n_2+n_3+n_4=N=1024$, we sample the vectors $\mathbf{x}^{1,i}\sim\mathcal{N}(-6*\mathbf{1}_{d_0},0.25*\mathbf{I}_{d_0}), y^{1,i}=-3, i\in[n_1]$ for class $1$, $\mathbf{x}^{2,j}\sim\mathcal{N}(-2*\mathbf{1}_{d_0},0.25*\mathbf{I}_{d_0}), y^{2,j}=-1, j\in[n_2]$ for class $2$, $\mathbf{x}^{3,k}\sim\mathcal{N}(2*\mathbf{1}_{d_0},0.25*\mathbf{I}_{d_0}), y^{3,k}=1, k\in[n_3]$ for class $3$ and $\mathbf{x}^{4,l}\sim\mathcal{N}(6*\mathbf{1}_{d_0},0.25*\mathbf{I}_{d_0}), y^{4,l}=3, l\in[n_4]$ for class $4$.
[Figure 11 panels: (a) $Q_{GP-Erf}$; (b) $\Theta_{NTK-Erf}$; (c) EoS; (d) 2L-FCN]
Figure 11: $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ of the limiting kernels, adaptive kernel (EoS) with final annealing factor $d_1=500$ and 2L-FCN with $d_1=500$ and Erf activation. The dimension $d_0$ on the x-axis is chosen from $\{1,2,8,32,128\}$. For a particular $N$, we sample the vectors $\mathbf{x}^{1,i}\sim\mathcal{N}(-6*\mathbf{1}_{d_0},0.25*\mathbf{I}_{d_0}), y^{1,i}=-1, i\in[N/2]$ for class $1$, and $\mathbf{x}^{2,j}\sim\mathcal{N}(6*\mathbf{1}_{d_0},0.25*\mathbf{I}_{d_0}), y^{2,j}=1, j\in[N/2]$ for class $2$.
[Figure 12 panels: (a) $Q_{GP-Erf}$; (b) $\Theta_{NTK-Erf}$; (c) EoS; (d) 2L-FCN]
Figure 12: $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ of the limiting kernels, adaptive kernel (EoS) with final annealing factor $d_1=500$ and 2L-FCN with $d_1=500$ and Erf activation. The dimension $d_0$ on the x-axis is chosen from $\{1,2,8,32,128\}$. For a particular $N$, we sample the vectors $\mathbf{x}^{1,i}\sim\mathcal{N}(-2*\mathbf{1}_{d_0},\mathbf{I}_{d_0}), y^{1,i}=-1, i\in[N/2]$ for class $1$ and $\mathbf{x}^{2,j}\sim\mathcal{N}(2*\mathbf{1}_{d_0},\mathbf{I}_{d_0}), y^{2,j}=1, j\in[N/2]$ for class $2$.
[Figure 13 panels: (a) $Q_{GP-Erf}$; (b) $\Theta_{NTK-Erf}$; (c) EoS; (d) 2L-FCN]
Figure 13: $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ of the limiting kernels, adaptive kernel (EoS) with final annealing factor $d_1=500$ and 2L-FCN with $d_1=500$ and Erf activation. The dimension $d_0$ on the x-axis is chosen from $\{8,16,32,64,128\}$. For a particular $N$, we sample the vectors $\mathbf{x}^{1,i}\sim\mathcal{N}(-2*\mathbf{1}_{d_0},4*\mathbf{I}_{d_0}), y^{1,i}=-1, i\in[N/2]$ for class $1$ and $\mathbf{x}^{2,j}\sim\mathcal{N}(2*\mathbf{1}_{d_0},4*\mathbf{I}_{d_0}), y^{2,j}=1, j\in[N/2]$ for class $2$.
[Figure 14 panels: (a) $L=3$; (b) $L=4$; (c) $L=5$; (d) $L=6$]
Figure 14: $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ of deeper FCN networks with Erf activation and hidden layer width $500$. The dimension $d_0$ on the x-axis is chosen from $\{1,2,8,32,128\}$. For a particular $N$, we sample the vectors $\mathbf{x}^{1,i}\sim\mathcal{N}(-2*\mathbf{1}_{d_0},0.25*\mathbf{I}_{d_0}), y^{1,i}=-1, i\in[N/2]$ for class $1$ and $\mathbf{x}^{2,j}\sim\mathcal{N}(2*\mathbf{1}_{d_0},0.25*\mathbf{I}_{d_0}), y^{2,j}=1, j\in[N/2]$ for class $2$.

Appendix I Impact Statement

This paper aims to address the limitations of the unconstrained features model in understanding the role of data in the Neural Collapse phenomenon. Our work does not have any direct negative societal impact. On the contrary, it takes a step forward in understanding the characteristics of datasets that govern the performance of deep neural network classifiers, thus laying the groundwork for theoretically informed training practices for wide societal use.

Appendix J Limitations and Future Work

In certain cases, we have observed that none of the kernel methods approximate the 2L-FCN reasonably. One such instance is the following, where we sample $\mathbf{x}^{1,i}\sim\mathcal{N}(-2*\mathbf{1}_{d_0},4*\mathbf{I}_{d_0}), y^{1,i}=-1, i\in[N/2]$ for class $1$ and $\mathbf{x}^{2,j}\sim\mathcal{N}(2*\mathbf{1}_{d_0},4*\mathbf{I}_{d_0}), y^{2,j}=1, j\in[N/2]$ for class $2$ of our dataset. Essentially, these are scenarios where there is a significant overlap between samples of the two classes. First, we note that we had to increase the learning rate of our 2L-FCN from $10^{-3}$ to $5\cdot 10^{-3}$ and run GD for $2000$ epochs for convergence. For dimensions $d_0\in\{8,16,32\}$, the EoS reasonably approximates the 2L-FCN, but for $d_0\in\{64,128\}$, the $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ values for the 2L-FCN turned out to be almost twice as large as those of the EoS (see Figure 13). To this end, we leave modifications to the EoS for handling such noisy data cases and different activation functions as future work.
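To make the notion of overlap concrete, the sketch below samples the two-class Gaussian data for both the running example (variance $0.25$) and the noisy case above (variance $4$), and evaluates a trace-ratio variability measure directly on the input vectors. The helper names `sample_two_class` and `trace_ratio_nc1` are hypothetical, and the form $\text{Tr}(\Sigma_W)/\text{Tr}(\Sigma_B)$ is assumed here as the NC1-style quantity; it is applied to the data rather than to network features. For class means $\pm 2*\mathbf{1}_{d_0}$ and covariance $\sigma^2\mathbf{I}_{d_0}$, the population value of this ratio is $\sigma^2 d_0/(4 d_0)=\sigma^2/4$.

```python
import numpy as np


def sample_two_class(n_per_class, d0, mean, var, rng):
    """Class 1 ~ N(-mean*1, var*I) with label -1; class 2 ~ N(+mean*1, var*I) with label +1."""
    std = np.sqrt(var)
    X = np.concatenate([
        rng.normal(-mean, std, size=(n_per_class, d0)),
        rng.normal(mean, std, size=(n_per_class, d0)),
    ])
    y = np.concatenate([-np.ones(n_per_class), np.ones(n_per_class)])
    return X, y


def trace_ratio_nc1(X, y):
    """Tr(Sigma_W) / Tr(Sigma_B) computed on the rows of X (assumed metric form)."""
    classes = np.unique(y)
    # Global mean taken as the average of the class means.
    global_mean = np.mean([X[y == c].mean(axis=0) for c in classes], axis=0)
    tr_w, tr_b = 0.0, 0.0
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)
        # Within-class: average squared distance to the class mean.
        tr_w += np.mean(np.sum((Xc - mu_c) ** 2, axis=1)) / len(classes)
        # Between-class: squared distance of the class mean to the global mean.
        tr_b += np.sum((mu_c - global_mean) ** 2) / len(classes)
    return tr_w / tr_b


rng = np.random.default_rng(0)
X_clean, y_clean = sample_two_class(512, 32, mean=2.0, var=0.25, rng=rng)
X_noisy, y_noisy = sample_two_class(512, 32, mean=2.0, var=4.0, rng=rng)
nc1_clean = trace_ratio_nc1(X_clean, y_clean)  # population value: 0.25 / 4
nc1_noisy = trace_ratio_nc1(X_noisy, y_noisy)  # population value: 4 / 4
```

A ratio close to $1$ means that the within-class spread is comparable to the separation between the class means, which is the overlap regime discussed above.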

Additionally, we highlight the difficulties in the theoretical/empirical analysis of NC1 with the EoS. The primary bottleneck is the lack of a rigorous study on the existence and uniqueness of solutions (as also highlighted by [40]). Since we deviate from the lazy regime and deal with kernels in the feature learning setup, we cannot expect simple closed-form solutions for the EoS like those of the limiting NNGP/NTK. Moreover, solving the EoS numerically can be time-consuming and requires a manual selection of the annealing schedule. This is a tradeoff that can be improved with future research. Furthermore, the role of scaling $N, d_0, d_1$ on NC1 is yet to be fully understood, and we hope that our analysis lays the groundwork for such efforts.

Finally, we point the reader to Appendix F for a discussion on a relative NC1 metric that explicitly incorporates the variability collapse of the data vectors into the NC1 metric. In particular, we aim to differentiate between settings where the neural network learned meaningful features to classify complex datasets and settings where it was simply able to leverage the already collapsed data vectors. Our results showcase that in higher dimensions, the data vectors are ‘more’ collapsed than the activations themselves. This highlights the limitations of the current NC1 metrics and encourages the exploration of much richer variants.