Provably Neural Active Learning Succeeds via Prioritizing Perplexing Samples

Dake Bu    Wei Huang    Taiji Suzuki    Ji Cheng    Qingfu Zhang    Zhiqiang Xu    Hau-San Wong
Abstract

Neural Network-based active learning (NAL) is a cost-effective data selection technique that utilizes neural networks to select and train on a small subset of samples. While existing work successfully develops various effective or theory-justified NAL algorithms, the understanding of the two commonly used query criteria of NAL, uncertainty-based and diversity-based, remains in its infancy. In this work, we try to move one step forward by offering a unified explanation for the success of both query criteria-based NAL from a feature learning view. Specifically, we consider a feature-noise data model comprising easy-to-learn or hard-to-learn features disrupted by noise, and conduct analysis over 2-layer NN-based NALs in the pool-based scenario. We provably show that both uncertainty-based and diversity-based NAL are inherently amenable to one and the same principle, i.e., striving to prioritize samples that contain yet-to-be-learned features. We further prove that this shared principle is the key to their success: achieving a small test error with a small labeled set. In contrast, strategy-free passive learning exhibits a large test error due to inadequate learning of yet-to-be-learned features, and thus requires a significantly larger label complexity to sufficiently reduce the test error. Experimental results validate our findings.

Machine Learning, ICML

1 Introduction

In the deep learning era, we witness the power of neural networks in representation learning. It is also well-known that their success relies on a substantial amount of data and extensive labeling efforts. On the other hand, active learning offers various approaches to select a small subset of unlabeled samples from a large pool of data for labeling and training, while achieving comparable generalization performance to learning on the entire dataset (Settles, 2009; Aggarwal et al., 2014). To enjoy the best of both worlds, people combine neural networks with active learning, giving rise to Neural Network-based Active Learning (NAL) or Deep Active Learning (DAL), such that over-parameterized neural models can work with limited size of labeled data. As summarized in Takezoe et al. (2023), NAL/DAL incorporates two primary criteria for querying (selecting) unlabeled samples: uncertainty-based (Roth and Small, 2006; Joshi et al., 2009) and diversity-based (Sener and Savarese, 2018; Gissin and Shalev-Shwartz, 2019). Also, some studies leverage both criteria to design NAL algorithms (Yin et al., 2017; Shui et al., 2020).

Notably, while various NAL algorithms based on the two query criteria have achieved significant empirical success, they often come without provable performance guarantees. To overcome this limitation, recent studies (Gu et al., 2014; Gu, 2014; Wang et al., 2022a) came up with theory-driven NAL algorithms. These studies reformulate the problem into a subset selection problem or multi-armed bandit problem, and then utilize theoretical analysis techniques to guarantee the test performance. However, it remains poorly understood why the two widely used query criteria in the NAL family work so well, which naturally leads us to the following questions.

Essential Questions

1. What is the theoretical rationale behind the success of the two query criteria-based NAL algorithms, namely uncertainty-based and diversity-based?
2. Whether and how do the two query criteria of NAL connect to each other intrinsically?

1.1 Our Contribution

To answer the above questions, in this work, we delve into the feature learning dynamics of NAL algorithms. To start with, we draw inspiration from the data models in Zou et al. (2023a); Allen-Zhu and Li (2023); Lu et al. (2023) that consist of multiple task-related feature patches and noise patches with varying strengths and frequencies, similar to what is observed in real-world imbalanced datasets, and conjecture that successful NAL algorithms are able to ensure adequate learning of all types of task-related features.

In this spirit, we adopt a multi-view feature-noise data model that comprises two main components: i) easy-to-learn (i.e., strong & common) features or hard-to-learn (i.e., weak & rare) features, and ii) noise. In Figure 1, the easy-to-learn features are exemplified by the frontal male lions with brown fur in the first row, given their common and easily identifiable lion traits, while lions in all the other rows can be characterized as the hard-to-learn features since they exhibit distinctive poses, colors, ages, races, fur patterns, and even heterogeneity. Hard-to-learn features are less common in the dataset and correspond to weakly recognizable lion traits, compared to the easy-to-learn features.

Figure 1: Lions in real-world dataset.

Under our data model, we reformulate two representative NAL algorithms, i.e., Uncertainty Sampling and Diversity Sampling, in a pool-based setting, corresponding to two query criteria, respectively. Both are built upon a two-layer ReLU convolutional neural network, and trained by gradient descent. In accordance with the principle of each approach family (Takezoe et al., 2023), the proposed Uncertainty Sampling queries based on the lowest confidence (Lewis and Catlett, 1994), and Diversity Sampling queries based on the largest distance between feature representations of unlabeled samples in the pool and those of labeled data (Sener and Savarese, 2018).

Over our data and algorithm models, our theory sheds light on the benefits of the two primary query criteria in the NAL family. Surprisingly, our analysis unveils that the success of both criteria-based NAL stems from their inherent shared principle, leading to a unified view. Specifically, we make the following contributions in this work.

  • We offer the insight that, from a feature learning view, the two query criteria-based NAL can be unified as one family. We provably show that the two query criteria-based NAL share the same working principle, i.e., prioritizing perplexing samples, namely samples with yet-to-be-learned features. Our analysis reveals that in our scenario, those yet-to-be-learned features are actually the weak & rare features.

  • We elucidate a marked disparity in the generalization capabilities between passive learning and NAL algorithms. Our analysis suggests that both NAL algorithms can learn weak & rare features adequately via prioritizing perplexing samples, and thus achieve a small test error. In contrast, strategy-free passive learning exhibits a large test error. The disparity can be intensified in some out-of-distribution cases. Our experimental study corroborates this finding.

  • We further uncover why and to what extent the two query criteria can alleviate labelling effort. The key lies in NAL’s ability to effectively query perplexing samples in the training distribution. In contrast, we find that strategy-free passive learning requires a significantly larger label complexity to adequately learn all types of features.

    Perplexing Samples

    Samples in the sampling pool that possess yet-to-be-learned features. We prove that both Uncertainty Sampling and Diversity Sampling inherently strive to query them.

1.2 Related Work

Neural Active Learning. Neural Network-based Active Learning (NAL) is one of the core data selection automation techniques in the field of data-centric approaches for AutoML and Computer Vision. As summarized in recent surveys (Zhan et al., 2021, 2022; Takezoe et al., 2023), there are two main query criteria: uncertainty-based, which chooses samples that the neural models feel most uncertain about (Seung et al., 1992; Lewis and Catlett, 1994; Roth and Small, 2006; Joshi et al., 2009; Houlsby et al., 2011; Cai et al., 2013; Yang and Loog, 2016; Kampffmeyer et al., 2016; Gal et al., 2017; Wang et al., 2022b; Kye et al., 2023; Duan et al., 2024; Cho et al., 2024), and diversity (representative)-based, which selects samples that are diverse from the labeled set in the feature space (Stark et al., 2015; Du et al., 2015; Wang et al., 2016; Sener and Savarese, 2018; Gissin and Shalev-Shwartz, 2019; Sinha et al., 2019; Shui et al., 2020). Also, many works combine the two query criteria into the sampling (querying) strategy through weighted-sum optimization (Yin et al., 2017) or two-stage optimization (Ash et al., 2020; Zhdanov, 2019; Shui et al., 2020). In addition, to develop reliable algorithms, several works design methods with theoretical guarantees, drawing on theories such as VC bound (Balcan et al., 2006; Zhu and Nowak, 2022), Logistic Bound (Gu et al., 2014), Rademacher Complexity (Gu, 2014; Shui et al., 2020), and Neural Tangent Kernel (Wang et al., 2021; Mohamadi et al., 2022; Kong et al., 2022; Wang et al., 2022a; Wen et al., 2023). However, despite the development of numerous effective and theory-justified algorithms, the existing studies have not yet offered a comprehensive explanation for the underlying mechanisms of the two query criteria widely applied in NAL. Largely different from prior work, our work pioneers the theoretical exploration of the two criteria by studying the feature learning dynamics in NAL algorithms.

Feature Learning in Learning Theory. Recent years have witnessed an extensive body of research in learning theory on structured data from the perspective of feature learning (Li and Liang, 2018; Karp et al., 2021; Allen-Zhu and Li, 2023; Chen et al., 2022, 2023a, 2023b, 2023c, 2023d; Zou et al., 2023b; Li et al., 2023; Kou et al., 2023a, c; Huang et al., 2023a, c; Chidambaram et al., 2023; Deng et al., 2023). The essence of this line of research is to explicitly study the learning progress of features and the degree of noise memorization under different data and algorithm scenarios, which serves as an intermediate proxy to examine the convergence of training and 0-1 loss. Specifically, Cao et al. (2022a) demonstrate the occurrence of benign overfitting in convolutional neural networks over linearly separable data under distinct conditions. Subsequently, Kou et al. (2023b) obtain similar results with ReLU activation, Meng et al. (2023) further derive results over XOR data, Zou et al. (2023a) reveal the benefits of Mixup training over linearly separable data with common and rare features, and Lu et al. (2023) explore the phenomenon of benign oscillation over linearly separable data with common & weak and rare & strong features. Our work extends this line of research by investigating the rationale behind the two primary criteria in the NAL family, over both linearly and non-linearly separable data scenarios that include common & strong and rare & weak features. Our study focuses on characterizing the feature learning dynamics in NAL algorithms and providing a mathematical explanation for the benefits and inner relationship of the two primary query criteria of NAL.

2 Problem Settings

Notations. We use $\|\cdot\|_{p}$ to denote the $l_{p}$ norm. For two sequences $a_{n}$ and $b_{n}$, we write $a_{n}=O(b_{n})$ if there exist positive constants $C>0$ and $N>0$ such that $|a_{n}|\leq C|b_{n}|$ for all $n\geq N$. Similarly, we write $a_{n}=\Omega(b_{n})$ if $b_{n}=O(a_{n})$ holds, and $a_{n}=\Theta(b_{n})$ if $a_{n}=O(b_{n})$ and $a_{n}=\Omega(b_{n})$ both hold. We write $c_{n}=O(a_{n},b_{n})$ if $c_{n}=O(\min\{a_{n},b_{n}\})$ holds, and $c_{n}=\Omega(a_{n},b_{n})$ if $c_{n}=\Omega(\max\{a_{n},b_{n}\})$ holds. To omit logarithmic terms, we use the notations $\widetilde{O}(\cdot)$, $\widetilde{\Omega}(\cdot)$, and $\widetilde{\Theta}(\cdot)$. We use $\mathbb{1}(\cdot)$ to denote the indicator variable of an event. We say $y=\operatorname{poly}(a_{1},\ldots,a_{k})$ if $y=O(\max\{a_{1},\ldots,a_{k}\}^{D})$ for some $D>0$, and $b=\operatorname{polylog}(a)$ if $b=\operatorname{poly}(\log(a))$.

2.1 Data Distribution

In this study, we focus on the pool-based selective sampling scenario, where the algorithms first train the model on an initial labeled set and subsequently query a single batch of unlabeled samples from a large sampling pool. The algorithms then retrain the model with fresh initialization. We denote the size of the initial labeled set by $n_{0}$, the querying (sampling) size for all querying algorithms by $n^{*}$ ($n^{*}=\Omega(n_{0})>n_{0}$), and the size of the labeled set after querying by $n_{1}=n_{0}+n^{*}$. We also define $\widetilde{n}$ as the maximum size of the labeled set after querying, such that $n_{1}\leq\widetilde{n}$. Moreover, we denote the initial labeled set by $\mathcal{D}_{n_{0}}:=\{\mathbf{x}^{(i)}\}_{i=1}^{n_{0}}$ and the sampling pool by $\mathcal{P}$. Both are synthesized from the same data distribution $\mathcal{D}$, which is specified as follows.

Definition 2.1.

Let $\bm{\mu}_{1}\perp\bm{\mu}_{2}\in\mathbb{R}^{d}$ be two fixed feature vectors. Each data point $(\mathbf{x},y)$, where $\mathbf{x}=[\mathbf{x}_{1}^{T},\mathbf{x}_{2}^{T}]^{T}\in\mathbb{R}^{2d}$ contains two patches and $y\in\{-1,1\}$, is generated from the distribution $\mathcal{D}$ as follows:

  • The ground truth label $y$ is synthesized from a Rademacher distribution.

  • Noise Patch. One patch of $\mathbf{x}$ is selected as a noise patch $\bm{\xi}$, synthesized from the Gaussian distribution $N(\mathbf{0},\sigma_{p}^{2}\cdot\mathbf{I})$.

  • Feature Patch. For a small $p$ satisfying $p<0.5$, the remaining patch of $\mathbf{x}$ is selected as the label-related feature patch. With high probability $1-p$ the feature patch is a strong feature $y\cdot\bm{\mu}_{1}$, while only with probability $p$ the feature patch is a weak feature $y\cdot\bm{\mu}_{2}$.

We assume the following about the feature norms: $\forall l\in\{1,2\}$, $\|\bm{\mu}_{l}\|_{2}^{2}=\Omega(\sigma_{p}^{2}\log(n_{0}/\delta),\widetilde{n}^{-1}d\sigma_{p}^{4})$, $\|\bm{\mu}_{1}\|_{2}^{4}=\Omega(\sigma_{p}^{4}dn_{0}^{-1})$, and $\|\bm{\mu}_{2}\|_{2}^{4}=O(\sigma_{p}^{4}dn_{0}^{-1})$. The choices of $\|\bm{\mu}_{l}\|$ aim to prevent learning of features completely disrupted by noise, while amplifying the distinguishability of the strong feature patch compared to the weak one. Our theory allows for a broader range of parameter settings (see Appendix D.3 for general cases), but for simplicity of presentation we choose a feasible one here.
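To make Definition 2.1 concrete, the following is a minimal sampling sketch in plain Python. It is illustrative only: the function name `sample_point`, the numeric values of $\bm{\mu}_{1}$, $\bm{\mu}_{2}$, $p$, and $\sigma_{p}$, and the convention of placing the feature patch before the noise patch are all assumptions made here, and the values are not tuned to satisfy the norm conditions above.

```python
import random

def sample_point(mu1, mu2, p, sigma_p, rng):
    """Draw one sample (x, y, is_weak) following Definition 2.1.

    Assumption: the feature patch is placed first and the noise patch second;
    the definition itself does not fix the patch order.
    """
    y = rng.choice([-1, 1])                        # Rademacher label
    is_weak = rng.random() < p                     # weak feature with probability p
    mu = mu2 if is_weak else mu1                   # strong feature with probability 1 - p
    feature = [y * v for v in mu]                  # feature patch y * mu_l
    noise = [rng.gauss(0.0, sigma_p) for _ in mu]  # noise patch ~ N(0, sigma_p^2 I)
    return feature + noise, y, is_weak


# Hypothetical toy values chosen for illustration.
rng = random.Random(0)
mu1 = [2.0, 0.0, 0.0, 0.0]   # strong feature
mu2 = [0.0, 0.5, 0.0, 0.0]   # weak feature, orthogonal to mu1
pool = [sample_point(mu1, mu2, p=0.1, sigma_p=1.0, rng=rng) for _ in range(2000)]
weak_fraction = sum(is_weak for _, _, is_weak in pool) / len(pool)
```

Under these assumptions, `weak_fraction` should be close to $p$, which reflects why a randomly drawn labeled set contains few weak-feature samples.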

This feature-noise data model captures the structure of an image, as depicted in Figure 1, by incorporating task-oriented distinctive patterns (features) and background patterns (noise) with different frequencies and strengths. As in the patch settings of Zou et al. (2023a); Allen-Zhu and Li (2023); Lu et al. (2023), the weak feature patches are orthogonal to the strong feature patches in our setting, which is reasonable since the rare features appear largely different from the common ones. It is worth noting that this type of data setting is common in the widely recognized feature learning line of research (Allen-Zhu and Li, 2023; Cao et al., 2022a; Kou et al., 2023b; Zou et al., 2023a; Meng et al., 2023). Allen-Zhu and Li (2023) justify this type of data setting as a plausible theoretical setup by highlighting the common occurrence of multiple one-task-oriented features in the latent space of ResNet, as shown in their Figures 2-4 and 9. Furthermore, recent empirical and theoretical studies indicate the orthogonal nature of different features within the latent space of ViT and LLM (Yamagiwa et al., 2023; Jiang et al., 2024). To extend our contributions to more practical scenarios, we also conduct a rigorous study over a non-linearly separable, non-orthogonal data distribution, namely the XOR data defined in Definition C.2, and obtain similar theoretical findings.

2.2 Querying Algorithms

Neural Setting. This work considers a two-layer ReLU CNN, as adopted in Kou et al. (2023b); Meng et al. (2023); Kou et al. (2023c); Chen et al. (2023d), as the base neural network for the querying algorithms. The CNN function $f(\mathbf{W},\mathbf{x})$ is defined as $\sum_{j=\pm 1}j\cdot F_{j}(\mathbf{W},\mathbf{x})$, with $F_{j}(\mathbf{W},\mathbf{x})$ given by

$$F_{j}(\mathbf{W},\mathbf{x})=\frac{1}{m}\sum_{r=1}^{m}\left[\sigma\left(\left\langle\mathbf{w}_{j,r},y\cdot\bm{\mu}\right\rangle\right)+\sigma\left(\left\langle\mathbf{w}_{j,r},\bm{\xi}\right\rangle\right)\right],$$

where the second layer is fixed as $\pm 1/m$, $m$ is the number of neurons, $\sigma(z)=\max\{z,0\}$ is the ReLU function, $\mathbf{w}_{j,r}\in\mathbb{R}^{d}$ denotes the weights of the $r$-th neuron of $F_{j}$, $\mathbf{W}_{j}\in\mathbb{R}^{m\times d}$ collects the weights in $F_{j}$, and $\mathbf{W}$ collects all weights.
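The forward pass of this network can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the authors' implementation: the dictionary layout of `W` (mapping $j\in\{+1,-1\}$ to a list of $m$ filters) and the helper names `relu`, `dot`, `F`, and `f` are assumptions introduced here.

```python
def relu(z):
    return max(z, 0.0)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def F(W_j, feature, noise):
    """F_j(W, x): mean over the m filters of relu(<w, feature>) + relu(<w, noise>)."""
    m = len(W_j)
    return sum(relu(dot(w, feature)) + relu(dot(w, noise)) for w in W_j) / m

def f(W, feature, noise):
    """f(W, x) = F_{+1}(W, x) - F_{-1}(W, x); the second layer is fixed to +-1/m."""
    return sum(j * F(W[j], feature, noise) for j in (1, -1))


# Toy check with m = 1 filter per branch and d = 2 (hypothetical values).
W = {1: [[1.0, 0.0]], -1: [[0.0, 1.0]]}
feature = [2.0, 0.0]   # y * mu with y = +1 and mu = (2, 0)
noise = [0.0, 3.0]
out = f(W, feature, noise)   # F_{+1} = 2 and F_{-1} = 3, so out = -1
```

In this toy case the $+1$ branch responds only to the feature patch and the $-1$ branch only to the noise patch, so the output is negative even though $y=+1$: a network that has memorized noise but not the feature misclassifies the sample.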

Training Setting. We utilize gradient descent to train the neural model. Let $n$ denote the size of the current labeled training set $\mathcal{D}=\{(\mathbf{x}^{(i)},y_{i})\}_{i=1}^{n}$, generated from $\mathcal{D}$ over $\mathbf{x}\times y$. We apply the empirical logistic loss:

$$L_{S}(\mathbf{W})=\frac{1}{n}\sum_{i=1}^{n}\ell\left[y_{i}\cdot f\left(\mathbf{W},\mathbf{x}^{(i)}\right)\right],\tag{1}$$

where $\ell(z)=\log(1+\exp(-z))$. The gradient update of the filters in the first layer can be written as follows:

$$\begin{aligned}\mathbf{w}_{j,r}^{(t+1)}&=\mathbf{w}_{j,r}^{(t)}-\eta\cdot\nabla_{\mathbf{w}_{j,r}}L_{S}\left(\mathbf{W}^{(t)}\right)\\&=\mathbf{w}_{j,r}^{(t)}-\frac{\eta}{nm}\sum_{i=1}^{n}{\ell_{i}^{\prime}}^{(t)}\cdot\sigma^{\prime}\left(\left\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\right\rangle\right)\cdot jy_{i}\bm{\xi}_{i}\\&\quad-\frac{\eta}{nm}\sum_{l=1}^{2}\sum_{i\in U^{l}}{\ell_{i}^{\prime}}^{(t)}\cdot\sigma^{\prime}\left(\left\langle\mathbf{w}_{j,r}^{(t)},y_{i}\bm{\mu}_{l}\right\rangle\right)\cdot j\bm{\mu}_{l},\end{aligned}\tag{2}$$

where $U^{l}=\{\mathbf{x}\in\mathcal{D}\mid\mathbf{x}_{\text{signal part}}=\bm{\mu}_{l}\}$ denotes the set of indices of $\mathcal{D}$ whose feature patch is $\bm{\mu}_{l}$, and ${\ell_{i}^{\prime}}^{(t)}$ denotes $\ell^{\prime}\left[y_{i}\cdot f(\mathbf{W}^{(t)},\mathbf{x}^{(i)})\right]$. All entries of $\mathbf{W}^{(0)}$ are initialized independently and identically distributed (i.i.d.) from a Gaussian distribution with mean 0 and variance $\sigma_{0}^{2}$. After a single round of querying, the querying algorithms retrain the neural models with the same model initialization.
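The update in Eq. (2) can be sketched as one full-batch gradient descent step in plain Python. This is a hedged illustration under stated assumptions, not the authors' code: each sample is stored as a hypothetical triple `(feature, noise, y)` with `feature` equal to $y_{i}\bm{\mu}_{l}$, `W` maps $j\in\{+1,-1\}$ to a list of filters, and the ReLU derivative at zero is taken to be 0.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def f(W, feature, noise):
    """Two-layer ReLU CNN output; W maps j in {+1, -1} to a list of m filters."""
    total = 0.0
    for j in (1, -1):
        m = len(W[j])
        total += j * sum(max(dot(w, feature), 0.0) + max(dot(w, noise), 0.0)
                         for w in W[j]) / m
    return total

def gd_step(W, data, eta):
    """One full-batch step of Eq. (2); data holds (feature, noise, y) triples."""
    n = len(data)
    # ell'(z) = -1 / (1 + exp(z)) for ell(z) = log(1 + exp(-z)), at z = y_i * f(W, x_i)
    lprime = [-1.0 / (1.0 + math.exp(y * f(W, feat, noise))) for feat, noise, y in data]
    new_W = {}
    for j in (1, -1):
        m = len(W[j])
        new_W[j] = []
        for w in W[j]:
            grad = [0.0] * len(w)
            for (feat, noise, y), lp in zip(data, lprime):
                if dot(w, noise) > 0:      # sigma'(<w, xi_i>), with sigma'(0) = 0 assumed
                    for k in range(len(w)):
                        grad[k] += lp * j * y * noise[k]
                if dot(w, feat) > 0:       # sigma'(<w, y_i mu_l>)
                    for k in range(len(w)):
                        grad[k] += lp * j * y * feat[k]   # equals j * mu_l since y_i^2 = 1
            new_W[j].append([wk - eta / (n * m) * gk for wk, gk in zip(w, grad)])
    return new_W


# Toy check: one noiseless sample with y = +1 and mu = (1, 0) (hypothetical values).
W0 = {1: [[0.1, 0.0]], -1: [[-0.1, 0.0]]}
data = [([1.0, 0.0], [0.0, 0.0], 1)]
W1 = gd_step(W0, data, eta=0.5)
```

In this toy case only the $j=+1$ filter is active on the feature patch, so it grows along $\bm{\mu}$ while the $j=-1$ filter is unchanged. The growth is scaled by $1/n$ per sample, which is why a feature that appears in few training samples receives little cumulative gradient.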

Querying Setting. During the querying stage, all the querying algorithms select $n^{*}$ new unlabeled samples from $\mathcal{P}$, where the pool size $|\mathcal{P}|$ satisfies $|\mathcal{P}|=\Omega(p^{-1}\sigma_{p}^{4}d\|\bm{\mu}_{2}\|_{2}^{-4},p^{-1}\log(1/\delta))$. This choice of $|\mathcal{P}|$ ensures the sufficient presence of weak features in $\mathcal{P}$. The three querying algorithms differ from each other in their sampling rules, as below:

  • Random Sampling (strategy-free passive learning) randomly selects $n^{*}$ new samples from $\mathcal{P}$.

  • Uncertainty Sampling (uncertainty-based NAL) selects the $n^{*}$ new samples from $\mathcal{P}$ with the lowest Confidence Score at time step $t$. The Confidence Score $C\left(\mathbf{W},\mathbf{x}\right)$ measures the model's confidence in predicting the label of sample $\mathbf{x}$, and is defined as:

    $$C\left(\mathbf{W},\mathbf{x}\right)=\max\Big\{\frac{1}{1+\exp(-y\cdot f(\mathbf{W},\mathbf{x}))},\;1-\frac{1}{1+\exp(-y\cdot f(\mathbf{W},\mathbf{x}))}\Big\},$$

    which is the probability that the logistic model assigns to the predicted label. In our scenario, the proposed Uncertainty Sampling is equivalent to many well-known uncertainty-based approaches such as Least Confidence (Lewis and Catlett, 1994), Margin (Roth and Small, 2006), and Entropy methods (Joshi et al., 2009), as discussed in Lemma F.5 in Appendix F.2.

  • Diversity Sampling (diversity-based NAL) selects the $n^{*}$ new samples from $\mathcal{P}$ with the highest Feature Distance at time step $t$. The Feature Distance $D\left(\mathbf{W},\mathbf{x}\mid\mathcal{D}_{n_{0}}\right)$ measures the $l_{p}$ distance between sample $\mathbf{x}$ and $\mathcal{D}_{n_{0}}$ in feature space:

    $$D(\mathbf{W},\mathbf{x}\mid\mathcal{D}_{n_{0}})=\Big\|\mathbf{Z}(\mathbf{x},t)-\underset{\mathbf{x}^{(i)}\in\mathcal{D}_{n_{0}}}{\mathbb{E}}\mathbf{Z}(\mathbf{x}^{(i)},t)\Big\|_{p},$$

    where $\mathbf{Z}(\mathbf{x},t)$ is defined as the sum of feature maps in the feature space of the CNN:

    $$\mathbf{Z}(\mathbf{x},t)=\sum_{j}\Big(\sigma(\langle\mathbf{W}_{j}^{(t)},\mathbf{x}_{1}\rangle)+\sigma(\langle\mathbf{W}_{j}^{(t)},\mathbf{x}_{2}\rangle)\Big).$$

    Specifically, Lemma 4.2 reveals that in our scenario the proposed Diversity Sampling is equivalent for all values of $p$ in the range $[1,\infty)$. This implies that our metric can be any of various distance measures, including the Euclidean, Manhattan, or Minkowski distance.
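
The three sampling rules above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the array layout (`W` of shape `(2, m, d)` for the two filter groups, inputs of shape `(n, 2, d)` for the two patches), the ReLU default for $\sigma$, and all function names are assumptions made for the example. Because $1-\mathrm{sigmoid}(z)=\mathrm{sigmoid}(-z)$ and $y\in\{\pm 1\}$, the Confidence Score equals $\mathrm{sigmoid}(|f(\mathbf{W},\mathbf{x})|)$, so it can be evaluated on unlabeled pool samples.

```python
import numpy as np


def relu(z):
    # Assumed activation; the paper's sigma may differ.
    return np.maximum(z, 0.0)


def confidence_score(f_values):
    """C(W, x) = max{s, 1 - s} with s = sigmoid(y * f), which equals sigmoid(|f|)."""
    return 1.0 / (1.0 + np.exp(-np.abs(f_values)))


def feature_map(W, X, act=relu):
    """Z(x, t): activations summed over filter groups j and the two patches.

    W: (2, m, d) filters (hypothetical layout); X: (n, 2, d). Returns (n, m).
    """
    pre = np.einsum("jmd,npd->njpm", W, X)  # inner products <w_{j,r}, x_patch>
    return act(pre).sum(axis=(1, 2))


def feature_distance(W, X_pool, X_labeled, p=2, act=relu):
    """l_p distance of each pool sample to the mean feature map of the labeled set."""
    center = feature_map(W, X_labeled, act).mean(axis=0)
    return np.linalg.norm(feature_map(W, X_pool, act) - center, ord=p, axis=1)


def random_query(pool_size, n_star, rng):
    """Random Sampling: n_star distinct pool indices, uniformly at random."""
    return rng.choice(pool_size, size=n_star, replace=False)


def uncertainty_query(f_values, n_star):
    """Uncertainty Sampling: indices of the n_star lowest confidence scores."""
    return np.argsort(confidence_score(f_values), kind="stable")[:n_star]


def diversity_query(distances, n_star):
    """Diversity Sampling: indices of the n_star largest feature distances."""
    return np.argsort(-np.asarray(distances), kind="stable")[:n_star]
```

Each query function returns indices into the pool, so the same downstream labeling and retraining code can be reused for all three rules.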

The newly acquired samples are provided to an oracle to obtain their ground-truth labels, and are then added to the training set. The whole procedure of the three querying algorithms is shown in Algorithm 1.

Testing Setting. The model performance at the initial stage (before querying) and at the stage after querying is measured by the test error on a test distribution $\mathcal{D}^{*}$:

$$L_{\mathcal{D}^{*}}^{0-1}(\mathbf{W}):=\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}^{*}}[y\cdot f(\mathbf{W},\mathbf{x})<0].\tag{3}$$

It is important to note that $\mathcal{D}^{*}$ shares the definition stated in Definition 1. However, its occurrence probability of the weak feature, denoted $p^{*}$, can be arbitrary: unlike the corresponding probability in the training distribution, it is not restricted to $p^{*}<0.5$. The test loss is defined as:

$$L_{\mathcal{D}^{*}}(\mathbf{W}):=\underset{(\mathbf{x},y)\sim\mathcal{D}^{*}}{\mathbb{E}}\,\ell[y\cdot f(\mathbf{W},\mathbf{x})].$$
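
In practice both test quantities are estimated from a finite sample drawn from $\mathcal{D}^{*}$. The sketch below shows such empirical estimates; it assumes that $\ell$ is the logistic loss $\ell(z)=\log(1+e^{-z})$, as the mention of the logistic loss above suggests, and the function names are hypothetical.

```python
import numpy as np


def zero_one_error(y, f_values):
    """Empirical estimate of P[y * f(W, x) < 0] over a test sample."""
    margins = np.asarray(y) * np.asarray(f_values)
    return float(np.mean(margins < 0))


def logistic_test_loss(y, f_values):
    """Empirical mean of l(y * f(W, x)), assuming l(z) = log(1 + exp(-z))."""
    margins = np.asarray(y) * np.asarray(f_values)
    # logaddexp(0, -z) = log(1 + exp(-z)), evaluated without overflow
    return float(np.mean(np.logaddexp(0.0, -margins)))
```
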
Algorithm 1 Querying Algorithms
Input:  Neural network $f(\cdot;\cdot)$, initial labeled set $\mathcal{D}_{n_{0}}:=\{\mathbf{x}^{(i)}\}_{i=1}^{n_{0}}\subseteq\mathcal{D}$, sampling pool $\mathcal{P}\subseteq\mathcal{D}$, test distribution $\mathcal{D}^{*}$, sample size $n^{*}=\widetilde{n}-n_{0}$, $\sigma_{0}$, $T$
1:  Initialize the neural network $f(\mathbf{W}^{(0)};\cdot)$
2:  for $t\leftarrow 1$ to $T$ do
3:     Train the neural network over $\mathcal{D}_{n_{0}}$ by $L_{S}(\mathbf{W})$
4:  end for
5:  Querying: Sample $n^{*}$ new samples from $\mathcal{P}$ based on the particular rule. The new samples $\mathcal{D}_{n^{*}}$ are labeled by the oracle and added to form the new labeled set $\mathcal{D}_{n_{1}}:=\mathcal{D}_{n_{0}}\cup\mathcal{D}_{n^{*}}$
6:  Initialize the neural network $f(\mathbf{W}^{(0)};\cdot)$
7:  for $t\leftarrow 1$ to $T$ do
8:     Train the neural network over $\mathcal{D}_{n_{1}}$ by $L_{S}(\mathbf{W})$
9:  end for
10:  Test the performance of the neural network $f(\mathbf{W}^{(T)};\cdot)$ over $\mathcal{D}^{*}$ and obtain $L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(T)}\right)$
11:  return $L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(T)}\right)$
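
The control flow of Algorithm 1 (train, query once, retrain from the same initialization) can be written as a short skeleton. This is a hedged sketch: `train_fn`, `score_fn`, and `oracle` are hypothetical callables supplied by the user, and the convention that a higher score means "query first" is an assumption of the example (for Uncertainty Sampling one would pass the negated Confidence Score).

```python
import numpy as np


def run_query_round(train_fn, score_fn, oracle, W0, X_lab, y_lab, X_pool, n_star):
    """One pass of the train / query / retrain procedure.

    train_fn(W, X, y) -> trained weights          (hypothetical)
    score_fn(W, X_pool, X_lab) -> scores, shape (len(X_pool),); higher is queried first
    oracle(indices) -> ground-truth labels of the selected pool samples
    """
    # Steps 1-4: train on the initial labeled set, starting from W0.
    W = train_fn(W0.copy(), X_lab, y_lab)

    # Step 5: pick the n_star highest-scoring pool samples and label them.
    scores = np.asarray(score_fn(W, X_pool, X_lab))
    picked = np.argsort(-scores, kind="stable")[:n_star]
    X_new = np.concatenate([X_lab, X_pool[picked]])
    y_new = np.concatenate([y_lab, oracle(picked)])

    # Steps 6-9: retrain from the same initialization on the enlarged set.
    W = train_fn(W0.copy(), X_new, y_new)
    return W, picked
```

Passing copies of `W0` to both training calls keeps the two stages on the same initialization, which is the property the procedure relies on.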

3 Theoretical Results

For both the initial stage and the second stage, we consider the learning period $0\leq t\leq T^{*}$, where $T^{*}=\eta^{-1}\operatorname{poly}\left(\varepsilon^{-1},d,n_{0},m\right)\geq\widetilde{\Omega}\left(\eta^{-1}\varepsilon^{-1}mn_{0}d^{-1}\sigma_{p}^{-2}\right)$ is the maximum admissible number of iterations for the initial stage. The following presents our main theory for linearly separable data. For non-linear XOR data, similar theoretical results are given in Appendix C.

We first apply the signal-noise decomposition technique of Cao et al. (2022a) to $\mathbf{w}_{j,r}^{(t)}$. By the update rule in (2), we can derive that there exist unique coefficients $\gamma_{j,r,l}^{(t)}$ and $\rho_{j,r,i}^{(t)}$ such that

$$\mathbf{w}_{j,r}^{(t)}=\mathbf{w}_{j,r}^{(0)}+j\cdot\sum_{l=1}^{2}\gamma_{j,r,l}^{(t)}\cdot\frac{\bm{\mu}_{l}}{\|\bm{\mu}_{l}\|_{2}^{2}}+\sum_{i=1}^{n}\rho_{j,r,i}^{(t)}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}\tag{4}$$

The normalization factors $\|\bm{\mu}_{l}\|_{2}^{-2}$ and $\|\bm{\xi}_{i}\|_{2}^{-2}$ lead to $\langle\mathbf{w}_{j,r}^{(t)},\bm{\mu}_{l}\rangle\approx\gamma_{j,r,l}^{(t)}$ and $\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\rangle\approx\rho_{j,r,i}^{(t)}$. Importantly, $\gamma_{j,r,l}^{(t)}$ characterizes the learning progress of feature $\bm{\mu}_{l}$, and $\rho_{j,r,i}^{(t)}$ characterizes the degree of noise memorization. Geometrically, $\gamma_{j,r,l}$ indicates how well the model filters integrate the low-dimensional patterns of the task-oriented features in the latent projection space, and $\rho_{j,r,i}$ quantifies the extent to which the model filters memorize the high-dimensional complex noise. By conducting a scale analysis of the two coefficients, we can then identify the cases where the model mainly captures the underlying patterns while avoiding excessive fitting of noise, which we refer to as benign overfitting. This analysis also helps us identify cases of harmful overfitting, where the model primarily memorizes noise and therefore generalizes poorly to new, unseen data.
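
The approximate identities $\langle\mathbf{w},\bm{\mu}_{l}\rangle\approx\gamma$ and $\langle\mathbf{w},\bm{\xi}_{i}\rangle\approx\rho$ can be checked numerically. The sketch below builds a single filter with $j=+1$ from chosen coefficients according to (4) and reads the coefficients back through inner products. It assumes orthogonal features, noise orthogonal to the features, a small initialization scale, and a large dimension $d$; the specific numbers and the function name are illustrative choices and are not taken from the paper.

```python
import numpy as np


def decomposition_check(d=20000, n=10, sigma_p=1.0, sigma_0=1e-4, j=1, seed=0):
    """Build w from (gamma, rho) as in the decomposition and read them back."""
    rng = np.random.default_rng(seed)

    # Two orthogonal features with different norms (assumption of this sketch).
    mu = np.zeros((2, d))
    mu[0, 0], mu[1, 1] = 2.0, 1.0

    # Gaussian noise patches, made orthogonal to both features.
    xi = sigma_p * rng.standard_normal((n, d))
    xi[:, :2] = 0.0

    gamma = np.array([0.8, 0.3])          # chosen signal coefficients
    rho = rng.uniform(-0.5, 0.5, size=n)  # chosen noise coefficients
    w0 = sigma_0 * rng.standard_normal(d)  # small Gaussian initialization

    w = (w0
         + j * (gamma / np.sum(mu**2, axis=1)) @ mu
         + (rho / np.sum(xi**2, axis=1)) @ xi)

    # Inner products with features and noise, next to the values they approximate.
    return mu @ w, j * gamma, xi @ w, rho


signal_ip, signal_ref, noise_ip, noise_ref = decomposition_check()
print(np.max(np.abs(signal_ip - signal_ref)), np.max(np.abs(noise_ip - noise_ref)))
```

The residuals come from the initialization term and from the small overlaps between distinct noise vectors; in this construction both shrink as $d$ grows and $\sigma_{0}$ decreases.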

Our findings then reveal that, in our case, both heuristic NAL methods inherently tend to query data with yet-to-be-learned features (i.e., features for which the model exhibits low $\gamma_{j,r,l}$). Consequently, the NNs are able to learn all types of features sufficiently, and thus exhibit benign overfitting even when the label complexity is quite limited.

To present our findings, we make the following assumptions.

Condition 3.1.

Suppose that:

  1. The initial training size $n_{0}$, the maximum admissible size after querying $\widetilde{n}$, and the width of the neural network $m$ satisfy $n_{0}=\Omega(\log(m/\delta),\,p^{-1}\log(1/\delta))$, $\widetilde{n}=O(p^{-1}\sigma_{p}^{4}d\|\bm{\mu}_{2}\|_{2}^{-4})$, and $m=\Omega(\log(n_{0}/\delta))$.

  2. The dimension $d$ is sufficiently large: $\forall l\in\{1,2\}$, $d=\Omega(\widetilde{n}\sigma_{p}^{-2}\|\bm{\mu}_{l}\|_{2}^{2}\log(T^{*}),\,\widetilde{n}^{2}\log(\widetilde{n}m/\delta)(\log(T^{*}))^{2})$.

  3. The standard deviation of the Gaussian initialization $\sigma_{0}$ is appropriately chosen such that $\forall l\in\{1,2\}$, $\sigma_{0}=O(\|\bm{\mu}_{l}\|_{2}^{-1}(\log(m/\delta))^{-1/2},\,\sigma_{p}^{-1}d^{-1}\widetilde{n}^{1/2})$. The learning rate $\eta$ of all algorithms satisfies $\eta=O(\sigma_{p}^{-2}d^{-1}\widetilde{n},\,\sigma_{p}^{-2}d^{-3/2}\widetilde{n}^{2}m(\log(\widetilde{n}/\delta))^{1/2})$.

The condition on $n_{0}$ guarantees that enough strong features exist in the initial training set with probability at least $1-O(e^{-n_{0}p})$, while the condition on $\widetilde{n}$ prevents the final training size from being so large that even passive learning would perform well with considerable probability. The requirement on $d$ ensures that the problem is in a sufficiently overparameterized setting, as in prior works (Chatterji and Long, 2021; Cao et al., 2022a; Frei et al., 2022; Kou et al., 2023b; Lu et al., 2023; Chidambaram et al., 2023). The conditions on $\sigma_{0}$ and $\eta$ guarantee that gradient descent can effectively minimize the empirical loss. A detailed discussion of the parameter settings is provided in Appendix B.

The following results illustrate the presence of benign overfitting (i.e., small training loss and small test error) as well as harmful overfitting (i.e., small training loss but large test error) in the three querying algorithms.

Proposition 3.2.

(Before Querying) At the initial stage before querying, $\forall\varepsilon>0$, under Condition 3.1, with probability at least $1-\delta$, there exists $t=\widetilde{O}\left(\eta^{-1}\varepsilon^{-1}mn_{0}d^{-1}\sigma_{p}^{-2}\right)$ such that the following hold for all three querying algorithms:

  1. The training loss converges to $\varepsilon$, i.e., $L_{S}\left(\mathbf{W}^{(t)}\right)\leq\varepsilon$.

  2. The test error remains at a constant level, i.e., $L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)=\Theta(1)\geq 0.12\cdot p^{*}$.

Proposition 3.2 describes harmful overfitting for all algorithms at the initial stage, which is not a surprise since the initial size $n_{0}$ is limited and insufficient for adequate learning. The following proposition then uncovers a crucial finding regarding the querying stage.

Proposition 3.3.

(Querying Stage) During querying, under the same conditions as Proposition 3.2, if $\|\bm{\mu}_{1}\|_{2}^{2}-\|\bm{\mu}_{2}\|_{2}^{2}=\Omega(\sigma_{p}^{2}(dn_{0}^{-1}\log(m/\delta^{\prime}))^{1/2})$, then with probability at least $1-\Theta(\delta+\delta^{\prime})$, both Uncertainty Sampling and Diversity Sampling pick the $n^{*}$ samples that exhibit the lowest $\underset{j,r}{\mathbb{E}}\gamma_{j,r,l}^{(t)}$. [Footnote 3: We can relax the requirement on the discrepancy of feature norms, as discussed in Appendix D.3. The specific choice made in our presentation is for the sake of simplicity and clarity.]

Proposition 3.3 provides a unifying insight: both NAL algorithms prioritize perplexing samples, that is, samples that exhibit a lack of learning progress (measured by $\underset{j,r}{\mathbb{E}}\gamma_{j,r,l}^{(t)}$). Lemma 4.1 indicates that these perplexing samples are essentially samples that contain weak and rare features. We discuss the nature of these perplexing samples in general cases in Appendix D.3. Our derivation of the following theorem reveals that the ability to prioritize these samples is the main contributor to the success of both NAL algorithms.

Theorem 3.4.

(After Querying) Suppose the sampling size $n^{*}$ of the three querying algorithms satisfies $C_{1}\sigma_{p}^{4}d\|\bm{\mu}_{2}\|_{2}^{-4}-pn_{0}/2\leq n^{*}=\Theta(\widetilde{n}-n_{0})\leq\widetilde{n}-n_{0}$, where $C_{1}$ is some positive constant. Then $\forall\varepsilon>0$, under the same conditions as Proposition 3.3, with probability more than $1-\Theta(\delta+\delta^{\prime})$, $\exists t=\widetilde{O}\left(\eta^{-1}\varepsilon^{-1}m(n_{0}+n^{*})d^{-1}\sigma_{p}^{-2}\right)$ such that:

  • For all three querying algorithms, the training loss converges to $\varepsilon$, i.e., $L_{S}\left(\mathbf{W}^{(t)}\right)\leq\varepsilon$.

  • The Uncertainty Sampling and Diversity Sampling algorithms have small test error: $L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)\leq\exp\left(\Theta\left(\dfrac{-\widetilde{n}\|\bm{\mu}_{l}\|_{2}^{4}}{\sigma_{p}^{4}d}\right)\right)$, $l\in\{1,2\}$.

  • The Random Sampling algorithm retains a constant-order test error: $L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)=\Theta(1)\geq 0.12\cdot p^{*}$.

Theorem 3.4 implies that the NAL algorithms achieve benign overfitting, whereas passive learning remains in the harmful overfitting regime. It is worth noting that as $p^{*}$ increases, the test error of Random Sampling tends to grow markedly, especially in out-of-distribution scenarios where $p^{*}>0.5>p$. In contrast, Uncertainty Sampling and Diversity Sampling consistently achieve low test errors regardless of the value of $p^{*}$, which highlights their superiority over Random Sampling.

Given that strategy-free passive learning can also adequately learn all types of features with ample data, the following corollary aims to show the extent to which NAL algorithms alleviate the burden of labeling.

Corollary 3.5.

(Label Complexity) Under the same conditions as stated in Theorem 3.4, with probability at least $1-\Theta(\delta+\delta^{\prime})$, we observe distinct label complexities for strategy-free passive learning and NAL algorithms in achieving Bayes-optimal generalization ability:

  • For a fully trained neural model, the label complexity $n_{CNN}$ requires $\Omega(p^{-1}\sigma_{p}^{2}d\|\bm{\mu}_{2}\|_{2}^{-4})$.

  • For the two NAL algorithms, the maximum label complexity $\widetilde{n}$ only requires $\Omega(\sigma_{p}^{2}d\|\bm{\mu}_{2}\|_{2}^{-4})$.

This corollary suggests that NAL algorithms can significantly reduce labeling effort, by a factor on the order of $\Theta(p^{-1})$. This holds true even without the requirement of disparity between feature norms, as demonstrated in Appendix D.3. Hence, we can conclude that NAL algorithms are effective in minimizing labeling effort, particularly in imbalanced data scenarios where the degree of discrimination or rarity varies within the data. Together with Proposition 3.2 and Theorem 3.4, this shows that the essence lies in NAL's capability to effectively grasp yet-to-be-learned features.
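As a rough numerical illustration of Corollary 3.5, the sketch below evaluates the two label-complexity orders with all hidden constants dropped, using the experimental values reported in Section 5 ($p=0.2$, $\sigma_p=1$, $d=2000$, $\|\bm{\mu}_2\|_2=3$). The function name `label_complexity_orders` is hypothetical, and the outputs are orders of magnitude, not the paper's exact sample sizes.

```python
def label_complexity_orders(p, sigma_p, d, mu2_norm):
    """Return (passive, active) label-complexity orders with constants dropped.

    passive ~ p^{-1} * sigma_p^2 * d * ||mu_2||^{-4}
    active  ~          sigma_p^2 * d * ||mu_2||^{-4}
    """
    base = sigma_p ** 2 * d / mu2_norm ** 4
    return base / p, base


# Values taken from the experimental setup; constants are ignored.
passive, active = label_complexity_orders(p=0.2, sigma_p=1.0, d=2000, mu2_norm=3.0)
print(f"passive order: {passive:.1f}, active order: {active:.1f}")
print(f"ratio passive/active: {passive / active:.1f}")  # equals 1/p
```

The ratio depends only on $p$, which is why the saving becomes more pronounced as the weak feature becomes rarer.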

4 Proof Sketch

In this section, we outline the proofs of our theory over linearly separable data. Here $n$ denotes the number of training samples in the current labeled set, which is $n_{0}$ at the initial stage and $n_{1}$ after sampling (querying). For $s\in\{1,2\}$, $l\in\{1,2\}$, $n_{s,l}$ denotes the number of samples with feature $\bm{\mu}_{l}$ at the initial stage ($s=1$) and at the stage after querying ($s=2$). For notational simplicity, we denote by $\tau_{1}$ and $\tau_{2}$ the proportions of data with the strong and the weak feature in the current dataset, respectively.

Here are the main challenges we faced and the techniques we used to address them:

  • The synthesis of $\mathcal{D}_{n_{0}}$, $\mathcal{P}$, and the final labeled set obtained through Random Sampling requires sequential martingale-type subset generations from the distribution $\mathcal{D}$, which poses a big challenge to our analysis. Our solution was to treat the results as independent binomial random variables, which allows us to conduct a reliable analysis with high-probability results by leveraging the properties of binomial tails.

  • During querying, NAL algorithms need to query the samples with the lowest Confidence Score or the highest Feature Distance from the entire sampling pool $\mathcal{P}$. This involves $\lvert\mathcal{P}\rvert(\lvert\mathcal{P}\rvert-1)/2$ comparison operations. To better scrutinize the sampling dynamics, we defined two full orders and conducted an order-dependent querying analysis to examine the high-probability events via combinatorial analysis.

  • Depicting the generalization capability of three different querying algorithms along the whole process was a big challenge. We addressed this by proposing a label complexity-based test error analysis regime, which allowed us to incorporate different scenarios into a single inferential process.
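The binomial-tail idea in the first item can be illustrated with a small sketch. It assumes that the number of weak-feature samples among $n$ randomly drawn labels is distributed as $\mathrm{Binomial}(n, p)$ and computes an exact upper tail. The helper name `binom_tail_at_least` and the threshold `k = 25` are hypothetical choices for illustration only.

```python
from math import comb


def binom_tail_at_least(n, p, k):
    """Exact P[Binomial(n, p) >= k], summed term by term."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumed setting: 40 labels drawn at random, weak-feature probability 0.2.
n_labels, p = 40, 0.2
expected_weak = n_labels * p
tail = binom_tail_at_least(n_labels, p, k=25)  # k is an illustrative threshold
print(f"expected weak samples: {expected_weak:.0f}")
print(f"P[at least 25 weak samples]: {tail:.3e}")
```

Under these assumed numbers the tail probability is tiny, which matches the intuition that Random Sampling rarely collects many weak-feature samples from a small budget.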

4.1 Feature Learning and Noise Memorization Analysis

Leveraging the inductive techniques adopted in many works (Cao et al., 2022a; Kou et al., 2023b; Meng et al., 2023; Kou et al., 2023c; Chen et al., 2023d), we can characterize the scales of the coefficients in our setting.

Lemma 4.1.

Under Condition 3.1, there exists $T_{1}=\Theta(\eta^{-1}nm\sigma_{p}^{2}d^{-1})$ such that for $t\in\left[T_{1},T^{*}\right]$ the following hold for all $j\in\{\pm 1\}$, $r\in[m]$ and $l\in\{1,2\}$:

  • $\sum_{i=1}^{n}\rho_{j,r,i}^{(t)}\cdot\mathbb{1}(\rho_{j,r,i}^{(t)}>0)=\Omega(n)$,

  • $\gamma_{j,r,l}^{(t)}=\Theta\left(\tau_{l}n\cdot\sigma_{p}^{-2}d^{-1}\|\bm{\mu}_{l}\|_{2}^{2}\right)$.

It is evident that there is a noticeable disparity in the learning efficiency of features, as $\gamma_{j,r,l}^{(t)}$ scales with both the data proportion $\tau_{l}$ and the squared feature norm $\|\bm{\mu}_{l}\|_{2}^{2}$. Furthermore, according to Lemma 17, we can model the data synthesis from $\mathcal{D}$ as a binomial variable. This allows effective control over the probability tails, resulting in $\tau_{2}=\Theta(p)$ and $\tau_{1}=\Theta(1-p)$. Thus, we can conclude that the perplexing samples are exactly the $\bm{\mu}_{2}$-equipped samples. We can now examine the querying stage closely.
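To make the disparity concrete, the following worked ratio plugs the experimental values of Section 5 ($p=0.2$, $\|\bm{\mu}_{1}\|_{2}=9$, $\|\bm{\mu}_{2}\|_{2}=3$) into the scale given by Lemma 4.1, taking $\tau_{1}\approx 1-p$ and $\tau_{2}\approx p$. The constants hidden in $\Theta(\cdot)$ are ignored, so this is an order-of-magnitude sketch rather than a prediction of the measured coefficients.

```latex
\[
\frac{\gamma_{j,r,1}^{(t)}}{\gamma_{j,r,2}^{(t)}}
\approx \frac{\tau_{1}\,\|\bm{\mu}_{1}\|_{2}^{2}}{\tau_{2}\,\|\bm{\mu}_{2}\|_{2}^{2}}
= \frac{0.8 \times 81}{0.2 \times 9}
= \frac{64.8}{1.8}
= 36 .
\]
```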

4.2 Order-dependent Sampling (Querying) Analysis

To rigorously analyze the statistics of the querying stage, we define two orders, namely the Uncertainty Order $\preceq_{C}^{(t)}$ and the Diversity Order $\preceq_{D}^{(t)}$. For all $\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{P}$, we have $\mathbf{x}^{\prime}\preceq_{C}^{(t)}\mathbf{x}$ if $C\left(\mathbf{W}^{(t)},\mathbf{x}^{\prime}\right)\geq C\left(\mathbf{W}^{(t)},\mathbf{x}\right)$, and $\mathbf{x}^{\prime}\preceq_{D}^{(t)}\mathbf{x}$ if $D\left(\mathbf{W}^{(t)},\mathbf{x}^{\prime}\mid\mathcal{D}_{n}\right)\leq D\left(\mathbf{W}^{(t)},\mathbf{x}\mid\mathcal{D}_{n}\right)$, $\forall p\in[1,\infty)$. Specifically, if the Confidence Scores of all elements in a set $\mathbf{X}$ at time step $t$ are all less than those in the set $\mathbf{X}^{\prime}$, we utilize the same notation to describe the Uncertainty Order between sets: $\mathbf{X}\preceq_{C}^{(t)}\mathbf{X}^{\prime}$. Similarly, we also have set-level notation for $\preceq_{D}^{(t)}$. The detailed definitions are deferred to Appendix F.
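Operationally, ranking the pool under either order and taking the top entries can be sketched as a sort over precomputed scores. The snippet below is a minimal illustration, not the authors' implementation: the function `query_by_order` is hypothetical, and the score lists are made-up numbers standing in for Confidence Scores and Feature Distances that would normally be computed from $\mathbf{W}^{(t)}$.

```python
def query_by_order(pool_scores, n_query, lowest_first=True):
    """Return indices of the n_query pool samples ranked first by score.

    lowest_first=True  mimics Uncertainty Sampling (lowest Confidence Score).
    lowest_first=False mimics Diversity Sampling (highest Feature Distance).
    """
    ranked = sorted(
        range(len(pool_scores)),
        key=lambda i: pool_scores[i],
        reverse=not lowest_first,
    )
    return ranked[:n_query]


# Toy scores for a pool of five samples (illustrative values only).
confidence = [0.90, 0.10, 0.80, 0.20, 0.95]
distance = [0.05, 0.70, 0.10, 0.60, 0.02]

uncertainty_pick = query_by_order(confidence, 2, lowest_first=True)
diversity_pick = query_by_order(distance, 2, lowest_first=False)
print(uncertainty_pick, diversity_pick)
```

In this toy pool both criteria select the same two samples, which mirrors the shared-principle message of the analysis, although real scores need not agree so neatly.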

The following lemma presents our important findings when examining the two orders of samples.

Lemma 4.2.

Under the same conditions as in Proposition 3.2, for $\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{P}$, denote by $\bm{\mu}_{l_{\mathbf{x}}}$ and $\bm{\mu}_{l_{\mathbf{x}^{\prime}}}$ the feature patches in $\mathbf{x}$ and $\mathbf{x}^{\prime}$, respectively, where $l_{\mathbf{x}},l_{\mathbf{x}^{\prime}}\in\{1,2\}$. It holds that

  • $\mathbf{x}^{\prime}\preceq_{C}^{(t)}\mathbf{x}$ has a sufficient event that

    $$\Big\{\underbrace{\Theta\big(\underset{r}{\mathbb{E}}(\gamma_{y^{\prime},r,l_{\mathbf{x}^{\prime}}})\big)-\Theta\big(\underset{r}{\mathbb{E}}(\gamma_{y,r,l_{\mathbf{x}}})\big)}_{\text{Learning Progress Disparity: Feature in $\mathbf{x}$ vs. Feature in $\mathbf{x}^{\prime}$}}>\max_{j,r,l}\big\{\big|\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\rangle\big|\big\}\Big\}.\tag{5}$$

  • $\mathbf{x}^{\prime}\preceq_{D}^{(t)}\mathbf{x}$ has a sufficient event that

    $$\Big\{\underbrace{\Big|\Theta\big(\underset{r}{\mathbb{E}}(\gamma_{y,r,l_{\mathbf{x}}})\big)-\sum_{l}\tau_{l}\cdot\Theta\big(\underset{i_{l}\in U_{0}^{l},r}{\mathbb{E}}(\gamma_{y_{i_{l}},r,l})\big)\Big|}_{\text{Learning Progress Disparity: Feature in $\mathbf{x}$ vs. Features in Initial Set}}-\underbrace{\Big|\Theta\big(\underset{r}{\mathbb{E}}(\gamma_{y^{\prime},r,l_{\mathbf{x}^{\prime}}})\big)-\sum_{l}\tau_{l}\cdot\Theta\big(\underset{i_{l}\in U_{0}^{l},r}{\mathbb{E}}(\gamma_{y_{i_{l}},r,l})\big)\Big|}_{\text{Learning Progress Disparity: Feature in $\mathbf{x}^{\prime}$ vs. Features in Initial Set}}>\max_{j,r,l}\big\{\big|\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\rangle\big|\big\}\Big\},\tag{6}$$

    where $U_{0}^{l}=\{\mathbf{x}\in\mathcal{D}_{0}\mid\mathbf{x}_{\text{signal part}}=\bm{\mu}_{l}\}$.

Remark 4.3.

This lemma demonstrates that Uncertainty Sampling relies on comparing the model's learning progress on the features in $\mathcal{P}$, as shown in (5). On the other hand, Diversity Sampling compares the disparities between the model's learning progress on candidate samples and on the labeled training set, as shown in (6).

Figure 2: Learning/memorization progress of features and noise ($\gamma_{l}$ represents $\max_{j,k}\gamma_{j,k,l}^{(t)}$, and $\rho$ represents $\max_{j,k,i}\rho_{j,k,i}^{(t)}$), train/test losses, and test accuracy of the fully trained model and the three querying algorithms, with $T^{*}=200$, $d=2000$, $\|\bm{\mu}_{1}\|=9$, $p=p^{*}=0.2$, $\|\bm{\mu}_{2}\|=3$, $n_{CNN}=200$, $n_{0}=10$, $n^{*}=30$ and $\lvert\mathcal{P}\rvert=190$. Panels: (a) Fully trained model; (b) Random Sampling; (c) Uncertainty Sampling; (d) Diversity Sampling.

We note that (6) does not depend on the choice of the $l_{p}$ distance metric (i.e., it holds $\forall p\in[1,\infty)$). This is because we can eliminate the scaling term $m^{\frac{1}{p}}$ on both sides of the inequality when examining the probability lower bound (see more details in Appendix G.4). Based on Lemma 4.1, event (5) and event (6) can both be simplified to the following shared sufficient event

$$\Big\{\Theta\big(\underset{j,r}{\mathbb{E}}(\gamma_{j,r,l_{\mathbf{x}^{\prime}}})\big)-\Theta\big(\underset{j,r}{\mathbb{E}}(\gamma_{j,r,l_{\mathbf{x}}})\big)>\max_{j,r,l}\big\{\big|\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\rangle\big|\big\}\Big\}.$$

This implies that the event $\{\mathbf{x}^{\prime}\preceq_{C}^{(t)}\mathbf{x}\}$ and the event $\{\mathbf{x}^{\prime}\preceq_{D}^{(t)}\mathbf{x}\}$ share a common sufficient condition: the model's learning of $\bm{\mu}_{l_{\mathbf{x}}}$ is considerably worse than its learning of $\bm{\mu}_{l_{\mathbf{x}^{\prime}}}$. Based on this observation and Lemma 4.1, we can deduce the following lemma with some effort.

Lemma 4.4.

Under the same conditions as Proposition 3.2, denote by $\mathbf{X}_{\mathcal{P}}^{1}\subsetneqq\mathcal{P}$ the collection of all data points with the strong feature $\bm{\mu}_{1}$ in $\mathcal{P}$, and by $\mathbf{X}_{\mathcal{P}}^{2}\subsetneqq\mathcal{P}$ the collection of data points with the weak feature $\bm{\mu}_{2}$. Then, with probability more than $1-\Theta(\delta^{\prime})$, $\mathbf{X}_{\mathcal{P}}^{1}\preceq_{C}^{(t)}\mathbf{X}_{\mathcal{P}}^{2}$ and $\mathbf{X}_{\mathcal{P}}^{1}\preceq_{D}^{(t)}\mathbf{X}_{\mathcal{P}}^{2}$ ($\forall p\in\left[1,\infty\right)$) both hold.

This lemma directly implies the result in Proposition 3.2.

4.3 Label Complexity-based Test Error Analysis

To assess the generalization ability of all three querying algorithms before and after querying, we establish a comprehensive analysis regime that examines the impact of label complexity for each type of feature on the test error, via a single inferential process. Specifically, we introduce the following lemma, employing a standard proof technique utilized in prior research (Chatterji and Long, 2021; Frei et al., 2022; Kou et al., 2023b; Meng et al., 2023).

Lemma 4.5.

Under Condition 3.1, $\forall\varepsilon>0$, $\exists\ t=\widetilde{O}\left(\eta^{-1}\varepsilon^{-1}mn_{0}d^{-1}\sigma_{p}^{-2}\right)$ such that the following hold before and after querying (i.e., $\forall s\in\{0,1\}$) for the three querying algorithms:

  • The training loss converges to $\varepsilon$, i.e., $L_{S}\left(\mathbf{W}^{(t)}\right)\leq\varepsilon$.

  • If $\forall l\in\{1,2\}$, $n_{s,l}\geq C_{1}\sigma_{p}^{4}d\|\bm{\mu}_{l}\|_{2}^{-4}$ holds, the test error achieves the Bayes-optimal level: $L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)\leq p^{*}_{1}\cdot\exp\left(\dfrac{-n_{s,1}\|\bm{\mu}_{1}\|_{2}^{4}}{C_{3}\sigma_{p}^{4}d}\right)+p^{*}_{2}\cdot\exp\left(\dfrac{-n_{s,2}\|\bm{\mu}_{2}\|_{2}^{4}}{C_{4}\sigma_{p}^{4}d}\right)$.

  • If $\exists l^{\prime}\in\{1,2\}$ such that $n_{s,l^{\prime}}\leq C_{2}\sigma_{p}^{4}d\|\bm{\mu}_{l^{\prime}}\|_{2}^{-4}$ holds, the test error stays at a constant level: $L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)\geq 0.12\cdot p^{*}_{l^{\prime}}$.

Here $p^{*}_{l}$ denotes the occurrence probability of feature $\bm{\mu}_{l}$, and $C_{1}$, $C_{2}$, $C_{3}$ and $C_{4}$ are some positive constants.

By Condition 3.1, along with the findings from Lemma 4.4 and Lemma 4.5, we can deduce that only the two NAL algorithms are able to obtain ample $\bm{\mu}_{2}$-equipped samples for adequate learning after querying, which supports the results in Proposition 3.2 and Theorem 3.4. Also, by Lemma 17 and Lemma 4.5, Random Sampling necessitates a label complexity that is approximately $\Theta(p^{-1})$ times larger to sufficiently learn $\bm{\mu}_{2}$. This finding aligns with the conclusions in Corollary 3.5.
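As a rough worked illustration of this counting argument, plug in the experimental values of Section 5 ($n_{0}=10$, $n^{*}=30$, $p=0.2$, $\sigma_{p}=1$, $d=2000$, $\|\bm{\mu}_{2}\|_{2}=3$). The calculation assumes, as Lemma 4.4 suggests holds with high probability, that the NAL queries consist only of $\bm{\mu}_{2}$-equipped samples and that the pool contains at least $n^{*}$ of them. The shorthand $n_{\text{weak}}$ for the number of weak-feature samples in the final labeled set is introduced here for illustration only. These are expected counts with all hidden constants ignored.

```latex
\begin{align*}
\text{Random Sampling:}\quad & \mathbb{E}[n_{\text{weak}}] = p\,(n_{0}+n^{*}) = 0.2 \times 40 = 8,\\
\text{NAL:}\quad & \mathbb{E}[n_{\text{weak}}] = p\,n_{0}+n^{*} = 2 + 30 = 32,\\
\text{Threshold order in Lemma 4.5:}\quad & \sigma_{p}^{4}\,d\,\|\bm{\mu}_{2}\|_{2}^{-4} = \frac{2000}{81} \approx 24.7 .
\end{align*}
```

Under these assumptions only the NAL count exceeds the threshold order, in line with the two regimes of Lemma 4.5.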

5 Experiments

In this section, we validate our theoretical analysis through simulations. Experiments on XOR data and other data settings are also conducted; please refer to Appendix E for further details.

Here we generate synthetic data exactly following Definition 1. Specifically, we set the dimensionality to $d=2000$, and the strengths of the strong and weak features to $\|\bm{\mu}_1\|_2=9$ and $\|\bm{\mu}_2\|_2=3$, respectively. For the occurrence probability, we let $p=p^*=0.2$. For the data sizes, we let $n_{CNN}=200$, $n_0=10$, $n^*=30$ and $\hat{n}=40$, and set $\lvert\mathcal{P}\rvert=190$. For model initialization, we let $\sigma_p=1$ and $\sigma_0=0.01$. The parameters are initialized using the default method in PyTorch, and the models are trained using gradient descent with a learning rate of 0.1 for 200 iterations at both the initial stage and the stage after sampling. All the data points are generated beforehand and shared by all the algorithms, so the results are fairly comparable.
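The data generation described above can be sketched as follows. This is a minimal sketch under stated assumptions, since Definition 1 is not restated in this section: we assume a two-patch sample made of one signal patch $y \cdot \bm{\mu}_l$ and one Gaussian noise patch with standard deviation $\sigma_p$, place $\bm{\mu}_1$ and $\bm{\mu}_2$ on two coordinate axes so that they are orthogonal, and use a hypothetical function name.

```python
import numpy as np

def make_feature_noise_data(n, d=2000, mu1_norm=9.0, mu2_norm=3.0,
                            p=0.2, sigma_p=1.0, seed=0):
    """Sample n two-patch points: (signal patch, noise patch). Assumed layout."""
    rng = np.random.default_rng(seed)
    # Orthogonal feature vectors placed on two coordinate axes (assumption).
    mu1 = np.zeros(d)
    mu1[0] = mu1_norm
    mu2 = np.zeros(d)
    mu2[1] = mu2_norm
    # Binary labels in {-1, +1}, drawn uniformly.
    y = rng.choice([-1.0, 1.0], size=n)
    # Each sample carries the weak and rare feature mu2 with probability p.
    is_weak = rng.random(n) < p
    features = np.where(is_weak[:, None], mu2, mu1)    # shape (n, d)
    signal = y[:, None] * features                     # label-aligned signal patch
    noise = rng.normal(0.0, sigma_p, size=(n, d))      # noise patch
    X = np.stack([signal, noise], axis=1)              # shape (n, 2, d)
    return X, y, is_weak

X, y, is_weak = make_feature_noise_data(n=200)
print(X.shape, is_weak.mean())
```

Generating one shared dataset up front with a fixed seed and reusing it across all query strategies mirrors the fair-comparison setup described above.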

Figure 2 illustrates the effectiveness of both Uncertainty Sampling and Diversity Sampling in comparison to Random Sampling and a fully trained ReLU CNN model with an ample quantity of training samples. It is evident that, at the initial stage, the learning of the weak & rare feature (quantified by $\gamma_2$) in hard-to-learn samples is significantly poorer than that of the strong & common feature (quantified by $\gamma_1$) in easy-to-learn samples. After querying, both NAL algorithms learn the weak & rare feature well and achieve test performance comparable to that of the fully trained model. In contrast, Random Sampling continues to exhibit limited learning progress on weak features and results in poor test accuracy. The results verify Proposition 3.2 and Theorem 3.4. Illustrations of the querying stage details are deferred to Appendix E.1.

6 Potential Extension and Implication for Practical NALs

In this section, we first explore the potential extensions of our findings to a broader theoretical realm, then elaborate on the practical implications derived from our theories.

Potential Extension to Multi-round NALs. The intrinsic principle we uncovered underlying both NAL methods is not tied to the single-round setting, and a fine-grained analysis can be conducted on complex iterative processes, as discussed in Appendix D.5.

Potential Extension to Broader NALs: BADGE (Ash et al., 2020) as an Exemplar. The key idea behind BADGE is to prioritize samples exhibiting large and diverse gradients. Our analysis reveals that such samples in our context tend to have smaller-scale latent representations ($\gamma_{j,r,l}$ is smaller) or more diverse gradient directions (many diverging $\gamma_{j,r,l}$) due to the non-increasing nature of the logistic loss function. These characteristics align with the cases described in Lemma 4.2, which in our context refers to samples with lower $\gamma_{j,r,l}$ that correspond to yet-to-be-learned features. Therefore, BADGE is well-grounded in the principles uncovered by our theoretical analysis. A more detailed discussion is provided in Appendix D.2.
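The link between small margins and large gradients can be checked numerically. The sketch below assumes the standard logistic loss $\ell(z) = \log(1 + e^{-z})$ on the margin $z = y f(\mathbf{x})$, for which the gradient scale is $-\ell'(z) = 1/(1 + e^{z})$; the function name and the margin values are hypothetical.

```python
import numpy as np

def logistic_grad_scale(margin):
    """Return -l'(z) = 1 / (1 + exp(z)) for the logistic loss l(z) = log(1 + exp(-z))."""
    margin = np.asarray(margin, dtype=float)
    return 1.0 / (1.0 + np.exp(margin))

# Margins from poorly learned (small) to well learned (large); illustrative values.
margins = np.array([0.0, 1.0, 3.0, 6.0])
scales = logistic_grad_scale(margins)
for z, s in zip(margins, scales):
    print(f"margin={z:.1f}  gradient scale={s:.4f}")
```

The gradient scale decreases monotonically with the margin, so samples whose features are still poorly learned (small margin) contribute the largest gradients, which is what a gradient-magnitude criterion picks up.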

Potential Extension to Examine Criteria Preference. Our test error results are based on the condition that there is a clear disparity in learning progress between different task-oriented features, under which both NALs inherently favor samples with yet-to-be-learned features. However, when this disparity is not prominent, as discussed in Appendix D.3.2, the behaviors of uncertainty-based and diversity-based sampling may diverge. For example, uncertainty sampling can more precisely prioritize samples with underexplored features when label budgets are not highly constrained. Conversely, diversity sampling may be preferred when label complexity is very limited, as it can enhance the model’s ability to capture diverse low-dimensional patterns. This argument is consistent with the claim in a recent survey (Zhan et al., 2021). Our theory also suggests that when the “easiness” of learning various task-oriented features is balanced, uniform random sampling may suffice, without clear advantages for NALs. Additionally, in scenarios of active fine-tuning where the task objective changes, the task-oriented representation could shift, reducing the effectiveness of NAL methods that leverage prior neural representations for sampling. In such cases, random sampling may already be a satisfactory choice. A refined discussion is in Appendix D.4.

Practical Lessons from Our Theoretical Results. Our theoretical analysis yields several important practical insights, as detailed in Appendix D.6. First, we find that NALs have the potential to surpass the performance of fully-trained neural networks. As corroborated by the results in Lu et al. (2023), NALs can more effectively balance the learning progress across features with different lengths. Additionally, our work suggests that techniques capable of capturing the meaningful orthogonal components of an NN’s features or gradients, such as ICA (Yamagiwa et al., 2023), could help identify samples underrepresented in the NN’s latent space. State-of-the-art methods like BADGE (Ash et al., 2020) leverage this idea on the gradient components.

7 Conclusion

In this work, we theoretically demystify and unify the primary query criteria-based NAL methods. We prove that they inherently prioritize perplexing samples, i.e., those with yet-to-be-learned features. This ensures adequate learning of all feature types, underpinning their strong generalization with limited labeled data. Future work can extend our theory to other complex NAL scenarios, such as multi-model committee and stream-based sampling. Additionally, the potential extensions and implications discussed in Section 6 represent valuable directions for further fine-grained exploration.

Acknowledgements

We thank the anonymous reviewers for their instrumental comments. DB and HW are supported in part by the Research Grants Council of the Hong Kong Special Administrative Region (Project No. CityU 11206622). WH was partially supported by JSPS KAKENHI (24K20848). TS was partially supported by JSPS KAKENHI (24K02905) and JST CREST (JPMJCR2115, JPMJCR2015).

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Aggarwal et al. [2014] Charu C Aggarwal, Xiangnan Kong, Quanquan Gu, Jiawei Han, and Philip S Yu. Active learning: A survey. In Data classification, pages 599–634. Chapman and Hall/CRC, 2014.
  • Allen-Zhu and Li [2023] Zeyuan Allen-Zhu and Yuanzhi Li. Towards understanding ensemble, knowledge distillation and self-distillation in deep learning. In The Eleventh International Conference on Learning Representations, 2023.
  • Allen-Zhu et al. [2019] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In Proceedings of the 36th International Conference on Machine Learning, volume 97, pages 242–252. PMLR, 2019.
  • Ash et al. [2020] Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. Deep batch active learning by diverse, uncertain gradient lower bounds. In International Conference on Learning Representations, 2020.
  • Ba et al. [2022] Jimmy Ba, Murat A Erdogdu, Taiji Suzuki, Zhichao Wang, Denny Wu, and Greg Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. In Advances in Neural Information Processing Systems, volume 35, pages 37932–37946. Curran Associates, Inc., 2022.
  • Balcan et al. [2006] Maria-Florina Balcan, Alina Beygelzimer, and John Langford. Agnostic active learning. In Proceedings of the 23rd international conference on Machine Learning, volume 148, pages 65–72, 2006.
  • Cai et al. [2013] Wenbin Cai, Ya Zhang, and Jun Zhou. Maximizing expected model change for active learning in regression. In 2013 IEEE 13th International Conference on Data Mining, pages 51–60, 2013.
  • Cao and Gu [2019a] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019a.
  • Cao and Gu [2019b] Yuan Cao and Quanquan Gu. Generalization error bounds of gradient descent for learning over-parameterized deep ReLU networks. arXiv preprint arXiv:1902.01384, 2019b.
  • Cao et al. [2020] Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, and Quanquan Gu. Towards understanding the spectral bias of deep learning. arXiv preprint arXiv:1912.01198, 2020.
  • Cao et al. [2022a] Yuan Cao, Zixiang Chen, Misha Belkin, and Quanquan Gu. Benign overfitting in two-layer convolutional neural networks. In Advances in Neural Information Processing Systems, volume 35, pages 25237–25250, 2022a.
  • Cao et al. [2022b] Yuan Cao, Quanquan Gu, and Mikhail Belkin. Risk bounds for over-parameterized maximum margin classification on sub-gaussian mixtures. arXiv preprint arXiv:2104.13628, 2022b.
  • Chatterji and Long [2021] Niladri S. Chatterji and Philip M. Long. Finite-sample analysis of interpolating linear classifiers in the overparameterized regime. Journal of Machine Learning Research, 22(129):1–30, 2021.
  • Chen et al. [2023a] Jinghui Chen, Yuan Cao, and Quanquan Gu. Benign overfitting in adversarially robust linear classification. In Robin J. Evans and Ilya Shpitser, editors, Proceedings of the Thirty-Ninth Conference on Uncertainty in Artificial Intelligence, volume 216, pages 313–323, 2023a.
  • Chen et al. [2021a] Yilan Chen, Wei Huang, Lam Nguyen, and Tsui-Wei Weng. On the equivalence between neural network and support vector machine. In Advances in Neural Information Processing Systems, volume 34, pages 23478–23490. Curran Associates, Inc., 2021a.
  • Chen et al. [2023b] Yongqiang Chen, Wei Huang, Kaiwen Zhou, Yatao Bian, Bo Han, and James Cheng. Understanding and improving feature learning for out-of-distribution generalization. In Thirty-seventh Conference on Neural Information Processing Systems, 2023b.
  • Chen et al. [2020] Zixiang Chen, Yuan Cao, Quanquan Gu, and Tong Zhang. A generalized neural tangent kernel analysis for two-layer neural networks. In Advances in Neural Information Processing Systems, volume 33, pages 13363–13373. Curran Associates, Inc., 2020.
  • Chen et al. [2021b] Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep ReLU networks? arXiv preprint arXiv:1911.12360, 2021b.
  • Chen et al. [2022] Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, and Yuanzhi Li. Towards understanding mixture of experts in deep learning. In Advances in Neural Information Processing Systems, volume 35, pages 23049–23062, 2022.
  • Chen et al. [2023c] Zixiang Chen, Yihe Deng, Yuanzhi Li, and Quanquan Gu. Understanding transferable representation learning and zero-shot transfer in CLIP. arXiv preprint arXiv:2310.00927, 2023c.
  • Chen et al. [2023d] Zixiang Chen, Junkai Zhang, Yiwen Kou, Xiangning Chen, Cho-Jui Hsieh, and Quanquan Gu. Why does sharpness-aware minimization generalize better than SGD? In Thirty-seventh Conference on Neural Information Processing Systems, 2023d.
  • Chidambaram et al. [2023] Muthu Chidambaram, Xiang Wang, Chenwei Wu, and Rong Ge. Provably learning diverse features in multi-view data with midpoint mixup. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 5563–5599, 2023.
  • Cho et al. [2024] Seong Jin Cho, Gwangsu Kim, Junghyun Lee, Jinwoo Shin, and Chang D. Yoo. Querying easily flip-flopped samples for deep active learning. arXiv preprint arXiv:2401.09787, 2024.
  • Deng et al. [2023] Yihe Deng, Yu Yang, Baharan Mirzasoleiman, and Quanquan Gu. Robust learning with progressive data expansion against spurious correlation. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • Devroye et al. [2023] Luc Devroye, Abbas Mehrabian, and Tommy Reddad. The total variation distance between high-dimensional gaussians with the same mean. arXiv preprint arXiv:1810.08693, 2023.
  • Du et al. [2015] Bo Du, Zengmao Wang, Lefei Zhang, Liangpei Zhang, Wei Liu, Jialie Shen, and Dacheng Tao. Exploring representativeness and informativeness for active learning. IEEE Transactions on Cybernetics, 47(1):14–26, 2015.
  • Duan et al. [2024] Ruxiao Duan, Brian Caffo, Harrison X. Bai, Haris I. Sair, and Craig Jones. Evidential uncertainty quantification: A variance-based perspective. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 2132–2141, January 2024.
  • Frei et al. [2022] Spencer Frei, Niladri S Chatterji, and Peter Bartlett. Benign overfitting without linearity: Neural network classifiers trained by gradient descent for noisy linear data. In Proceedings of Thirty Fifth Conference on Learning Theory, volume 178, pages 2668–2703, 2022.
  • Frei et al. [2023] Spencer Frei, Niladri S. Chatterji, and Peter L. Bartlett. Random feature amplification: Feature learning and generalization in neural networks. Journal of Machine Learning Research, 24(303):1–49, 2023.
  • Gal et al. [2017] Yarin Gal, Riashat Islam, and Zoubin Ghahramani. Deep Bayesian active learning with image data. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 1183–1192, 2017.
  • Gissin and Shalev-Shwartz [2019] Daniel Gissin and Shai Shalev-Shwartz. Discriminative active learning. arXiv preprint arXiv:1907.06347, 2019.
  • Gu [2014] Quanquan Gu. Online and Active Learning of Big Networks: Theory and Algorithms. Dissertation, University of Illinois at Urbana-Champaign, Urbana, IL, 09 2014.
  • Gu et al. [2014] Quanquan Gu, Tong Zhang, and Jiawei Han. Batch-mode active learning via error bound minimization. In Uncertainty in Artificial Intelligence - Proceedings of the 30th Conference, UAI 2014, pages 300–309, 2014.
  • Houlsby et al. [2011] Neil Houlsby, Ferenc Huszár, Zoubin Ghahramani, and Máté Lengyel. Bayesian active learning for classification and preference learning. arXiv preprint arXiv:1112.5745, 2011.
  • Huang et al. [2020] Wei Huang, Weitao Du, Richard Yi Da Xu, and Chunrui Liu. Implicit bias of deep linear networks in the large learning rate phase. arXiv preprint arXiv:2011.12547, 2020.
  • Huang et al. [2021] Wei Huang, Weitao Du, and Richard Yi Da Xu. On the neural tangent kernel of deep networks with orthogonal initialization. arXiv preprint arXiv:2004.05867, 2021.
  • Huang et al. [2022] Wei Huang, Yayong Li, Weitao Du, Jie Yin, Richard Yi Da Xu, Ling Chen, and Miao Zhang. Towards deepening graph neural networks: A GNTK-based optimization perspective. arXiv preprint arXiv:2103.03113, 2022.
  • Huang et al. [2023a] Wei Huang, Yuan Cao, Haonan Wang, Xin Cao, and Taiji Suzuki. Graph neural networks provably benefit from structural information: A feature learning perspective. arXiv preprint arXiv:2306.13926, 2023a.
  • Huang et al. [2023b] Wei Huang, Chunrui Liu, Yilan Chen, Richard Yi Da Xu, Miao Zhang, and Tsui-Wei Weng. Analyzing deep PAC-Bayesian learning with neural tangent kernel: Convergence, analytic generalization bound, and efficient hyperparameter selection. Transactions on Machine Learning Research, 2023b. ISSN 2835-8856.
  • Huang et al. [2023c] Wei Huang, Ye Shi, Zhongyi Cai, and Taiji Suzuki. Understanding convergence and generalization in federated learning through feature learning theory. In The Twelfth International Conference on Learning Representations, 2023c.
  • Jacot et al. [2020] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2020.
  • Jiang et al. [2024] Yibo Jiang, Goutham Rajendran, Pradeep Ravikumar, Bryon Aragam, and Victor Veitch. On the origins of linear representations in large language models. arXiv preprint arXiv:2403.03867, 2024.
  • Joshi et al. [2009] Ajay J. Joshi, Fatih Porikli, and Nikolaos Papanikolopoulos. Multi-class active learning for image classification. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2372–2379, 2009.
  • Kampffmeyer et al. [2016] Michael Kampffmeyer, Arnt-Borre Salberg, and Robert Jenssen. Semantic segmentation of small objects and modeling of uncertainty in urban remote sensing images using deep convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2016.
  • Karp et al. [2021] Stefani Karp, Ezra Winston, Yuanzhi Li, and Aarti Singh. Local signal adaptivity: Provable feature learning in neural networks beyond kernels. In Advances in Neural Information Processing Systems, volume 34, pages 24883–24897, 2021.
  • Kim and Suzuki [2024] Juno Kim and Taiji Suzuki. Transformers learn nonlinear features in context: Nonconvex mean-field dynamics on the attention landscape. arXiv preprint arXiv:2402.01258, 2024.
  • Kim et al. [2024] Juno Kim, Kakei Yamamoto, Kazusato Oko, Zhuoran Yang, and Taiji Suzuki. Symmetric mean-field Langevin dynamics for distributional minimax problems. In The Twelfth International Conference on Learning Representations, 2024.
  • Kong et al. [2022] Seo Taek Kong, Soomin Jeon, Dongbin Na, Jaewon Lee, Hong-Seok Lee, and Kyu-Hwan Jung. A neural pre-conditioning active learning algorithm to reduce label complexity. In Advances in Neural Information Processing Systems, volume 35, pages 32842–32853, 2022.
  • Kou et al. [2023a] Yiwen Kou, Zixiang Chen, Yuan Cao, and Quanquan Gu. How does semi-supervised learning with pseudo-labelers work? a case study. In The Eleventh International Conference on Learning Representations, 2023a.
  • Kou et al. [2023b] Yiwen Kou, Zixiang Chen, Yuanzhou Chen, and Quanquan Gu. Benign overfitting in two-layer ReLU convolutional neural networks. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 17615–17659, 2023b.
  • Kou et al. [2023c] Yiwen Kou, Zixiang Chen, and Quanquan Gu. Implicit bias of gradient descent for two-layer ReLU and leaky ReLU networks on nearly-orthogonal data. In Thirty-seventh Conference on Neural Information Processing Systems, 2023c.
  • Kye et al. [2023] Seong Min Kye, Kwanghee Choi, Hyeongmin Byun, and Buru Chang. TiDAL: Learning training dynamics for active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 22335–22345, October 2023.
  • Lewis and Catlett [1994] David D. Lewis and Jason Catlett. Heterogeneous uncertainty sampling for supervised learning. In Machine Learning Proceedings 1994, pages 148–156. Morgan Kaufmann, 1994.
  • Li et al. [2023] Hongkang Li, Meng Wang, Sijia Liu, and Pin-Yu Chen. A theoretical understanding of shallow vision transformers: Learning, generalization, and sample complexity. In The Eleventh International Conference on Learning Representations, 2023.
  • Li and Liang [2018] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, volume 31, 2018.
  • Lu et al. [2023] Miao Lu, Beining Wu, Xiaodong Yang, and Difan Zou. Benign oscillation of stochastic gradient descent with large learning rate. In NeurIPS 2023 Workshop on Mathematics of Modern Machine Learning, 2023.
  • Mei et al. [2018] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
  • Mei et al. [2019] Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Mean-field theory of two-layers neural networks: Dimension-free bounds and kernel limit. In Proceedings of the Thirty-Second Conference on Learning Theory, volume 99, pages 2388–2464. PMLR, 25–28 Jun 2019.
  • Meng et al. [2023] Xuran Meng, Difan Zou, and Yuan Cao. Benign overfitting in two-layer ReLU convolutional neural networks for XOR data. arXiv preprint arXiv:2310.01975, 2023.
  • Mohamadi et al. [2022] Mohamad Amin Mohamadi, Wonho Bae, and Danica J. Sutherland. Making look-ahead active learning strategies feasible with neural tangent kernels. In Advances in Neural Information Processing Systems, volume 35, pages 12542–12553, 2022.
  • Nitanda [2024] Atsushi Nitanda. Improved particle approximation error for mean field neural networks. arXiv preprint arXiv:2405.15767, 2024.
  • Nitanda and Suzuki [2017] Atsushi Nitanda and Taiji Suzuki. Stochastic particle gradient descent for infinite ensembles. arXiv preprint arXiv:1712.05438, 2017.
  • Nitanda et al. [2021] Atsushi Nitanda, Denny Wu, and Taiji Suzuki. Particle dual averaging: optimization of mean field neural networks with global convergence rate analysis. In Advances in Neural Information Processing Systems, volume 34, pages 19608–19621, 2021.
  • Nitanda et al. [2022] Atsushi Nitanda, Denny Wu, and Taiji Suzuki. Convex analysis of the mean field langevin dynamics. In Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151, pages 9741–9757. PMLR, 2022.
  • Nitanda et al. [2023a] Atsushi Nitanda, Kazusato Oko, Denny Wu, Nobuhito Takenouchi, and Taiji Suzuki. Primal and dual analysis of entropic fictitious play for finite-sum problems. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 26266–26282. PMLR, 2023a.
  • Nitanda et al. [2023b] Atsushi Nitanda, Kazusato Oko, Denny Wu, Nobuhito Takenouchi, and Taiji Suzuki. Primal and dual analysis of entropic fictitious play for finite-sum problems. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 26266–26282. PMLR, 2023b.
  • Nitanda et al. [2024] Atsushi Nitanda, Kazusato Oko, Taiji Suzuki, and Denny Wu. Improved statistical and computational complexity of the mean-field langevin dynamics under structured data. In The Twelfth International Conference on Learning Representations, 2024.
  • Oko et al. [2022] Kazusato Oko, Taiji Suzuki, Atsushi Nitanda, and Denny Wu. Particle stochastic dual coordinate ascent: Exponential convergent algorithm for mean field neural network optimization. In International Conference on Learning Representations, 2022.
  • Otto and Villani [2000] Felix Otto and Cédric Villani. Generalization of an inequality by Talagrand and links with the logarithmic Sobolev inequality. Journal of Functional Analysis, 173(2):361–400, 2000.
  • Roth and Small [2006] Dan Roth and Kevin Small. Margin-based active learning for structured output spaces. In Machine Learning: ECML 2006, pages 413–424, 2006.
  • Rotskoff and Vanden-Eijnden [2018] Grant M. Rotskoff and Eric Vanden-Eijnden. Trainability and accuracy of neural networks: An interacting particle system approach. arXiv preprint arXiv:1805.00915, 2018.
  • Sener and Savarese [2018] Ozan Sener and Silvio Savarese. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
  • Settles [2009] Burr Settles. Active learning literature survey. Technical Report TR1648, University of Wisconsin-Madison Department of Computer Sciences, 2009.
  • Seung et al. [1992] H Sebastian Seung, Manfred Opper, and Haim Sompolinsky. Query by committee. In Proceedings of the fifth annual workshop on Computational Learning theory, pages 287–294, 1992.
  • Shui et al. [2020] Changjian Shui, Fan Zhou, Christian Gagné, and Boyu Wang. Deep active learning: unified and principled method for query and training. In International Conference on Artificial Intelligence and Statistics, pages 1308–1318, 2020.
  • Sinha et al. [2019] Samarth Sinha, Sayna Ebrahimi, and Trevor Darrell. Variational adversarial active learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
  • Sirignano and Spiliopoulos [2020] Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A central limit theorem. Stochastic Processes and their Applications, 130(3):1820–1852, 2020.
  • Stark et al. [2015] Fabian Stark, Caner Hazırbas, Rudolph Triebel, and Daniel Cremers. CAPTCHA recognition with active deep learning. In Workshop new challenges in Neural computation, volume 2015, page 94, 2015.
  • Suzuki et al. [2023a] Taiji Suzuki, Atsushi Nitanda, and Denny Wu. Uniform-in-time propagation of chaos for the mean-field gradient langevin dynamics. In The Eleventh International Conference on Learning Representations, 2023a.
  • Suzuki et al. [2023b] Taiji Suzuki, Denny Wu, and Atsushi Nitanda. Convergence of mean-field langevin dynamics: Time-space discretization, stochastic gradient, and variance reduction. In Advances in Neural Information Processing Systems, volume 36, pages 15545–15577. Curran Associates, Inc., 2023b.
  • Suzuki et al. [2023c] Taiji Suzuki, Denny Wu, Kazusato Oko, and Atsushi Nitanda. Feature learning via mean-field langevin dynamics: Classifying sparse parities and beyond. In Advances in Neural Information Processing Systems, volume 36, pages 34536–34556. Curran Associates, Inc., 2023c.
  • Takezoe et al. [2023] Rinyoichi Takezoe, Xu Liu, Shunan Mao, Marco Tianyu Chen, Zhanpeng Feng, Shiliang Zhang, and Xiaoyu Wang. Deep active learning for computer vision: Past and future. APSIPA Transactions on Signal and Information Processing, 12(1):–, 2023.
  • Tian et al. [2023] Yuandong Tian, Yiping Wang, Beidi Chen, and Simon Du. Scan and snap: Understanding training dynamics and token composition in 1-layer transformer. arXiv preprint arXiv:2305.16380, 2023.
  • Tian et al. [2024] Yuandong Tian, Yiping Wang, Zhenyu Zhang, Beidi Chen, and Simon Du. JoMA: Demystifying multilayer transformers via joint dynamics of MLP and attention. arXiv preprint arXiv:2310.00535, 2024.
  • Vershynin [2018] Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018.
  • Wainwright [2019] Martin J Wainwright. High-Dimensional Statistics: A Non-Asymptotic Viewpoint, volume 48. Cambridge University Press, 2019.
  • Wang et al. [2022a] Haonan Wang, Wei Huang, Ziwei Wu, Hanghang Tong, Andrew J Margenot, and Jingrui He. Deep active learning by leveraging training dynamics. In Advances in Neural Information Processing Systems, volume 35, pages 25171–25184, 2022a.
  • Wang et al. [2016] Keze Wang, Dongyu Zhang, Ya Li, Ruimao Zhang, and Liang Lin. Cost-effective active learning for deep image classification. IEEE Transactions on Circuits and Systems for Video Technology, 27(12):2591–2600, 2016.
  • Wang et al. [2022b] Tianyang Wang, Xingjian Li, Pengkun Yang, Guosheng Hu, Xiangrui Zeng, Siyu Huang, Cheng-Zhong Xu, and Min Xu. Boosting active learning via improving test performance. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, pages 8566–8574, 2022b.
  • Wang et al. [2021] Zhilei Wang, Pranjal Awasthi, Christoph Dann, Ayush Sekhari, and Claudio Gentile. Neural active learning with performance guarantees. In Advances in Neural Information Processing Systems, volume 34, pages 7510–7521, 2021.
  • Wen et al. [2023] Ziting Wen, Oscar Pizarro, and Stefan Williams. NTKCPL: Active learning on top of self-supervised model by estimating true coverage. arXiv preprint arXiv:2306.04099, 2023.
  • Yamagiwa et al. [2023] Hiroaki Yamagiwa, Momose Oyama, and Hidetoshi Shimodaira. Discovering universal geometry in embeddings with ICA. arXiv preprint arXiv:2305.13175, 2023.
  • Yang and Hu [2022] Greg Yang and Edward J. Hu. Feature learning in infinite-width neural networks. arXiv preprint arXiv:2011.14522, 2022.
  • Yang and Loog [2016] Yazhou Yang and Marco Loog. Active learning using uncertainty information. In 2016 23rd International Conference on Pattern Recognition (ICPR), pages 2646–2651, 2016.
  • Yehudai and Shamir [2019] Gilad Yehudai and Ohad Shamir. On the power and limitations of random features for understanding neural networks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • Yin et al. [2017] Changchang Yin, Buyue Qian, Shilei Cao, Xiaoyu Li, Jishang Wei, Qinghua Zheng, and Ian Davidson. Deep similarity-based batch mode active learning with exploration-exploitation. In 2017 IEEE International Conference on Data Mining (ICDM), pages 575–584, 2017.
  • Yu et al. [2023] Yaodong Yu, Sam Buchanan, Druv Pai, Tianzhe Chu, Ziyang Wu, Shengbang Tong, Benjamin D. Haeffele, and Yi Ma. White-box transformers via sparse rate reduction. arXiv preprint arXiv:2306.01129, 2023.
  • Zhan et al. [2021] Xueying Zhan, Huan Liu, Qing Li, and Antoni B. Chan. A comparative survey: Benchmarking for pool-based active learning. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI-21, pages 4679–4686, 2021.
  • Zhan et al. [2022] Xueying Zhan, Qingzhong Wang, Kuan-Hao Huang, Haoyi Xiong, Dejing Dou, and Antoni B. Chan. A comparative survey of deep active learning. arXiv preprint arXiv:2203.13450, 2022.
  • Zhdanov [2019] Fedor Zhdanov. Diverse mini-batch active learning. arXiv preprint arXiv:1901.05954, 2019.
  • Zhu and Nowak [2022] Yinglun Zhu and Robert Nowak. Active learning with neural networks: Insights from nonparametric statistics. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 142–155. Curran Associates, Inc., 2022.
  • Zou et al. [2020] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep ReLU networks. Machine Learning, 109:467–492, 2020.
  • Zou et al. [2023a] Difan Zou, Yuan Cao, Yuanzhi Li, and Quanquan Gu. The benefits of mixup for feature learning. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 43423–43479, 2023a.
  • Zou et al. [2023b] Difan Zou, Yuan Cao, Yuanzhi Li, and Quanquan Gu. Understanding the generalization of adam in learning neural networks with proper regularization. In The Eleventh International Conference on Learning Representations, 2023b.

Appendix A Additional Related Work: Theory of Feature Learning in Overparameterized Neural Network

The rapid progress of deep neural networks has prompted growing interest in understanding their underlying theoretical principles, particularly regarding the optimization and generalization properties of overparameterized models. A key development in this area is the study of the Neural Tangent Kernel (NTK) [Jacot et al., 2020, Chen et al., 2020, Cao and Gu, 2019a, b, Cao et al., 2020, Allen-Zhu et al., 2019, Chen et al., 2021b, Zou et al., 2020, Huang et al., 2020, Chen et al., 2021a, Huang et al., 2021, 2022, 2023b, Yang and Hu, 2022]. This has provided powerful insights into the training dynamics of wide neural networks, revealing that their behavior in the $\ell_2$-loss setting closely mirrors function approximation in reproducing kernel Hilbert spaces (RKHS), where the kernel is associated with the network architecture. However, this line of research suggests that the parameter update dynamics can be approximated by the first-order Taylor expansion at initialization, so that a sufficiently wide NN effectively performs linear regression over a prescribed feature map. Such an analysis cannot characterize the NN’s ability to perform feature learning [Yang and Hu, 2022].
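For concreteness, the first-order approximation underlying the NTK regime can be written as follows. This is the standard form of the linearization, stated here in notation we assume for illustration: $f(\mathbf{x}; \mathbf{W})$ denotes the network output and $\mathbf{W}^{(0)}$ the parameters at initialization.

```latex
f(\mathbf{x}; \mathbf{W}) \approx f(\mathbf{x}; \mathbf{W}^{(0)})
  + \left\langle \nabla_{\mathbf{W}} f(\mathbf{x}; \mathbf{W}^{(0)}),\, \mathbf{W} - \mathbf{W}^{(0)} \right\rangle
```

The feature map $\mathbf{x} \mapsto \nabla_{\mathbf{W}} f(\mathbf{x}; \mathbf{W}^{(0)})$ is fixed at initialization, which is why training in this regime amounts to linear regression over prescribed features rather than feature learning.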

In parallel, an active research direction is the analysis of NNs under the mean-field regime [Mei et al., 2018, 2019], which allows the network parameters to evolve away from the initialization, thereby enabling feature learning for various target functions [Ba et al., 2022, Suzuki et al., 2023c]. Recently, Mean-Field Langevin Dynamics (MFLD) has attracted increased attention, where Gaussian noise is added to the gradient to encourage “exploration” [Mei et al., 2018, Suzuki et al., 2023b]. This framework lifts the learning of finite-width neural networks to an infinite-dimensional optimization problem in the space of probability measures, and by exploiting the convexity of the loss function in this measure space, MFLD can achieve near-optimal global convergence under gradient-based optimization [Nitanda and Suzuki, 2017, Mei et al., 2018, Nitanda et al., 2021, 2022, 2023a, 2023b, 2024, Oko et al., 2022, Otto and Villani, 2000, Rotskoff and Vanden-Eijnden, 2018, Sirignano and Spiliopoulos, 2020, Suzuki et al., 2023a, c, b, Nitanda, 2024, Kim et al., 2024, Kim and Suzuki, 2024]. Despite the remarkable ability of NNs under the MFLD regime to learn complex “features”, their superior performance still requires a large width on the order of $e^{O(d)}$ [Suzuki et al., 2023c]. Moreover, the optimization behavior of MFLD differs from that of the widely applied SGD-based neural network algorithms, leaving the real-world feature learning phenomenon of commonly used deep learning algorithms largely unexplained.

To overcome the technical challenges and shed light on the practical feature learning observed in GD/SGD-based learning algorithms, the seminal work by Allen-Zhu and Li [2023] takes a step forward. That work first attempted to explain the observed success of ensemble methods in deep learning by adopting the NTK framework, but recognized the limitations of this approach. To tackle this challenge and fill the understanding gap, Allen-Zhu and Li [2023] consider a multi-view data model, which is a more complex version of the data model examined in the main body of our work. Allen-Zhu and Li [2023] justify this multi-view data model as a plausible theoretical setup by empirically demonstrating the common occurrence of multiple one-task-oriented features in the latent space of ResNet, as shown in their Figures 2-4 and 9. Given the plausibility and suitability of this data setting for theoretical investigations of feature learning dynamics, a considerable body of research has examined the capabilities of different learning algorithms under different structured conditions [Li and Liang, 2018, Karp et al., 2021, Yehudai and Shamir, 2019, Cao et al., 2022b, Chen et al., 2022, 2023b, 2023c, 2023a, 2023d, Zou et al., 2023b, Li et al., 2023, Kou et al., 2023b, a, c, Meng et al., 2023, Huang et al., 2023a, c, Chidambaram et al., 2023, Deng et al., 2023, Frei et al., 2023, Tian et al., 2023, 2024]. Notably, the width requirement for this line of research is considerably weaker than in the NTK and MFLD regimes, which allows for a more fine-grained analysis of feature learning dynamics based on inner product-based feature direction reconstruction.

We believe our work extends this line of research by showing that NAL based on either of the two primary criteria inherently prioritizes underrepresented samples with yet-to-be-learned features. We hope this insight can help the community gain a deeper understanding of heuristic NAL methods and develop new principled approaches that alleviate the data hungriness of deep learning.

Appendix B Discussions on the Parameter Settings

In this section, we motivate the settings of our systems and discuss the consequences of violating the requirements.

B.1 Choice of Systems

We would like to motivate our choice of systems in detail as below.

  • The system of learning dynamic: $d,n,m,\|\bm{\mu}\|,\sigma_{0},\eta$. The choice of $d,n,m$ aligns with the feature learning line of research [Li and Liang, 2018, Karp et al., 2021, Frei et al., 2022, Chen et al., 2022, Allen-Zhu and Li, 2023, Chen et al., 2023b, c, a, d, Zou et al., 2023b, Li et al., 2023, Kou et al., 2023a, Huang et al., 2023a, Kou et al., 2023c, Chidambaram et al., 2023, Deng et al., 2023, Huang et al., 2023c], with the aim of ensuring the learning problem is in a small but sufficiently overparameterized regime where benign overfitting (an overparameterized NN generalizing well when trained to convergence) could occur. This phenomenon is non-trivial against the prior belief that overfitting is always harmful, i.e., that the greater the capacity of a model to fit the data distribution, the worse the model’s test results will be. The system chosen allows for analysis of the learning progress of features, as the weak requirement on network width $m$ allows us to conduct a fine-grained analysis based on inner product arguments (i.e., scale analysis of $\gamma,\rho$), which fundamentally differs from the NTK line of research [Jacot et al., 2020] that requires an infinitely wide network to perform linear regression over a prescribed feature map, rather than learning the features themselves. Moreover, this system ensures a small Signal-to-Noise Ratio (SNR), under which the memorization of noise becomes the primary contributor to the volume of the NN’s weight matrices, allowing more balanced and controllable coefficient updates [Kou et al., 2023b, Meng et al., 2023].

  • The system of sampling dynamic: $\widetilde{n},n_{0},n^{*},p,\lvert\mathcal{P}\rvert,\|\bm{\mu}_{1}\|,\|\bm{\mu}_{2}\|,\sigma_{p}$. The choice of this system is to (i) avoid the cases where all sampling methods would succeed or fail simultaneously, and (ii) ensure there is a marked learning progress disparity between well-learned and yet-to-be-learned features within the initial stage. The reason to maintain these conditions is to help reveal the underlying rationale behind NAL. It’s worth noting that we also provide discussions in Appendix D.3 on the general settings beyond the specific system chosen in the main body of the work. In these broader scenarios, there might be various patterns in the learning progress of features.

In all, although the two systems interact and operate together, they have distinct tasks. The first system is tailored to the non-trivial learning problem at hand. Meanwhile, the choice of the second system aims to help reveal the non-trivial connections between the two NAL methods, by closely tracking the learning progress of task-oriented features after sampling.

B.2 Consequences of Violating System Requirements

The following outlines the consequences that may arise when the requirements on the systems are violated:

  1. The choice of $d$. A large $d$ technically ensures that the per-sample loss contributions are of a controllable order during training, preventing any individual sample’s noise from exerting outsized influence on the dynamics. When $d$ decreases with respect to $n,m$, the bounds on the order of $\left\langle\frac{\bm{\mu}_{l}}{\|\bm{\mu}_{l}\|},\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|}\right\rangle,\left\langle\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|},\frac{\bm{\xi}_{j}}{\|\bm{\xi}_{j}\|}\right\rangle,\left\langle\mathbf{w}_{j,r}^{(0)},\frac{\bm{\mu}_{l}}{\|\bm{\mu}_{l}\|}\right\rangle,\left\langle\mathbf{w}_{j,r}^{(0)},\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|}\right\rangle,\ \forall l,\ i\neq j$, listed in Appendix G.1 no longer hold with high probability, and our technical results on training convergence can no longer be guaranteed to hold with high probability. Also, a small $d$ leads to a large Signal-to-Noise Ratio (SNR), where the memorization of noise is no longer the dominant factor in the NN’s weight matrix volume. As a result, the automatic balance of coefficient updates technique of Kou et al. [2023b], Meng et al. [2023], which serves as a convenient lever to observe the bounds on the coefficients and matrix volume update, no longer applies.

  2. The choices of occurrence probability $p$, initial size $n_{0}$, query size $n^{*}$, pool size $\lvert\mathcal{P}\rvert$, feature norm $\|\bm{\mu}_{l}\|$ jointly determine the sampling results.

    • Combinations of $p,\|\bm{\mu}_{l}\|$ reflect the diverse “easiness” to learn particular features, leading to varied sampling scenarios as discussed in Appendix D.3.2.

    • As $p,n_{0}$ and $n^{*}$ increase, the chance of getting all features well-learned goes up, reducing NAL’s advantage over random sampling as discussed in Appendix D.3.2.

    • Lower $p$ values (e.g. $p<0.5$) allow NAL to better alleviate labeling efforts by prioritizing the samples with yet-to-be-learned features, but if $p\rightarrow 0$ or $\lvert\mathcal{P}\rvert$ decreases, there might be few yet-to-be-learned features in the pool, limiting NAL’s ability to select enough of them to ensure sufficient learning, as discussed in Appendix D.3.2.

    • Smaller $n_{0}$ may limit the learning of all features at initial stage, and all sampling methods might behave similarly since all types of features require further learning as discussed in Appendix D.3.2. Decreases in $n_{0}$, $n^{*}$, and $\lvert\mathcal{P}\rvert$ would make it challenging to reliably control the proportions of samples as in Lemma 17.

  3. The choices of $\sigma_{0}$ and $\eta$ aim to ensure effective optimization via GD. As $\sigma_{0}$ grows, the model has a stronger “belief” that is harder to change. While analysis under larger $\eta$ is also doable [Lu et al., 2023], a small $\eta$ is preferred to better present our main findings.

We believe our findings remain non-trivial across these parameter variations.
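The role of a large $d$ in item 1 can be illustrated numerically for the noise-noise term only. The sketch below is not the bound from Appendix G.1; it simply draws i.i.d. standard Gaussian vectors (an assumption made here for illustration) and reports the largest normalized inner product between distinct pairs, which shrinks roughly like $1/\sqrt{d}$. The function name `max_abs_cosine` is hypothetical.

```python
import numpy as np


def max_abs_cosine(n: int, d: int, seed: int = 0) -> float:
    """Largest |cosine| between distinct pairs among n i.i.d. Gaussian vectors in R^d."""
    rng = np.random.default_rng(seed)
    xi = rng.standard_normal((n, d))
    # Normalize each vector so that inner products become cosines.
    xi /= np.linalg.norm(xi, axis=1, keepdims=True)
    gram = xi @ xi.T
    # Ignore the diagonal (each vector with itself).
    np.fill_diagonal(gram, 0.0)
    return float(np.abs(gram).max())


if __name__ == "__main__":
    for d in (50, 500, 5000):
        print(d, round(max_abs_cosine(n=20, d=d), 3))
```

As $d$ shrinks relative to the number of samples, distinct noise vectors stop being nearly orthogonal, which is one way to see why the high-probability bounds become harder to maintain.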

Appendix C Theoretical Results: XOR data version

In a similar vein to the theoretical results on linearly separable data, we now present a theory specifically tailored for XOR data. The purpose or effect of each result is similar to those obtained for linearly separable data, so we will omit the detailed description of each result. The experiments and proofs can be found in Appendix E.3 and Appendix H.

Definition C.1.

[Meng et al., 2023] Let $\mathbf{a},\mathbf{b}\in\mathbb{R}^{d}\backslash\{\mathbf{0}\}$ with $\mathbf{a}\perp\mathbf{b}$ be two fixed vectors. For $\bm{\mu}\in\mathbb{R}^{d}$ and $\bar{y}\in\{\pm 1\}$, we say that $\bm{\mu}$ and $\bar{y}$ are jointly generated from distribution $\mathcal{D}_{\mathrm{XOR}}(\mathbf{a},\mathbf{b})$ if the pair $(\bm{\mu},\bar{y})$ is randomly and uniformly drawn from the set $\{(\mathbf{a}+\mathbf{b},+1),(-\mathbf{a}-\mathbf{b},+1),(\mathbf{a}-\mathbf{b},-1),(-\mathbf{a}+\mathbf{b},-1)\}$.

Definition C.2.

For $l\in\{1,2\}$, let $\{\mathbf{a}_{1},\mathbf{b}_{1}\}\perp\{\mathbf{a}_{2},\mathbf{b}_{2}\}\in\mathbb{R}^{d}\backslash\{\mathbf{0}\}$, with $\mathbf{a}_{l}\perp\mathbf{b}_{l}$, be two pairs of fixed vectors satisfying $\|\mathbf{a}_{l}\|^{2}+\|\mathbf{b}_{l}\|^{2}=\|\bm{\mu}_{l}\|_{2}^{2}$, where $\|\bm{\mu}_{l}\|_{2}$ represents feature strength. Then each data point $(\mathbf{x},y)$ with $\mathbf{x}=\left[\mathbf{x}^{(1)\top},\mathbf{x}^{(2)\top}\right]^{\top}\in\mathbb{R}^{2d}$ and $y\in\{\pm 1\}$ is generated from $\mathcal{D}$ as follows:

  • Feature Patch. For a small $p$ satisfying $p<0.5$, one patch of $\mathbf{x}$ is randomly selected as the feature patch. With high probability $1-p$ the feature patch $\mathbf{x}_{1}$ is the easy-to-learn feature $\bm{\mu}_{1}$, while only with probability $p$ the feature patch is the hard-to-learn feature $\bm{\mu}_{2}$. $\bm{\mu}_{l}\in\mathbb{R}^{d}$ and $\bar{y}\in\{\pm 1\}$ are jointly generated from $\mathcal{D}_{\mathrm{XOR}}(\mathbf{a}_{l},\mathbf{b}_{l})$.

  • Noise Patch. The other patch of $\mathbf{x}$ is assigned as a randomly generated Gaussian vector $\bm{\xi}\sim N\left(\mathbf{0},\sigma_{p}^{2}\cdot\left(\mathbf{I}-\sum_{l}{(\mathbf{a}_{l}\mathbf{a}_{l}^{\top}/\|\mathbf{a}_{l}\|^{2}-\mathbf{b}_{l}\mathbf{b}_{l}^{\top}/\|\mathbf{b}_{l}\|^{2})}\right)\right)$.

  • The ground truth label y is synthesized from a Rademacher distribution.

Here we assume the two types of feature differ: $(1-p)\|\bm{\mu}_{1}\|_{2}^{4}=\Omega(\sigma_{p}^{4}dn_{0}^{-1})$ and $p\|\bm{\mu}_{2}\|_{2}^{4}=O(\sigma_{p}^{4}dn_{0}^{-1})$. Also, we assume the noise cannot completely disturb the learning of features: $\widetilde{n}\|\bm{\mu}_{l}\|_{2}^{4}=\Omega(\sigma_{p}^{4}d),\ l\in\{1,2\}$.
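The generative process in Definition C.2 can be sketched in NumPy as follows. This is an illustrative sampler under stated assumptions, not the authors' code: the label is taken to be $y=\bar{y}$, the noise is drawn as an isotropic Gaussian and then projected onto the orthogonal complement of $\mathrm{span}\{\mathbf{a}_{1},\mathbf{b}_{1},\mathbf{a}_{2},\mathbf{b}_{2}\}$ (the evident intent of the covariance in the Noise Patch item), and the name `sample_xor_data` and all numeric values are hypothetical.

```python
import numpy as np


def sample_xor_data(n, d, a, b, p, sigma_p, rng):
    """Draw n points; a, b have shape (2, d), row 0 = easy feature, row 1 = hard feature."""
    # Orthonormal basis of span{a_1, b_1, a_2, b_2}; used to remove feature directions from noise.
    q, _ = np.linalg.qr(np.stack([a[0], b[0], a[1], b[1]], axis=1))
    # Rows: (coefficient of a, coefficient of b, label), i.e. the four XOR outcomes.
    outcomes = np.array([[1, 1, 1], [-1, -1, 1], [1, -1, -1], [-1, 1, -1]])
    x = np.zeros((n, 2, d))
    y = np.zeros(n, dtype=int)
    is_hard = rng.random(n) < p  # hard-to-learn feature appears with probability p
    for i in range(n):
        l = int(is_hard[i])
        ca, cb, y[i] = outcomes[rng.integers(4)]
        mu = ca * a[l] + cb * b[l]
        xi = sigma_p * rng.standard_normal(d)
        xi -= q @ (q.T @ xi)  # project noise away from the feature directions
        k = rng.integers(2)  # which of the two patches carries the feature
        x[i, k] = mu
        x[i, 1 - k] = xi
    return x.reshape(n, 2 * d), y, is_hard


rng = np.random.default_rng(0)
d = 64
e = np.eye(d)
a = np.stack([2.0 * e[0], 1.0 * e[2]])  # a_1 (easy), a_2 (hard)
b = np.stack([2.0 * e[1], 1.0 * e[3]])  # b_1 (easy), b_2 (hard)
X, y, is_hard = sample_xor_data(1000, d, a, b, p=0.2, sigma_p=1.0, rng=rng)
```

The returned `is_hard` mask is not part of the data model; it is kept only so that one can inspect how often the hard-to-learn feature appears in a drawn sample.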

For $(\mathbf{x},y)\sim\mathcal{D}$ in Definition C.2, it’s safe to say that:

$(\mathbf{x},y)\stackrel{d}{=}(-\mathbf{x},y)$, and therefore $\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}}(y\cdot\langle\bm{\theta},\mathbf{x}\rangle>0)=1/2$ for any $\bm{\theta}\in\mathbb{R}^{2d}$.

In other words, all linear predictors will provably fail to learn the XOR-type data $\mathcal{D}$.
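The step from the sign symmetry to the probability $1/2$ can be spelled out as follows. This is a short sketch that additionally assumes $\langle\bm{\theta},\mathbf{x}\rangle\neq 0$ almost surely, which excludes degenerate choices such as $\bm{\theta}=\mathbf{0}$.

```latex
\begin{aligned}
\mathbb{P}\big(y\langle\bm{\theta},\mathbf{x}\rangle>0\big)
  &= \mathbb{P}\big(y\langle\bm{\theta},-\mathbf{x}\rangle>0\big)
  && \text{since } (\mathbf{x},y)\stackrel{d}{=}(-\mathbf{x},y) \\
  &= \mathbb{P}\big(y\langle\bm{\theta},\mathbf{x}\rangle<0\big), \\
\mathbb{P}\big(y\langle\bm{\theta},\mathbf{x}\rangle>0\big)
  + \mathbb{P}\big(y\langle\bm{\theta},\mathbf{x}\rangle<0\big) &= 1
  \quad\Longrightarrow\quad
  \mathbb{P}\big(y\langle\bm{\theta},\mathbf{x}\rangle>0\big)=\tfrac{1}{2}.
\end{aligned}
```

A linear predictor $\operatorname{sign}\langle\bm{\theta},\mathbf{x}\rangle$ is therefore correct on exactly half of the distribution, no better than random guessing.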

Condition C.3.

For certain $\varepsilon,\delta>0$, suppose that

  1. The initial training size $n_{0}$, the maximum admissible size after querying $\widetilde{n}$, and the width of neural network $m$ satisfy $n_{0}=\Omega(\log(m/\delta),p^{-1}\log(1/\delta))$, $\widetilde{n}=O(p^{-1}\sigma_{p}^{4}d\|\bm{\mu}_{2}\|_{2}^{-4})$, $m=\Omega(\log(\widetilde{n}/\delta))$.

  2. The dimension $d$ satisfies: $d=\widetilde{\Omega}\left(\widetilde{n}^{2},\widetilde{n}\|\bm{\mu}_{l}\|_{2}^{2}\sigma_{p}^{-2}\right)\cdot\operatorname{polylog}(1/\varepsilon)\cdot\operatorname{polylog}(1/\delta)$, for $l\in\{1,2\}$.

  3. Random initialization scale $\sigma_{0}$ satisfies: $\sigma_{0}\leq\widetilde{O}\left(\min\left\{\sqrt{n_{0}}/\left(\sigma_{p}d\right),n_{0}\|\bm{\mu}_{l}\|_{2}/\left(\sigma_{p}^{2}d\right)\right\}\right)$, for $l\in\{1,2\}$; the learning rate $\eta$ satisfies: $\eta=\widetilde{O}\left(\left[\max\left\{\sigma_{p}^{2}d^{3/2}/\left(n_{0}^{2}\sqrt{m}\right),\sigma_{p}^{2}d/(n_{0}m)\right\}\right]^{-1}\right)$.

  4. The angle $\theta$ between $\mathbf{a}_{l}+\mathbf{b}_{l}$ and $\mathbf{a}_{l}-\mathbf{b}_{l}$ satisfies $\cos\theta<1/2$, for all $l\in\{1,2\}$.
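The angle requirement in the last item can be restated in terms of the norms of $\mathbf{a}_{l}$ and $\mathbf{b}_{l}$. The following short derivation uses only $\mathbf{a}_{l}\perp\mathbf{b}_{l}$ from Definition C.2 and is offered as a reading aid rather than as part of the original condition.

```latex
\begin{aligned}
\langle \mathbf{a}_{l}+\mathbf{b}_{l},\ \mathbf{a}_{l}-\mathbf{b}_{l}\rangle
  &= \|\mathbf{a}_{l}\|^{2}-\|\mathbf{b}_{l}\|^{2},
\qquad
\|\mathbf{a}_{l}\pm\mathbf{b}_{l}\|^{2} = \|\mathbf{a}_{l}\|^{2}+\|\mathbf{b}_{l}\|^{2}, \\
\cos\theta
  &= \frac{\|\mathbf{a}_{l}\|^{2}-\|\mathbf{b}_{l}\|^{2}}{\|\mathbf{a}_{l}\|^{2}+\|\mathbf{b}_{l}\|^{2}} < \frac{1}{2}
\quad\Longleftrightarrow\quad
\|\mathbf{a}_{l}\|^{2} < 3\,\|\mathbf{b}_{l}\|^{2}.
\end{aligned}
```

For example, $\|\mathbf{a}_{l}\|=\|\mathbf{b}_{l}\|$ gives $\cos\theta=0$, so the two XOR directions are orthogonal and the condition holds.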

Proposition C.4.

(Before Querying) For any $\varepsilon,\delta>0$, if Condition C.3 holds, when the probability of the appearance of weak feature in each data point generated from the testing distribution $\mathcal{D}^{*}$ is $p^{*}$, then with probability at least $1-2\delta$, the following results hold at a certain $t=\Omega\left(n_{0}m/\left(\eta\varepsilon\sigma_{p}^{2}d\right)\right)$:

  • The training loss converges below $\varepsilon$, i.e., $L_{S}\left(\mathbf{W}^{(t)}\right)\leq\varepsilon$.

  • The test error remains at a sub-optimal constant level: $L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)\geq p^{*}\cdot 0.12$.

Proposition C.5.

(Querying Stage) During Querying, under the same conditions as Proposition C.4, if $(1-p)\|\bm{\mu}_{1}\|_{2}^{2}-p\|\bm{\mu}_{2}\|_{2}^{2}=\Omega({\sigma_{p}}^{2}d^{1/2}n_{0}^{-1/2}(\log(m/\delta^{\prime}))^{1/2})$ and the size of the sampling pool $\lvert\mathcal{P}\rvert$ is adequately substantial, satisfying: $\lvert\mathcal{P}\rvert=\Omega(p^{-1}\sigma_{p}^{4}d\|\bm{\mu}_{2}\|_{2}^{-4},p^{-1}\log(1/\delta))$, then with probability at least $1-\Theta(\delta+\delta^{\prime})$, both Uncertainty Sampling and Diversity Sampling pick samples with hard-to-learn features $\bm{\mu}_{2}$ in $\mathcal{P}$.
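The precise query rules are defined in the main body and are not reproduced in this appendix. As a rough illustration only, the sketch below uses two common stand-ins that are assumptions of this example rather than the paper's definitions: uncertainty sampling as picking the pool points with the smallest absolute network output $|f(\mathbf{x})|$, and diversity sampling as greedy farthest-point (k-center style) selection in some representation space. The function names are hypothetical.

```python
import numpy as np


def uncertainty_query(scores, n_query):
    """Indices of the n_query pool samples with the smallest |f(x)| (least confident)."""
    return np.argsort(np.abs(scores))[:n_query]


def diversity_query(pool_feats, labeled_feats, n_query):
    """Greedy farthest-point selection: repeatedly pick the pool point farthest from
    everything already labeled or already chosen."""
    # Distance from each pool point to its nearest labeled point.
    diffs = pool_feats[:, None, :] - labeled_feats[None, :, :]
    dists = np.linalg.norm(diffs, axis=2).min(axis=1)
    chosen = []
    for _ in range(n_query):
        idx = int(np.argmax(dists))
        chosen.append(idx)
        # The newly chosen point now also counts as covered.
        new_d = np.linalg.norm(pool_feats - pool_feats[idx], axis=1)
        dists = np.minimum(dists, new_d)
    return np.array(chosen)
```

In the setting of Proposition C.5, samples carrying the under-learned feature $\bm{\mu}_{2}$ would be expected to have small outputs and to lie far from the mostly easy-feature labeled set, which is why both stand-in rules would tend to favor them.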

Theorem C.6.

(After Querying) If the sampling size $n^{*}$ of the two types of sampling algorithm satisfies $\dfrac{\hat{C}_{1}\sigma_{p}^{4}d}{\|\bm{\mu}_{2}\|_{2}^{4}}-\dfrac{pn_{0}}{2}\leq n^{*}=\Theta(\widetilde{n}-n_{0})\leq\widetilde{n}-n_{0}$, where $\hat{C}_{1}$ is some positive constant, then, under the same conditions as Proposition C.5 and with $\mathcal{D}^{*}$ and $p^{*}$ defined as in Proposition C.4, with probability at least $1-\Theta(\delta+\delta^{\prime})$, the following results hold at a certain $t=\Omega\left((n_{0}+n^{*})m/\left(\eta\varepsilon\sigma_{p}^{2}d\right)\right)$:

  • For both the Random Sampling method and Uncertainty Sampling method, the training loss converges to $\varepsilon$, i.e., $L_{S}\left(\mathbf{W}^{(t)}\right)\leq\varepsilon$.

  • Uncertainty Sampling and Diversity Sampling algorithms both have negligible test error: $L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)\leq\exp\left(\Theta\left(\dfrac{-\widetilde{n}\|\bm{\mu}_{l}\|_{2}^{4}}{\sigma_{p}^{4}d}\right)\right),\ l\in\{1,2\}$.

  • The Random Sampling algorithm would remain at a sub-optimal constant-level test error: $L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)\geq p^{*}\cdot 0.12$.

Appendix D Discussions over General Scenarios

Our findings align with the concept of “Active Learning,” in which models resemble students who actively select valuable practice questions (samples) to prepare for exams (tasks). Students prioritize perplexing questions, either those whose answers they are highly uncertain about or those covering rare knowledge points (features), in order to improve their understanding of yet-to-be-mastered knowledge points that appear in test questions. As with students, for most black-box deep neural models the “learning progress” on a particular “feature” is not readily available to the algorithm developer because of their inherent opacity. From a feature learning view, this is why NAL algorithms need to prioritize yet-to-be-learned features indirectly: doing so is the key to good generalization and to achieving benign overfitting. Our study shows that uncertainty-based and diversity-based NAL inherently strive to prioritize samples carrying yet-to-be-learned features (i.e., perplexing samples) via different comparisons in a heuristic manner. We believe future work can investigate whether interpretable models [Yu et al., 2023] can reduce labelling efforts by prioritizing perplexing samples.

Below, we present several discussions regarding general scenarios and the potential wider applicability of our theorems, beyond the specific conditions considered in the main body of our work. It is important to note that our point-mass querying approach and one-round querying settings were adopted to unveil the inherent principle of query criteria-based NAL algorithms in a rigorous manner, although other, more complex NAL algorithms may be better suited to real-world complex data distributions and the corresponding tasks. Note that our multiple task-oriented feature-noise data models follow those in Allen-Zhu and Li [2023], Chen et al. [2022, 2023b, 2023c, 2023a, 2023d], Zou et al. [2023b], Li et al. [2023], Kou et al. [2023a, c], which empirically mirror the latent representations of models like ResNet [Allen-Zhu and Li, 2023] or Transformers [Yamagiwa et al., 2023, Jiang et al., 2024].

D.1 Discussion of the Role of Benign Oscillation

Lu et al. [2023] analyze the role of a large learning rate in the context of feature learning. Their data modeling includes weak features present in each data point, strong features present in a small fraction of data points, and noise. Although our work differs in terms of the data modeling and analysis framework, we might also observe the impact of a large learning rate. In Figures 2, 5, and 7, we can see that Uncertainty Sampling and Diversity Sampling algorithms empirically outperform the fully-trained model. Drawing insights from the results in Lu et al. [2023], we attribute this phenomenon to the large learning rate, which drives the model to focus more on weak and rare features during training. It is worth noting that our training loss does not exhibit the benign oscillation phenomenon mentioned in Lu et al. [2023]; this is probably due to the difference in optimization algorithms (GD with logistic loss in our work versus SGD with square loss in Lu et al. [2023]).

D.2 Potential Extension over State-of-the-art and Criteria-combined NALs: BADGE as an Example

We believe our analysis can be extended to explain the success of methods like BADGE [Ash et al., 2020] that combine uncertainty and diversity criteria, since such methods share the common principle of prioritizing samples with yet-to-be-learned features. Like the inner product arguments in prior theoretical results [Li and Liang, 2018, Karp et al., 2021, Allen-Zhu and Li, 2023, Chen et al., 2022, 2023b, 2023c, 2023a, 2023d, Zou et al., 2023b, Li et al., 2023, Kou et al., 2023a, Huang et al., 2023a, Kou et al., 2023c, Chidambaram et al., 2023, Deng et al., 2023, Huang et al., 2023c], our theory characterizes learning progress via the coefficients $\gamma_{j,r,l}$, which, at a high level, represent how well the NN has integrated low-dimensional task-oriented patterns into its latent space. We believe the underlying principle of BADGE [Ash et al., 2020] aligns well with this view:

  • Core idea of BADGE. The key idea behind BADGE is to query samples that exhibit large and diverse gradients within a single batch, achieved through $k$-MEANS++ or $k$-DPP in the pseudo gradient space.

  • Connection between gradient and latent space of NN. Since our analysis utilizes the widely used non-increasing logistic loss, the smaller the magnitude of the latent representation, the larger the magnitude of the gradient embedding will be. Additionally, the diversity of the latent vectors’ directions will be preserved in the gradient space. Based on Lemma G.15, we see that the rows of the latent representations are roughly of the same order as $\gamma_{j,r,l}^{(t)}$.

  • BADGE also prioritizes samples with yet-to-be-learned features. It follows that BADGE tends to prioritize samples with smaller-scale latent representations (smaller $\gamma_{j,r,l}$) or more diverse directions (many diverging $\gamma_{j,r,l}$). These samples correspond to the cases described in Lemma 4.2, which in our context are the samples with lower $\gamma_{j,r,l}$, i.e., those with yet-to-be-learned features.

Therefore, we claim that BADGE, in the context of our analysis regime, can be explained as a well-motivated NAL method. The key reason is that the two core ideas of BADGE align with the shared underlying rationale of NAL that we have uncovered. One direction of our future work is to give a fine-grained analysis of the success factors behind BADGE, and we also believe our theoretical framework has the potential to extend to the understanding of some other state-of-the-art methods.
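The link between a low margin, a large gradient, and diversity-aware selection can be illustrated with a minimal sketch. It is not the BADGE implementation of Ash et al. [2020]. It assumes a binary logistic loss $\log(1+e^{-z})$ on the margin, a pseudo-label given by the sign of the network output, and a hypothetical latent vector per sample. The function names are illustrative, and starting from the largest-norm embedding is a choice made for this sketch.

```python
import math
import random

def logistic_slope(margin):
    """|d/dz log(1 + exp(-z))| = 1 / (1 + exp(z)); decreasing in the margin z."""
    return 1.0 / (1.0 + math.exp(margin))

def gradient_embedding(output, latent):
    """Last-layer gradient of the logistic loss under the pseudo-label sign(output)."""
    pseudo_label = 1.0 if output >= 0 else -1.0
    scale = -pseudo_label * logistic_slope(abs(output))
    return [scale * h for h in latent]

def sq_dist(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeanspp_select(embeddings, k, rng):
    """k-means++ style seeding over gradient embeddings.

    The first index is the largest-norm embedding (a choice of this sketch);
    each later index is drawn with probability proportional to its squared
    distance to the closest embedding chosen so far.
    """
    zero = [0.0] * len(embeddings[0])
    first = max(range(len(embeddings)), key=lambda i: sq_dist(embeddings[i], zero))
    chosen = [first]
    while len(chosen) < k:
        weights = [min(sq_dist(e, embeddings[c]) for c in chosen) for e in embeddings]
        if sum(weights) == 0.0:  # all remaining points coincide with chosen ones
            break
        chosen.append(rng.choices(range(len(embeddings)), weights=weights, k=1)[0])
    return chosen
```

In this sketch, a confidently fit sample (large $|f(\mathbf{x})|$) receives a slope factor close to zero, so its embedding sits near the origin and is rarely drawn, whereas low-margin samples pointing in different latent directions are far apart and are favored by the seeding step.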

D.3 Extension over Data Distribution under Other Conditions

The theory presented in our main study focuses on a data model that includes weak and rare features, strong and common features, and noise. This setting is motivated by real-world imbalanced datasets, as illustrated in Figure 1. However, thanks to our general analysis framework, we can also discuss more general scenarios with broader conditions. In the following sections, we first discuss a version of the theory that relaxes the conditions on feature norms. This case suggests that rare features may also possess sufficiently discriminative label-related features, such as Simba in the last row of Figure 1, even though they are rare occurrences in the overall data distribution. Secondly, we introduce more general theoretical results. While our discussions below focus on results for linearly separable data, we assert that the same results hold for non-linearly separable XOR data, as the requirements for the parameters are indeed similar. The proofs of all results in this section can be readily obtained based on our results in Appendices G.4, H.3, G.5 or H.4.

To start, we present the condition-relaxed version of Proposition G.16, which describes the ordering of samples in $\mathcal{P}$ under relaxed conditions. Here we denote by $\tau_{l}$ the proportion of $\bm{\mu}_{l}$-equipped data in $\mathcal{D}_{n_{0}}$.

Proposition D.1.

(Proposition G.16 with relaxed conditions on feature norms) Under Condition 3.1, there exists $t=\widetilde{O}\left(\eta^{-1}\varepsilon^{-1}mnd^{-1}\sigma_{p}^{-2}\right)$ such that for all $\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{P}\subsetneq\mathcal{D}$, where $\mathbf{x}$ contains feature patch $y\cdot\bm{\mu}_{2}$ while $\mathbf{x}^{\prime}$ contains feature patch $y^{\prime}\cdot\bm{\mu}_{1}$, with probability at least $1-8m\exp\left\{-\Theta\left({\left[\tau_{1}\left\|\bm{\mu}_{1}\right\|_{2}^{2}-\tau_{2}\left\|\bm{\mu}_{2}\right\|_{2}^{2}\right]^{2}}/{(\sigma_{p}^{4}d/n_{0})}\right)\right\}$, we have $\mathbf{x}^{\prime}\preceq^{(t)}\mathbf{x}$.

Proof of Proposition D.1. See the proof of Proposition G.16.

This proposition serves as the key to the analysis of the querying statistics, as samples with the lower $\underset{j,r}{\mathbb{E}}(\gamma_{j,r,l})$ are perplexing samples. Based on the coefficient scale presented in Lemma G.14, we can obtain the probability lower bound for $\mathbf{x}^{\prime}\preceq^{(t)}\mathbf{x}$, which is

$$P(\mathbf{x}^{\prime}\preceq^{(t)}\mathbf{x})\geq 1-8m\exp\left\{-\Theta\left(\frac{\left[\tau_{1}\left\|\bm{\mu}_{1}\right\|_{2}^{2}-\tau_{2}\left\|\bm{\mu}_{2}\right\|_{2}^{2}\right]^{2}}{\sigma_{p}^{4}d/n_{0}}\right)\right\}. \tag{7}$$

Thus we can conclude that perplexing samples are samples with lower $\tau_{l}\left\|\bm{\mu}_{l}\right\|_{2}^{2}$. We can then relax the conditions on feature norms by imposing specific conditions on $p$. Additionally, we can relax both the conditions on feature norms and the conditions on $p$ to consider a more general case. The upcoming sections will discuss these scenarios in detail.
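To make the dependence of the lower bound (7) on the gap $\tau_{1}\|\bm{\mu}_{1}\|_{2}^{2}-\tau_{2}\|\bm{\mu}_{2}\|_{2}^{2}$ concrete, the following minimal sketch evaluates its right-hand side numerically. The constant `c` is a hypothetical stand-in for the unspecified factor hidden in $\Theta(\cdot)$, and the result is clipped at zero because the bound is vacuous when it turns negative.

```python
import math

def ordering_probability_lower_bound(tau1, mu1_norm, tau2, mu2_norm,
                                     sigma_p, d, n0, m, c=1.0):
    """Evaluate 1 - 8 m exp(-c * gap^2 / (sigma_p^4 d / n0)), clipped at 0.

    gap = tau1 * ||mu1||^2 - tau2 * ||mu2||^2; c replaces the Theta(.) constant
    and is an assumption of this sketch.
    """
    gap = tau1 * mu1_norm ** 2 - tau2 * mu2_norm ** 2
    exponent = c * gap ** 2 / (sigma_p ** 4 * d / n0)
    return max(0.0, 1.0 - 8.0 * m * math.exp(-exponent))
```

When the two weighted norms coincide the gap is zero and the bound carries no information, while a larger gap or a larger initial set size $n_{0}$ pushes the bound towards one.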

D.3.1 Case 1: Relaxed Conditions on Feature Norms

In the main body of our work, we impose the following conditions on feature norms for ease of presentation: $\|\bm{\mu}_{1}\|_{2}^{4}=\Omega(\sigma_{p}^{4}dn_{0}^{-1})$, $\|\bm{\mu}_{2}\|_{2}^{4}=O(\sigma_{p}^{4}dn_{0}^{-1})$ and $\|\bm{\mu}_{1}\|_{2}^{2}-\|\bm{\mu}_{2}\|_{2}^{2}=\Omega({\sigma_{p}}^{2}d^{1/2}n_{0}^{-1/2}(\log(m/\delta^{\prime}))^{1/2})$. In this section we provide a version of the theory that relaxes these requirements (i.e., no discrepancy in terms of feature norms is required). The essence is that we can impose stricter assumptions on $p$ to ensure that a learning progress disparity exists between the two features. Even so, the inherent principle of the two criteria-based NAL approaches would still drive the algorithms to preferentially query the samples containing the yet-to-be-learned features. The rigorous rationale behind these claims is explored in Appendix G.3 and Appendix G.5. Here, we leverage the deduction results in Appendix G.3, Appendix G.4 and Appendix G.5 to readily form the following results.

Definition D.2.

(Definition with relaxed conditions on feature norms) Let $\bm{\mu}_{1}\perp\bm{\mu}_{2}\in\mathbb{R}^{d}$ be two fixed feature vectors. Each data point $(\mathbf{x},y)$, where $\mathbf{x}$ contains two patches as $\mathbf{x}=[\mathbf{x}_{1}^{T},\mathbf{x}_{2}^{T}]^{T}\in\mathbb{R}^{2d}$ and $y\in\{-1,1\}$, is generated from the distribution $\mathcal{D}$ as follows:

  • The ground truth label y is synthesized from a Rademacher distribution.

  • Noise Patch. One patch of $\mathbf{x}$ is selected as a noise patch $\bm{\xi}$, synthesized from the Gaussian distribution $N(\mathbf{0},\sigma_{p}^{2}\cdot\mathbf{I})$.

  • Feature Patch. For a feeble $p$ satisfying $p<O(n_{0}\sigma_{p}^{4}d\|\bm{\mu}_{2}\|_{2}^{-4}),(\|\bm{\mu}_{1}\|_{2}^{2}+\|\bm{\mu}_{2}\|_{2}^{2})^{-1}(\|\bm{\mu}_{1}\|_{2}^{2}+{\sigma_{p}}^{2}d^{1/2}n_{0}^{-1/2}(\log(8m/\delta^{\prime}))^{1/2})$, the remaining patch of $\mathbf{x}$ is selected as the label-related feature patch: with high probability $1-p$ the feature patch is a common feature $y\cdot\bm{\mu}_{1}$, while only with probability $p$ the feature patch is a rare feature $y\cdot\bm{\mu}_{2}$.

Here we only require that the learning of features would not be completely disturbed by noise: $\forall l\in\{1,2\},\|\bm{\mu}_{l}\|_{2}^{2}=\Omega(\sigma_{p}^{2}\log(n_{0}/\delta),n_{0}^{-1}d\sigma_{p}^{4})$.

The specific condition on the occurrence probability $p$ serves two purposes. Firstly, it ensures that strategy-free passive learning cannot sample enough rare data to adequately learn the rare label-related feature $\bm{\mu}_{2}$, as observed in the real-world scenario depicted in Figure 1. Secondly, it helps distinguish the learning progress between $\bm{\mu}_{1}$ and $\bm{\mu}_{2}$.
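The generative process of Definition D.2 can be simulated with a minimal sketch such as the one below. The function names, the seed handling, and the uniform choice of which of the two positions holds the noise patch are assumptions of this illustration and are not specified in the definition. The feature vectors passed in are expected to be orthogonal.

```python
import random

def sample_point(mu1, mu2, p, sigma_p, rng):
    """Draw one (x, y, is_rare) triple from the two-patch feature-noise model."""
    y = rng.choice([-1, 1])                        # Rademacher label
    is_rare = rng.random() < p                     # rare feature mu2 with probability p
    mu = mu2 if is_rare else mu1
    feature = [y * v for v in mu]                  # feature patch y * mu_l
    noise = [rng.gauss(0.0, sigma_p) for _ in mu]  # noise patch ~ N(0, sigma_p^2 I)
    # Assumption of this sketch: the noise patch position is chosen uniformly.
    patches = [feature, noise] if rng.random() < 0.5 else [noise, feature]
    return patches[0] + patches[1], y, is_rare

def sample_dataset(n, mu1, mu2, p, sigma_p, seed=0):
    """Draw n i.i.d. points; x lives in R^{2d} where d = len(mu1)."""
    rng = random.Random(seed)
    return [sample_point(mu1, mu2, p, sigma_p, rng) for _ in range(n)]
```

With a small $p$, an initial set drawn this way contains only about $pn_{0}$ samples carrying $\bm{\mu}_{2}$, which is the shortage of rare data that the condition on $p$ is meant to capture.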

We can prove that the three querying algorithms still exhibit harmful overfitting at the initial stage.

Proposition D.3.

(Before Querying) At the initial stage before querying, $\forall\varepsilon>0$, under Condition 3.1, with probability at least $1-\delta$, there exists $t=\widetilde{O}\left(\eta^{-1}\varepsilon^{-1}mn_{0}d^{-1}\sigma_{p}^{-2}\right)$ such that the following hold for all three querying algorithms:

  1. The training loss converges to $\varepsilon$, i.e., $L_{S}\left(\mathbf{W}^{(t)}\right)\leq\varepsilon$.

  2. The test error remains at constant level, i.e., $L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)=\Theta(1)\geq 0.12\cdot p^{*}$.

Then, we can still examine the querying stage based on the techniques in Appendix G.4.

Proposition D.4.

(Querying Stage) During querying, under the same conditions as Proposition D.3, with probability at least $1-\Theta(\delta+\delta^{\prime})$, Uncertainty Sampling and Diversity Sampling would both pick the $n^{*}$ samples on which the model exhibits the lowest $\underset{j,r}{\mathbb{E}}\gamma_{j,r,l}^{(t)}$ (i.e., perplexing samples). Moreover, those perplexing samples are samples with the rare feature $\bm{\mu}_{2}$.
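As a rough illustration of the selection step, the sketch below picks the $n^{*}$ pool samples with the smallest absolute network output. Scoring uncertainty by $|f(\mathbf{x})|$ is an assumption of this sketch (a common margin-based proxy); it is not a restatement of the exact query rule analysed in the main body, and the function name is hypothetical.

```python
import heapq

def query_lowest_margin(pool_outputs, n_star):
    """Return indices of the n_star pool samples with the smallest |f(x)|,
    ordered from most to least uncertain."""
    return heapq.nsmallest(n_star, range(len(pool_outputs)),
                           key=lambda i: abs(pool_outputs[i]))
```

If the coefficients for $\bm{\mu}_{2}$ are small, samples carrying $\bm{\mu}_{2}$ produce outputs close to zero and are therefore the ones returned, which matches the qualitative statement of the proposition.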

Similar to the theories presented in the main body of our study, we can establish the following theorem.

Theorem D.5.

(After Querying) Suppose the sampling size $n^{*}$ of the three querying algorithms satisfies $C_{1}\sigma_{p}^{4}d\|\bm{\mu}_{2}\|_{2}^{-4}-pn_{0}/2\leq n^{*}=\Theta(\widetilde{n}-n_{0})\leq\widetilde{n}-n_{0}$, where $C_{1}$ is some positive constant. Then $\forall\varepsilon>0$, under the same conditions as Proposition 3, with probability more than $1-\Theta(\delta+\delta^{\prime})$, there exists $t=\widetilde{O}\left(\eta^{-1}\varepsilon^{-1}m(n_{0}+n^{*})d^{-1}\sigma_{p}^{-2}\right)$ such that:

  • For all three querying algorithms, the training loss converges to $\varepsilon$, i.e., $L_{S}\left(\mathbf{W}^{(t)}\right)\leq\varepsilon$.

  • Uncertainty Sampling and Diversity Sampling algorithms have negligible, near Bayes-optimal test error: $L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)\leq\exp\left(\Theta\left(\dfrac{-\widetilde{n}\|\bm{\mu}_{l}\|_{2}^{4}}{\sigma_{p}^{4}d}\right)\right),\ l\in\{1,2\}$.

  • The Random Sampling algorithm would remain at a constant-order test error: $L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)=\Theta(1)\geq 0.12\cdot p^{*}$.

D.3.2 Case 2: Flexible Cases

Indeed, we can relax both the conditions on feature norms and the conditions on $p$ to explore more general cases. By (7), if $\tau_{1}\left\|\bm{\mu}_{1}\right\|_{2}^{2}\approx\tau_{2}\left\|\bm{\mu}_{2}\right\|_{2}^{2}$, the learning progress of the two types of features would be alike (i.e., $\underset{j,r}{\mathbb{E}}(\gamma_{j,r,1})\approx\underset{j,r}{\mathbb{E}}(\gamma_{j,r,2})$), and we cannot clearly observe which type of feature-equipped samples are likely to be queried. Thanks to our sample-complexity analysis regimes in Appendix G.5, we can clearly examine two scenarios at the initial stage based on (17) and Lemma G.21:

  • Benign Overfitting: if τl𝝁l242C1σp4dn01subscript𝜏𝑙superscriptsubscriptnormsubscript𝝁𝑙242subscript𝐶1superscriptsubscript𝜎𝑝4𝑑superscriptsubscript𝑛01\tau_{l}\|\bm{\mu}_{l}\|_{2}^{4}\geq 2C_{1}\sigma_{p}^{4}dn_{0}^{-1}italic_τ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ∥ bold_italic_μ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT ≥ 2 italic_C start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 4 end_POSTSUPERSCRIPT italic_d italic_n start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT, the learning of 𝝁lsubscript𝝁𝑙\bm{\mu}_{l}bold_italic_μ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT-equipped data would be adequate, and the test error of algorithms achieve Bayes-optimal.

  • Harmful Overfitting: if $\tau_{l}\|\bm{\mu}_{l}\|_{2}^{4}\leq 2C_{2}/3\sigma_{p}^{4}dn_{0}^{-1}$, the learning of $\bm{\mu}_{l}$-equipped data would be inadequate, and the test error of the algorithms remains at a constant level.

Then, we can list some cases with certain $p$ ($\tau_{2}=\Theta(p)$ by Lemma 17) and $\|\bm{\mu}_{l}\|_{2}$, $l\in\{1,2\}$, in our analysis regime:

  1. When the learning of both $\bm{\mu}_{1}$ and $\bm{\mu}_{2}$ is adequate, we can conclude that $n_{0}$ is already sufficient for training in this case.

  2. When the learning of both $\bm{\mu}_{1}$ and $\bm{\mu}_{2}$ is inadequate at the initial stage, all querying algorithms (i.e., Random Sampling, Uncertainty Sampling and Diversity Sampling) can help further the learning of features. While our theory indicates that the two NAL algorithms would tend to prioritize samples with the comparatively more poorly learned feature (i.e., $\{\bm{\mu}_{l}\mid\tau_{l}\|\bm{\mu}_{l}\|_{2}^{4}=\min(\tau_{1}\|\bm{\mu}_{1}\|_{2}^{4},\tau_{2}\|\bm{\mu}_{2}\|_{2}^{4})\}$), the difference in generalization ability between Random Sampling and the two NAL algorithms would depend on certain parameters (i.e., $p,n^{*},\lvert\mathcal{P}\rvert,\|\bm{\mu}_{1}\|_{2},\|\bm{\mu}_{2}\|_{2}$).

  3. When the learning of $\bm{\mu}_{l_{1}}$ is adequate while the learning of $\bm{\mu}_{l_{2}}$ is inadequate ($l_{1}\neq l_{2}\in\{1,2\}$), we have the following cases based on our theory:

    • If $\tau_{l_{1}}\|\bm{\mu}_{l_{1}}\|_{2}^{2}\approx\tau_{l_{2}}\|\bm{\mu}_{l_{2}}\|_{2}^{2}$, the prioritization by the two NAL algorithms is not obvious, and they would perform similarly to Random Sampling.

    • If $\tau_{l_{1}}\|\bm{\mu}_{l_{1}}\|_{2}^{2}>\tau_{l_{2}}\|\bm{\mu}_{l_{2}}\|_{2}^{2}$, the two NAL algorithms would tend to prioritize perplexing samples (i.e., samples with $\bm{\mu}_{l_{2}}$), and their prioritization has the probability lower bound in (7). Meanwhile, the difference in generalization ability between Random Sampling and the two NAL algorithms would depend on certain parameters (i.e., $p,n^{*},\lvert\mathcal{P}\rvert,\|\bm{\mu}_{1}\|_{2},\|\bm{\mu}_{2}\|_{2}$). Specifically, under Condition 3.1, Definition 1 and Definition D.2 provide two parameter settings satisfying $\tau_{l_{1}}\|\bm{\mu}_{l_{1}}\|_{2}^{2}-\tau_{l_{2}}\|\bm{\mu}_{l_{2}}\|_{2}^{2}=\Omega(\sigma_{p}^{2}d^{1/2}n_{0}^{-1/2}(\log(m/\delta^{\prime}))^{1/2})$, where the two NAL algorithms succeed while Random Sampling fails (i.e., Theorem 3.4 and Theorem D.5). Other general scenarios can also be rigorously analyzed with the prioritization probability lower bound in (7) and permutation probability.

  4. Other cases would be similar to the second or third case (i.e., where $\exists l\in\{1,2\}$, $2C_{2}/3\sigma_{p}^{4}dn_{0}^{-1}\leq\tau_{l}\|\bm{\mu}_{l}\|_{2}^{4}\leq 2C_{1}\sigma_{p}^{4}dn_{0}^{-1}$).
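The case analysis above can be sketched as a small classifier. This is only an illustration under stated assumptions: the constants $C_{1}$ and $C_{2}$ come from the analysis and their values are not given here, so the defaults below are placeholders; the harmful-overfitting threshold is read as $(2C_{2}/3)\,\sigma_{p}^{4}dn_{0}^{-1}$; and the names `Feature`, `learning_regime` and `initial_case` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Feature:
    tau: float   # proportion tau_l of samples carrying feature mu_l
    norm: float  # feature strength ||mu_l||_2


def learning_regime(feat, sigma_p, d, n0, C1=1.0, C2=1.0):
    """Classify one feature at the initial stage.

    C1 and C2 are placeholders for the (unspecified) constants of the analysis.
    The lower threshold is read as (2*C2/3) * sigma_p^4 * d / n0 (an assumption).
    """
    strength = feat.tau * feat.norm ** 4
    scale = sigma_p ** 4 * d / n0
    if strength >= 2.0 * C1 * scale:
        return "adequate"      # benign overfitting regime
    if strength <= (2.0 * C2 / 3.0) * scale:
        return "inadequate"    # harmful overfitting regime
    return "intermediate"      # between the two thresholds


def initial_case(f1, f2, sigma_p, d, n0, C1=1.0, C2=1.0):
    """Return the case number (1 to 4) from the list above."""
    r1 = learning_regime(f1, sigma_p, d, n0, C1, C2)
    r2 = learning_regime(f2, sigma_p, d, n0, C1, C2)
    if "intermediate" in (r1, r2):
        return 4
    if r1 == r2 == "adequate":
        return 1
    if r1 == r2 == "inadequate":
        return 2
    return 3  # one feature adequately learned, the other not
```

For instance, with the placeholder constants, `initial_case(Feature(0.9, 8.0), Feature(0.1, 3.0), sigma_p=1.0, d=2000, n0=10)` falls into the third case. Since the true constants are unknown, the outcome for any concrete parameter setting should not be read as a statement about the experiments.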

In real-world scenarios, the pool-based setting often resembles a wide range of flexible cases. From the perspective of feature learning, our theoretical observations suggest that the occurrence probability and strength of different task-specific features can profoundly impact the efficiency of NAL algorithms.

D.4 Cases of Criteria Preference

Our work has uncovered a non-trivial connection between the two query criteria-based NAL methods. Specifically, they share a sufficient condition, which we also call the shared principle, that is vital to the success of NAL methods. It holds when the learning progress of the well-learned features surpasses that of the yet-to-be-learned features to a certain degree:

$$
\underbrace{\Theta(\underset{j,r}{\mathbb{E}}(\gamma_{j,r,1}))-\Theta(\underset{j,r}{\mathbb{E}}(\gamma_{j,r,2}))}_{\text{Learning Progress Disparity: well-learned Feature vs. yet-to-be-learned Feature}}>\max_{j,r,l}\lvert\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\rangle\rvert.
$$

However, as discussed in Appendix D.3.2 above, when this shared sufficient condition (or principle) does not hold, the behaviors of the two heuristic criteria-based sampling methods may differ.

Cases favoring uncertainty-based Sampling. Specifically, when the label budget is not highly limited and there is sufficient opportunity to capture all feature types, uncertainty-based sampling may be preferred. Our analysis shows that, compared to uncertainty sampling, diversity sampling has a stricter requirement: a scalar $(\tau_{1}-\tau_{2})$ smaller than 1 appears on the left-hand side of inequalities (37) and (70), in contrast to (31) and (64) for uncertainty sampling. This allows uncertainty sampling to more precisely prioritize samples with yet-to-be-learned features, more easily ensuring adequate learning across all feature types.

Cases favoring diversity-based Sampling. However, when label complexity is quite limited (as per Appendix D.3.2) and all task-oriented features still require further labelling budget, we may favor diversity-based sampling. Although all sampling algorithms increase test accuracy by addressing insufficient learning of certain features, diversity sampling’s efficiency in obtaining diverse features could enhance the model’s ability to grasp diverse low-dimensional patterns. This in turn could enrich generalization, even when the test distribution differs from training.

Our statements here align with discussions in the recent survey [Zhan et al., 2021]. We believe this nuanced perspective deserves further exploration.

Cases favoring Strategy-free Random Sampling. As discussed in Appendix D.3.2, our theory suggests that when $\tau_{1}\|\bm{\mu}_{1}\|^{2}\approx\tau_{2}\|\bm{\mu}_{2}\|^{2}$, where $\tau_{l}$ denotes the proportion of $\bm{\mu}_{l}$ in the training set, the multiple task-oriented features are balanced in their “easiness” to learn. In such cases, the learning progress of these features tends to be similar, and the prioritization by NAL methods may not be clearly evident. In other words, if there is no distinct gap between well-learned and yet-to-be-learned features, uniform sampling might be sufficient, and the advantage of NAL methods only emerges when there is a clear distinction of “learning easiness” among various task-oriented feature categories.

Additionally, consider the scenarios of active fine-tuning, where the task objective changes heavily or slightly. In such situations, the task-oriented low-dimensional patterns may shift, and the model’s optimal representation could differ from before. As a result, NAL methods that leverage prior neural representations for sampling may not be as effective, and uniform sampling could be a satisfactory choice.

D.5 Discussions of Multi-round NALs

Our theory suggests that the core principle underlying both NAL methods is their tendency to prioritize the selection of samples containing yet-to-be-learned features. This fundamental characteristic is not inherently tied to the single-round setting, but rather reflects an intrinsic property of the two primary criteria-based NAL families.

In the multi-round iterative process, the learning progress of different features may diverge across rounds and potentially align with the various cases discussed in Appendix D.3.2. However, we expect the NAL methods to continue performing well due to their innate focus on prioritizing the selection of samples containing yet-to-be-learned features.

D.6 Discussions of Practical Lessons of our Results

Here are some key takeaways of our theory:

  • Potential of NAL to surpass fully-trained NN. As discussed in Appendix D.1, and corroborated by the results in Lu et al. [2023], fully-trained neural networks tend to learn hard-to-learn features in an inefficient manner, as they place disproportionate emphasis on the easy-to-learn ones. In contrast, our analysis suggests that the NAL approach prioritizes samples with low $\gamma_{j,r,l}$, making it more likely to achieve a balanced rise in $\gamma_{j,r,1}$ and $\gamma_{j,r,2}$ during the new round of training. This implies that NAL has a better chance of ensuring sufficient learning of all features within a certain number of iterations, compared to fully-trained neural networks. This conclusion is partially validated by the empirical results presented in our Figures 2, 5, and 7, where the NALs outperform the neural networks. In real-world settings, we conjecture that NAL might have this potential when the neural network is sufficiently overparameterized and has the capacity to capture all relevant patterns of the problem instances within limited iterations.

  • Attend to orthogonal components of features or gradients. Our theory suggests that if techniques can be adopted to capture the meaningful orthogonal components of a neural network’s features or gradients (e.g., using ICA [Yamagiwa et al., 2023]), then the samples with low-magnitude latent feature components or high-magnitude gradient components might align with the perplexing samples in our work. This is because our theory indicates that yet-to-be-learned features are often underrepresented in the neural network’s latent space, and if the loss is non-increasing, the length in the latent space might be inversely proportional to the length in the corresponding gradient space. Notably, existing state-of-the-art methods such as BADGE [Ash et al., 2020] also leverage a similar idea with respect to the gradient component of the last layer.

  • Incorporate Signal-to-Noise Ratio (SNR) Measurement. Our discussions in Appendix D.3 indicate that the perplexing samples are often characterized by their rarity and low SNR (the scale ratio between feature and noise). Techniques, whether learnable or not, that can accurately or approximately measure the SNR of multiple task-oriented features in a NN’s latent space may help develop a principled NAL approach. For specific tasks and datasets, it may be feasible to develop such task-oriented SNR measurement methods.
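To make the SNR notion in the last takeaway concrete, the following sketch computes the feature-to-noise scale ratio in input space. It assumes that each noise patch has i.i.d. Gaussian entries with standard deviation $\sigma_{p}$, so that its norm concentrates around $\sigma_{p}\sqrt{d}$. This noise model, the function names and the toy data are assumptions made for illustration; measuring SNR in a network's latent space, as the takeaway suggests, would require further design choices not covered here.

```python
import numpy as np


def closed_form_snr(feature_norm, sigma_p, d):
    """Scale ratio ||mu||_2 / (sigma_p * sqrt(d)) under the assumed Gaussian noise."""
    return feature_norm / (sigma_p * np.sqrt(d))


def empirical_snr(mu, noise_patches):
    """Ratio of ||mu||_2 to the average norm of the observed noise patches."""
    mean_noise_norm = np.linalg.norm(noise_patches, axis=1).mean()
    return np.linalg.norm(mu) / mean_noise_norm


# Toy data (hypothetical): one feature direction and 200 Gaussian noise patches.
rng = np.random.default_rng(0)
d, sigma_p = 2000, 1.0
mu = np.zeros(d)
mu[0] = 8.0  # feature with ||mu||_2 = 8
noise = rng.normal(0.0, sigma_p, size=(200, d))

snr_theory = closed_form_snr(8.0, sigma_p, d)
snr_sample = empirical_snr(mu, noise)
```

Under this assumption the two estimates agree closely in high dimension, and a feature with a smaller norm has a proportionally smaller SNR.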

Appendix E Additional Experiments

E.1 Sampling Information of Main Results

Figure 3: Rescaled $\gamma$ ($\gamma=\mathbb{E}\gamma_{j,k,l}^{(t)}$), Uncertainty (i.e., $-$Confidence Score) and Feature Distance (with various $p$ of $l_{p}$ norm) of the samples in sampling pool $\mathcal{P}$, where $\gamma$ represents the learning progress of the feature in a particular sample. The dashed line in the graph represents the top 30 samples with the highest Feature Distance.
Figure 4: Comparison of querying information between two NAL algorithms, illustrating training size changes in labeled data sets, Confidence Score, and Feature Distance before and after querying. Panels: (a) Random Sampling; (b) Uncertainty Sampling; (c) Diversity Sampling.

Here we give more visualized details of the querying stage. The parameter settings are the same as in Section 5. Figure 3 visualizes the rescaled $\underset{j,k,l}{\mathbb{E}}\gamma_{j,k,l}$, Uncertainty ($-$Confidence Score) and Feature Distance of each sample in the unlabeled sampling pool $\mathcal{P}$, where the dashed line corresponds to the top $n^{*}$ samples based on Diversity Order. Regardless of the value of $p$, the Uncertainty Order and Diversity Order of samples remain the same and correspond to the order of $\underset{j,k,l}{\mathbb{E}}\gamma_{j,k,l}$. This validates our unification claims in Proposition 3 and Lemma 4.4. Figure 4 makes it clear that the two NAL algorithms successfully obtain the hard-to-learn samples, while Random Sampling hardly obtains any, as it selects samples in a random manner.
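The order agreement described above can be checked programmatically. The sketch below is a minimal illustration, assuming that samples are queried from the highest score downwards, that Uncertainty is the negated Confidence Score, and that low $\gamma$ corresponds to early querying (so $-\gamma$ is used as its score). The function names and the four-sample toy scores are hypothetical and are not data from our experiments.

```python
import numpy as np


def query_order(scores):
    """Indices of samples sorted from highest to lowest score (stable for ties)."""
    return np.argsort(-np.asarray(scores, dtype=float), kind="stable")


def same_query_order(*score_vectors):
    """True if all score vectors rank the pool samples identically."""
    orders = [query_order(s) for s in score_vectors]
    return all(np.array_equal(orders[0], o) for o in orders[1:])


# Hypothetical scores for a pool of four samples.
gamma = np.array([0.9, 0.1, 0.5, 0.05])         # learning progress of each sample's feature
uncertainty = np.array([-2.0, -0.3, -1.1, -0.1])  # negated confidence score
feature_distance = np.array([0.2, 1.5, 0.7, 2.1])

orders_agree = same_query_order(-gamma, uncertainty, feature_distance)
n_star = 2
queried = query_order(feature_distance)[:n_star]  # top n* samples by Diversity Order
```

In this toy pool all three criteria produce the same ranking, so either NAL criterion would query the two samples with the lowest $\gamma$ first.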

E.2 Experiments: Data Model under Other Conditions

Figure 5: Learning/memorization progress of features and noise ($\gamma_{l}$ represents $\max_{j,k}\gamma_{j,k,l}^{(t)}$, and $\rho$ represents $\max_{j,k,i}\rho_{j,k,i}^{(t)}$), train/test losses, and test accuracy of the full-trained model and the three querying algorithms, with $T^{*}=200$, $d=2000$, $\|\bm{\mu}_{1}\|=8$, $p=p^{*}=0.1$, $\|\bm{\mu}_{2}\|=8$, $n_{CNN}=200$, $n_{0}=10$, $n^{*}=30$ and $\lvert\mathcal{P}\rvert=190$. Panels: (a) Full trained model; (b) Random Sampling; (c) Uncertainty Sampling; (d) Diversity Sampling.
Figure 6: Comparison of querying information between two NAL algorithms, illustrating training size changes in labeled data sets, Confidence Score, and Feature Distance before and after querying ($T^{*}=200$, $d=2000$, $\|\bm{\mu}_{1}\|=9$, $p=p^{*}=0.2$, $\|\bm{\mu}_{2}\|=3$, $n_{CNN}=200$, $n_{0}=10$, $n^{*}=30$ and $\lvert\mathcal{P}\rvert=190$). Panels: (a) Random Sampling; (b) Uncertainty Sampling; (c) Diversity Sampling.

We investigate the scenario where the strengths (i.e., feature norms) of different features do not vary significantly, as discussed in the main body of our work. Specifically, we set them to be the same: $\|\bm{\mu}_{1}\|_{2}=\|\bm{\mu}_{2}\|_{2}=8$. The other parameters are as follows: $T^{*}=200$, $p=p^{*}=0.1$, $d=2000$, $n_{CNN}=200$, $n_{0}=10$, $n^{*}=30$, $\lvert\mathcal{P}\rvert=190$, $\sigma_{p}=1$ and $\sigma_{0}=0.01$. In this case, where $\tau_{1}\|\bm{\mu}_{1}\|>\tau_{2}\|\bm{\mu}_{2}\|$ (since $\tau_{2}=\Theta(p)$ and the norms are equal), the perplexing samples are those equipped with $\bm{\mu}_{2}$. It is worth noting that our chosen value of $p=0.1$ is not small enough to satisfy the condition in Definition D.2. Instead, our parameter setting falls under the second bullet point of the third case discussed in Appendix D.3.2. Figure 5 demonstrates the success of both NAL algorithms, while Figure 6 illustrates the sample information. Both NAL algorithms prioritize the perplexing samples more effectively than Random Sampling, resulting in a lower test error rate.

E.3 Experiments: XOR Data Versions

Figure 7: Learning/memorization progress of features and noise ($\gamma_{l}$ represents $\max_{j,k}\{\gamma_{j,k,\mathbf{u}_{l}}^{(t)},\gamma_{j,k,\mathbf{v}_{l}}^{(t)}\}$, and $\rho$ represents $\max_{j,k,i}\rho_{j,k,i}^{(t)}$), train/test losses, and test accuracy of the full-trained model and the three querying algorithms, with $\cos\theta=0.4$, $T^{*}=200$, $d=2000$, $\|\bm{\mu}_{1}\|=20$, $p=p^{*}=0.2$, $\|\bm{\mu}_{2}\|=6$, $n_{CNN}=200$, $n_{0}=10$, $n^{*}=30$ and $\lvert\mathcal{P}\rvert=190$. Panels: (a) Full-trained 2 layer CNN; (b) Random Sampling; (c) Uncertainty Sampling; (d) Diversity Sampling.
Figure 8: Comparison of querying information between two NAL algorithms over XOR data, illustrating training size changes in labeled data sets, Confidence Score, and Feature Distance before and after querying ($\cos\theta=0.4$, $T^{*}=200$, $d=2000$, $\|\bm{\mu}_{1}\|=20$, $p=p^{*}=0.2$, $\|\bm{\mu}_{2}\|=6$, $n_{CNN}=200$, $n_{0}=10$, $n^{*}=30$ and $\lvert\mathcal{P}\rvert=190$). Panels: (a) Random Sampling; (b) Uncertainty Sampling; (c) Diversity Sampling.

We also conduct experiments on XOR data. We set the parameters as: $\cos\theta=0.4$, $T^{*}=200$, $d=2000$, $\|\bm{\mu}_{1}\|=20$, $p=p^{*}=0.2$, $\|\bm{\mu}_{2}\|=6$, $n_{CNN}=200$, $n_{0}=10$, $n^{*}=30$ and $\lvert\mathcal{P}\rvert=190$. Figure 7 and Figure 8 clearly demonstrate that the two NAL algorithms succeed via prioritizing perplexing samples, i.e., samples with $\bm{\mu}_{2}$ features.

Appendix F Details of Querying Algorithms

F.1 2-layer ReLU CNN

We adopt the 2-layer ReLU CNN, which is representative of non-linear neural models. This setting also makes both the model’s uncertainty towards samples and the latent feature representation available, paving the way to design NAL algorithms on top of it. The first layer of the model is composed of $2m$ neurons/filters, with $m$ positive and $m$ negative, each of which is applied separately to the two patches $\mathbf{x}_{1}$ and $\mathbf{x}_{2}$, with a ReLU function $\sigma(z)=\max\{0,z\}$. Specifically, the parameters of the second pooling layer are set to $+\frac{1}{m}$ and $-\frac{1}{m}$ respectively. The network can thus be expressed as $f(\mathbf{W},\mathbf{x})=F_{+1}(\mathbf{W}_{+1},\mathbf{x})-F_{-1}(\mathbf{W}_{-1},\mathbf{x})$, where $F_{+1}$ and $F_{-1}$ are the partial network functions for the positive and negative neurons/filters, respectively. For $j\in\{+1,-1\}$, $F_{j}(\mathbf{W}_{j},\mathbf{x})$ is defined as follows:

$$
\begin{aligned}
F_j(\mathbf{W}_j, \mathbf{x}) &= \frac{1}{m}\sum_{r=1}^{m}\left[\sigma\left(\langle \mathbf{w}_{j,r}, \mathbf{x}_1\rangle\right) + \sigma\left(\langle \mathbf{w}_{j,r}, \mathbf{x}_2\rangle\right)\right] \\
&= \frac{1}{m}\sum_{r=1}^{m}\left[\sigma\left(\langle \mathbf{w}_{j,r}, y\cdot\bm{\mu}\rangle\right) + \sigma\left(\langle \mathbf{w}_{j,r}, \bm{\xi}\rangle\right)\right].
\end{aligned}
\tag{8}
$$

We denote by $\mathbf{w}_{j,r} \in \mathbb{R}^{d}$ the weight vector of the $r$-th neuron/filter in $\mathbf{W}_j$, where $\mathbf{W}_j$ is the aggregate of the model weights associated with the $F_j$ filters. We use $\mathbf{W}$ to denote the aggregate of all model weights. Without loss of generality, we set the derivative of the ReLU function at 0 to 1, i.e., $\sigma'(0) = 1$.
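The forward pass in (8) can be sketched in a few lines of plain Python. This is only an illustrative sketch: the sizes `d` and `m`, the Gaussian initialization scale `sigma0`, and the concrete signal vector `mu` are assumptions chosen for the example, not values from the experiments.

```python
import random

def relu(z):
    return max(0.0, z)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def partial_network(W_j, x1, x2):
    """F_j(W_j, x) = (1/m) * sum_r [relu(<w_{j,r}, x1>) + relu(<w_{j,r}, x2>)]."""
    m = len(W_j)
    return sum(relu(dot(w, x1)) + relu(dot(w, x2)) for w in W_j) / m

def network(W_pos, W_neg, x1, x2):
    """f(W, x) = F_{+1}(W_{+1}, x) - F_{-1}(W_{-1}, x)."""
    return partial_network(W_pos, x1, x2) - partial_network(W_neg, x1, x2)

# Hypothetical toy sizes and initialization scale (assumptions for illustration).
rng = random.Random(0)
d, m, sigma0 = 8, 4, 0.1
W_pos = [[rng.gauss(0.0, sigma0) for _ in range(d)] for _ in range(m)]
W_neg = [[rng.gauss(0.0, sigma0) for _ in range(d)] for _ in range(m)]

y = 1
mu = [1.0] + [0.0] * (d - 1)                   # assumed signal direction
x1 = [y * c for c in mu]                       # signal patch y * mu
x2 = [rng.gauss(0.0, 1.0) for _ in range(d)]   # noise patch xi
out = network(W_pos, W_neg, x1, x2)
```

Because the second-layer weights are fixed to $\pm\frac{1}{m}$, swapping the positive and negative filter banks simply flips the sign of the output.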

F.2 Score and Order of Samples

We claim that the following definitions and lemmas hold for both linearly separable data and XOR data.

Definition F.1.

(Confidence Score) The Confidence Score $C\left(\mathbf{W}^{(t)}, \mathbf{x}\right)$ is defined as below:

$$
C\left(\mathbf{W}^{(t)}, \mathbf{x}\right) = \max\left\{\frac{1}{1+\exp\left\{-y\cdot f\left(\mathbf{W}^{(t)}, \mathbf{x}\right)\right\}},\ 1-\frac{1}{1+\exp\left\{-y\cdot f\left(\mathbf{W}^{(t)}, \mathbf{x}\right)\right\}}\right\}
\tag{9}
$$

The Confidence Score $C\left(\mathbf{W}^{(t)}, \mathbf{x}\right)$ represents the probability that the model, under the logistic loss, assigns to its predicted label $y$.

Definition F.2.

(Uncertainty Order) We denote the sampling pool by $\mathcal{P}$, where $\mathcal{P} \subsetneq \mathcal{D}$. For $t > 0$ and $\forall \mathbf{x}, \mathbf{x}' \in \mathcal{P}$, we define the Uncertainty Order $\prec_C^{(t)}$ and $\preceq_C^{(t)}$, which denote the order of the model's uncertainty about its predictions on $\mathbf{x}$ and $\mathbf{x}'$ at time step $t$:

$$
\begin{aligned}
\mathbf{x} \prec_C^{(t)} \mathbf{x}' &\text{ if } C\left(\mathbf{W}^{(t)}, \mathbf{x}\right) > C\left(\mathbf{W}^{(t)}, \mathbf{x}'\right), \\
\mathbf{x} \preceq_C^{(t)} \mathbf{x}' &\text{ if } C\left(\mathbf{W}^{(t)}, \mathbf{x}\right) \geq C\left(\mathbf{W}^{(t)}, \mathbf{x}'\right).
\end{aligned}
\tag{10}
$$

We say the model's uncertainty at time step $t$ about $\mathbf{x}$ is less than that about $\mathbf{x}'$ if $\mathbf{x} \prec_C^{(t)} \mathbf{x}'$. In particular, if the model's uncertainty at time step $t$ about every element of a set $\mathbf{X}$ is less than its uncertainty about every element of a set $\mathbf{X}'$, we use the same notation for the Uncertainty Order between sets at time step $t$: $\mathbf{X} \prec_C^{(t)} \mathbf{X}'$.

Lemma F.3.

The Uncertainty Order is a full order. In addition, for $\forall \mathbf{x}, \mathbf{x}' \in \mathcal{P}$ and $t > 0$, we have:

$$
\mathbf{x} \preceq_C^{(t)} \mathbf{x}' \Leftrightarrow \left|f\left(\mathbf{W}^{(t)}, \mathbf{x}\right)\right| \geq \left|f\left(\mathbf{W}^{(t)}, \mathbf{x}'\right)\right|
\tag{11}
$$
Proof. Since $y \in \{+1, -1\}$, the maximum in (9) equals $\frac{1}{1+\exp\{-|f(\mathbf{W}^{(t)}, \mathbf{x})|\}}$. Hence

$$
\begin{aligned}
\mathbf{x} \preceq_C^{(t)} \mathbf{x}' &\Leftrightarrow C\left(\mathbf{W}^{(t)}, \mathbf{x}\right) \geq C\left(\mathbf{W}^{(t)}, \mathbf{x}'\right) \\
&\Leftrightarrow \frac{1}{1+\exp\left\{-\left|f\left(\mathbf{W}^{(t)}, \mathbf{x}\right)\right|\right\}} \geq \frac{1}{1+\exp\left\{-\left|f\left(\mathbf{W}^{(t)}, \mathbf{x}'\right)\right|\right\}} \\
&\Leftrightarrow \left|f\left(\mathbf{W}^{(t)}, \mathbf{x}\right)\right| \geq \left|f\left(\mathbf{W}^{(t)}, \mathbf{x}'\right)\right|.
\end{aligned}
$$

As one can always obtain $f\left(\mathbf{W}^{(t)}, \mathbf{x}\right) \in \mathbb{R}$ for a given $\mathbf{x}$ at time step $t$, the Uncertainty Order is a full order. ∎
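A small numerical illustration of Lemma F.3 follows. It is a minimal sketch assuming binary labels $y \in \{+1, -1\}$; the network outputs in `outputs` are made-up values, not results from the paper. It checks that ranking samples by the Confidence Score (9) coincides with ranking them by $|f|$.

```python
import math

def confidence(f_value, y=1):
    """Confidence Score (9): max{s, 1 - s} with s = 1 / (1 + exp(-y * f))."""
    s = 1.0 / (1.0 + math.exp(-y * f_value))
    return max(s, 1.0 - s)

# Hypothetical network outputs f(W, x) for four pool samples.
outputs = {"a": 2.5, "b": -0.3, "c": 0.9, "d": -1.7}

# Ascending confidence = most uncertain sample first.
by_conf = sorted(outputs, key=lambda k: confidence(outputs[k]))
# Ascending |f| should give the same ranking by Lemma F.3.
by_abs = sorted(outputs, key=lambda k: abs(outputs[k]))
most_uncertain = by_conf[0]
```

Note that the value of `confidence` does not depend on the sign of `y`, which is why the order can be computed on an unlabeled pool.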

In Lemma F.5, we will show that, under our data model, sampling based on the Uncertainty Order is equivalent to sampling based on the score functions of typical model uncertainty-based approaches, such as Least Confidence [Lewis and Catlett, 1994], Margin [Roth and Small, 2006] and Entropy [Joshi et al., 2009]. The Uncertainty Order is therefore a simple yet representative abstraction of this family of approaches.

Definition F.4.

The following are the definitions of the score functions of LeastConf [Lewis and Catlett, 1994], Margin [Roth and Small, 2006] and Entropy [Joshi et al., 2009].

  • Least Confidence selects data points whose predicted label $y$ has the lowest posterior probability, so the score function of LeastConf is:

    $$Score(\mathbf{W}^{(t)}, \mathbf{x}) = -P(y \mid \mathbf{x}, \mathbf{W}^{(t)}), \tag{12}$$
  • The score function of Margin is:

    $$Score(\mathbf{W}^{(t)}, \mathbf{x}) = -\left[P(y \mid \mathbf{x}, \mathbf{W}^{(t)}) - P(-y \mid \mathbf{x}, \mathbf{W}^{(t)})\right], \tag{13}$$
  • The score function of Entropy is:

    $$Score(\mathbf{W}^{(t)}, \mathbf{x}) = -\left[P(y \mid \mathbf{x}, \mathbf{W}^{(t)}) \log P(y \mid \mathbf{x}, \mathbf{W}^{(t)}) + P(-y \mid \mathbf{x}, \mathbf{W}^{(t)}) \log P(-y \mid \mathbf{x}, \mathbf{W}^{(t)})\right], \tag{14}$$
Lemma F.5.

Sampling based on the score functions defined in (12), (13) and (14) is equivalent to sampling based on the Uncertainty Order in Definition F.2.

Proof.

By definition, $C\left(\mathbf{W}^{(t)}, \mathbf{x}\right) = P(y \mid \mathbf{x}, \mathbf{W}^{(t)}) = -Score(\mathbf{W}^{(t)}, \mathbf{x})$ for the score in (12), showing the equivalence of the LeastConf method and ours. Then, by Lemma F.3 and the property $P(-y \mid \mathbf{x}, \mathbf{W}^{(t)}) = 1 - C\left(\mathbf{W}^{(t)}, \mathbf{x}\right)$, it is easy to verify that $\left|f\left(\mathbf{W}^{(t)}, \mathbf{x}\right)\right| \propto C\left(\mathbf{W}^{(t)}, \mathbf{x}\right) \propto \left[C\left(\mathbf{W}^{(t)}, \mathbf{x}\right) - \left(1 - C\left(\mathbf{W}^{(t)}, \mathbf{x}\right)\right)\right]$, and $\left|f\left(\mathbf{W}^{(t)}, \mathbf{x}\right)\right| \propto C\left(\mathbf{W}^{(t)}, \mathbf{x}\right) \propto \left[C\left(\mathbf{W}^{(t)}, \mathbf{x}\right) \log C\left(\mathbf{W}^{(t)}, \mathbf{x}\right) + \left(1 - C\left(\mathbf{W}^{(t)}, \mathbf{x}\right)\right) \log\left(1 - C\left(\mathbf{W}^{(t)}, \mathbf{x}\right)\right)\right]$, where $\propto$ is read as "is monotonically increasing in". Therefore, the priority order of the samples based on those score functions is the same as the Uncertainty Order, which completes the proof. ∎
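The equivalence in Lemma F.5 can be checked numerically on a toy pool. The sketch below assumes binary labels, so that $P(y \mid \mathbf{x}, \mathbf{W}^{(t)})$ for the predicted label equals the sigmoid of $|f|$; the outputs in `outputs` are invented for illustration.

```python
import math

def uncertainty_scores(f_value):
    """Scores (12)-(14) for a binary model whose predicted-label probability is sigmoid(|f|)."""
    c = 1.0 / (1.0 + math.exp(-abs(f_value)))  # C = P(y | x, W) for the predicted label y
    q = 1.0 - c                                # P(-y | x, W)
    return {
        "least_conf": -c,
        "margin": -(c - q),
        "entropy": -(c * math.log(c) + q * math.log(q)),
    }

def ranking(outputs, name):
    """Samples sorted from highest to lowest score, i.e. queried first to last."""
    return sorted(outputs, key=lambda k: uncertainty_scores(outputs[k])[name], reverse=True)

# Hypothetical network outputs f(W, x) for four pool samples.
outputs = {"a": 2.5, "b": -0.3, "c": 0.9, "d": -1.7}
rankings = {name: ranking(outputs, name) for name in ("least_conf", "margin", "entropy")}
by_abs_f = sorted(outputs, key=lambda k: abs(outputs[k]))  # smallest |f| first
```

All three score functions rank the sample with the smallest $|f|$ first, matching the Uncertainty Order.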

Definition F.6.

(Feature Distance) The latent feature representation of a sample $\mathbf{x} = [\mathbf{x}_1^{T}, \mathbf{x}_2^{T}]^{T}$ in the latent feature space $\mathcal{Z} \subseteq \mathbb{R}^{m}$ of our ReLU CNN at time step $t$ is:

$$
\mathbf{Z}(\mathbf{x}, t) = \sum_{j}\left(\sigma\left(\langle \mathbf{W}_j^{(t)}, \mathbf{x}_1\rangle\right) + \sigma\left(\langle \mathbf{W}_j^{(t)}, \mathbf{x}_2\rangle\right)\right)
$$

Apparently $\mathbf{Z}(\mathbf{x}, t) \in \mathbb{R}^{m}$. The Feature Distance is measured by the $l_p$ ($p \in [1, \infty)$) distance between the sample's feature representation and the average feature representation of the current labeled set $\mathcal{D}_n := \{\mathbf{x}^{(i)}\}_{i=1}^{n}$:

$$
D\left(\mathbf{W}^{(t)}, \mathbf{x} \mid \mathcal{D}_n\right) = \left\|\mathbf{Z}(\mathbf{x}, t) - \underset{\mathbf{x}^{(i)} \in \mathcal{D}_n}{\mathbb{E}} \mathbf{Z}(\mathbf{x}^{(i)}, t)\right\|_{p}
\tag{15}
$$
Definition F.7.

(Diversity Order) Similar to Definition F.2, we define the Diversity Order $\prec_D^{(t)}$, $\preceq_D^{(t)}$ based on the Feature Distance $D\left(\mathbf{W}^{(t)}, \mathbf{x} \mid \mathcal{D}_n\right)$. Borrowing the notation of Definition F.2, we have:

$$
\begin{aligned}
\mathbf{x} \prec_D^{(t)} \mathbf{x}' &\text{ if } D\left(\mathbf{W}^{(t)}, \mathbf{x} \mid \mathcal{D}_n\right) < D\left(\mathbf{W}^{(t)}, \mathbf{x}' \mid \mathcal{D}_n\right), \\
\mathbf{x} \preceq_D^{(t)} \mathbf{x}' &\text{ if } D\left(\mathbf{W}^{(t)}, \mathbf{x} \mid \mathcal{D}_n\right) \leq D\left(\mathbf{W}^{(t)}, \mathbf{x}' \mid \mathcal{D}_n\right).
\end{aligned}
\tag{16}
$$

As in Definition F.2, we also have set-level notations such as $\mathbf{X} \prec_D^{(t)} (\preceq_D^{(t)}) \mathbf{X}'$. Based on the triangle inequality for the $l_p$ norm and (15), we can easily conclude that the Diversity Order is also a full order. Furthermore, in the case that both $\mathbf{x} \prec_C^{(t)} (\preceq_C^{(t)}) \mathbf{x}'$ and $\mathbf{x} \prec_D^{(t)} (\preceq_D^{(t)}) \mathbf{x}'$, $\forall p \in [1, \infty)$, hold, we denote the order relationship by $\prec^{(t)} (\preceq^{(t)})$, such that $\mathbf{x} \prec^{(t)} (\preceq^{(t)})\ \mathbf{x}'$.
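The Feature Distance (15) and the resulting Diversity Order (16) can be sketched as follows. This is a sketch under an assumed reading of $\mathbf{Z}(\mathbf{x}, t)$: entry $r$ of the latent vector sums the ReLU activations of the $r$-th positive and $r$-th negative filter over both patches. The tiny weight matrices and pool samples are hypothetical.

```python
def relu(z):
    return max(0.0, z)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def latent(W_pos, W_neg, x1, x2):
    """Z(x, t) in R^m: entry r sums ReLU activations of filter r over j in {+1, -1} and both patches."""
    m = len(W_pos)
    return [sum(relu(dot(W[r], x1)) + relu(dot(W[r], x2)) for W in (W_pos, W_neg))
            for r in range(m)]

def feature_distance(z, labeled_zs, p=2):
    """l_p distance between z and the mean latent representation of the labeled set."""
    mean = [sum(col) / len(labeled_zs) for col in zip(*labeled_zs)]
    return sum(abs(a - b) ** p for a, b in zip(z, mean)) ** (1.0 / p)

# Hypothetical weights (m = 2 filters per sign, d = 2) and samples given as (x1, x2) patches.
W_pos = [[1.0, 0.0], [0.0, 1.0]]
W_neg = [[-1.0, 0.0], [0.0, -1.0]]
labeled = [latent(W_pos, W_neg, [1.0, 0.0], [0.0, 0.0]),
           latent(W_pos, W_neg, [1.0, 0.0], [0.0, 0.0])]
pool = {"near": ([1.0, 0.0], [0.0, 0.1]), "far": ([0.0, 3.0], [0.0, 0.0])}

dist = {k: feature_distance(latent(W_pos, W_neg, *v), labeled) for k, v in pool.items()}
# Ascending D, matching (16): x precedes x' when D(x) < D(x').
diversity_order = sorted(pool, key=dist.get)
```

Here the sample whose active feature differs from the labeled set ends up last in the ascending order, i.e., it has the largest Feature Distance.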

Appendix G Proofs of Main Results

In this section, we denote by $n$ the number of training data in the current labeled training set, which is $n_0$ at the initial stage and $n_1$ after sampling (querying). Besides, we denote the proportion of easy-to-learn data in the current labeled set by $\tau_1$, and the proportion of hard-to-learn data in the current labeled set by $\tau_2$ for notational simplicity. Some statistical results that are not directly related to our main contribution can be obtained with the same techniques as in Cao et al. [2022a], Kou et al. [2023b], Meng et al. [2023], Lu et al. [2023], so we omit their proof details. Instead, our focus is on providing comprehensive proofs of our primary contribution.

G.1 Preliminary Lemmas

The following lemmas give finite-sample concentration results that characterize the statistical properties of the random elements involved in our problem. They hold under both the linearly separable data and the XOR data (i.e., $\bm{\mu}_l \in \{\bm{\mu}_l, \mathbf{u}_l, \mathbf{v}_l\}$, $\forall l \in \{1, 2\}$).

Lemma G.1.

Suppose that $\delta > 0$ and $d = \Omega\left(\log\left(\frac{6n}{\delta}\right)\right)$. Then with probability at least $1 - \delta$,

$$
\begin{aligned}
&\frac{\sigma_p^2 d}{2} \leq \left\|\bm{\xi}_i\right\|_2^2 \leq \frac{3\sigma_p^2 d}{2}, \\
&\left|\left\langle \bm{\xi}_i, \bm{\xi}_{i'}\right\rangle\right| \leq 2\sigma_p^2 \cdot \sqrt{d \log\left(\frac{6n^2}{\delta}\right)}, \\
&\left|\left\langle \bm{\xi}_i, \bm{\mu}_l\right\rangle\right| \leq \|\bm{\mu}_l\|_2 \sigma_p \cdot \sqrt{2\log\left(\frac{12n}{\delta}\right)}
\end{aligned}
$$

for all $i, i' \in [n]$, $l \in \{1, 2\}$.

Proof of Lemma G.1. The proof can be found in Lemma B.2 in Cao et al. [2022a], Lemma B.4 in Kou et al. [2023b], Lemma B.3 in Meng et al. [2023] or Lemma A.3 in Lu et al. [2023].
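As a sanity check of the first two bounds in Lemma G.1, one can draw the noise vectors and compare the empirical extremes with the stated bounds. The sketch below rests on several assumptions that are not stated in this section: the noise is taken to be $\bm{\xi}_i \sim \mathcal{N}(0, \sigma_p^2 \mathbf{I}_d)$, the cross term is checked only for $i \neq i'$, and the values of `n`, `d`, `sigma_p` and `delta` are arbitrary illustrative choices.

```python
import math
import random

def check_noise_concentration(n=10, d=1000, sigma_p=1.0, delta=0.1, seed=0):
    """Empirically compare ||xi_i||^2 and |<xi_i, xi_i'>| with the bounds of Lemma G.1."""
    rng = random.Random(seed)
    # Assumed noise model: i.i.d. Gaussian entries with standard deviation sigma_p.
    xi = [[rng.gauss(0.0, sigma_p) for _ in range(d)] for _ in range(n)]
    sq_norms = [sum(v * v for v in row) for row in xi]
    # Cross inner products over distinct pairs only (i != i').
    cross = [abs(sum(a * b for a, b in zip(xi[i], xi[k])))
             for i in range(n) for k in range(i + 1, n)]
    return {
        "min_sq_norm": min(sq_norms),
        "max_sq_norm": max(sq_norms),
        "norm_lower": sigma_p ** 2 * d / 2,
        "norm_upper": 3 * sigma_p ** 2 * d / 2,
        "max_cross": max(cross),
        "cross_bound": 2 * sigma_p ** 2 * math.sqrt(d * math.log(6 * n ** 2 / delta)),
    }

report = check_noise_concentration()
```

In high dimension the squared norms cluster tightly around $\sigma_p^2 d$, while inner products between distinct noise vectors are only of order $\sigma_p^2\sqrt{d}$, which is what makes the noise patches nearly orthogonal.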

Lemma G.2.

Suppose that δ>0,d=Ω(log(mnδ)),formulae-sequence𝛿0𝑑Ω𝑚𝑛𝛿\delta>0,d=\Omega(\log(\dfrac{mn}{\delta})),italic_δ > 0 , italic_d = roman_Ω ( roman_log ( divide start_ARG italic_m italic_n end_ARG start_ARG italic_δ end_ARG ) ) , and m=Ω(log(1δ))𝑚Ω1𝛿m=\Omega(\log(\dfrac{1}{\delta}))italic_m = roman_Ω ( roman_log ( divide start_ARG 1 end_ARG start_ARG italic_δ end_ARG ) ). Then with probability at least 1δ1𝛿1-\delta1 - italic_δ,

σ02d2𝐰j,r(0)223σ02d2,superscriptsubscript𝜎02𝑑2superscriptsubscriptnormsuperscriptsubscript𝐰𝑗𝑟0223superscriptsubscript𝜎02𝑑2\displaystyle\dfrac{\sigma_{0}^{2}d}{2}\leq\|\mathbf{w}_{j,r}^{(0)}\|_{2}^{2}% \leq 3\dfrac{\sigma_{0}^{2}d}{2},divide start_ARG italic_σ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d end_ARG start_ARG 2 end_ARG ≤ ∥ bold_w start_POSTSUBSCRIPT italic_j , italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ≤ 3 divide start_ARG italic_σ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_d end_ARG start_ARG 2 end_ARG ,
|𝐰j,r(0),𝝁l|2log(16mδ)σ0𝝁l2,\displaystyle\left|\left\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}_{l}\right% \rangle\right|\leq\sqrt{2\log(\dfrac{16m}{\delta}})\cdot\sigma_{0}\|\bm{\mu}_{% l}\|_{2},| ⟨ bold_w start_POSTSUBSCRIPT italic_j , italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT , bold_italic_μ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ⟩ | ≤ square-root start_ARG 2 roman_log ( divide start_ARG 16 italic_m end_ARG start_ARG italic_δ end_ARG end_ARG ) ⋅ italic_σ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∥ bold_italic_μ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ,
|𝐰j,r(0),𝝃i|2log16mnδσ0σpdsuperscriptsubscript𝐰𝑗𝑟0subscript𝝃𝑖216𝑚𝑛𝛿subscript𝜎0subscript𝜎𝑝𝑑\displaystyle\left|\left\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\right% \rangle\right|\leq 2\sqrt{\log\dfrac{16mn}{\delta}}\cdot\sigma_{0}\sigma_{p}% \sqrt{d}| ⟨ bold_w start_POSTSUBSCRIPT italic_j , italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT , bold_italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⟩ | ≤ 2 square-root start_ARG roman_log divide start_ARG 16 italic_m italic_n end_ARG start_ARG italic_δ end_ARG end_ARG ⋅ italic_σ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT square-root start_ARG italic_d end_ARG

for all r[m],j{±1},l{1,2}formulae-sequence𝑟delimited-[]𝑚formulae-sequence𝑗plus-or-minus1𝑙12r\in[m],j\in\{\pm 1\},l\in\{1,2\}italic_r ∈ [ italic_m ] , italic_j ∈ { ± 1 } , italic_l ∈ { 1 , 2 } and i[n]𝑖delimited-[]𝑛i\in[n]italic_i ∈ [ italic_n ]. Moreover,

σ0𝝁l22maxr[m]j𝐰j,r(0),𝝁l2log(16mδ)σ0𝝁l2,\displaystyle\dfrac{\sigma_{0}\|\bm{\mu}_{l}\|_{2}}{2}\leq\max_{r\in[m]}j\cdot% \left\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}_{l}\right\rangle\leq\sqrt{2\log(% \dfrac{16m}{\delta}})\cdot\sigma_{0}\|\bm{\mu}_{l}\|_{2},divide start_ARG italic_σ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∥ bold_italic_μ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_ARG start_ARG 2 end_ARG ≤ roman_max start_POSTSUBSCRIPT italic_r ∈ [ italic_m ] end_POSTSUBSCRIPT italic_j ⋅ ⟨ bold_w start_POSTSUBSCRIPT italic_j , italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT , bold_italic_μ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ⟩ ≤ square-root start_ARG 2 roman_log ( divide start_ARG 16 italic_m end_ARG start_ARG italic_δ end_ARG end_ARG ) ⋅ italic_σ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∥ bold_italic_μ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ,
σ0σpd4maxr[m]j𝐰j,r(0),𝝃i2log16mnδσ0σpdsubscript𝜎0subscript𝜎𝑝𝑑4subscript𝑟delimited-[]𝑚𝑗superscriptsubscript𝐰𝑗𝑟0subscript𝝃𝑖216𝑚𝑛𝛿subscript𝜎0subscript𝜎𝑝𝑑\displaystyle\dfrac{\sigma_{0}\sigma_{p}\sqrt{d}}{4}\leq\max_{r\in[m]}j\cdot% \left\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\right\rangle\leq 2\sqrt{\log% \dfrac{16mn}{\delta}}\cdot\sigma_{0}\sigma_{p}\sqrt{d}divide start_ARG italic_σ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT square-root start_ARG italic_d end_ARG end_ARG start_ARG 4 end_ARG ≤ roman_max start_POSTSUBSCRIPT italic_r ∈ [ italic_m ] end_POSTSUBSCRIPT italic_j ⋅ ⟨ bold_w start_POSTSUBSCRIPT italic_j , italic_r end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( 0 ) end_POSTSUPERSCRIPT , bold_italic_ξ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ⟩ ≤ 2 square-root start_ARG roman_log divide start_ARG 16 italic_m italic_n end_ARG start_ARG italic_δ end_ARG end_ARG ⋅ italic_σ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT square-root start_ARG italic_d end_ARG

for all $j\in\{\pm 1\}$, $l\in\{1,2\}$ and $i\in[n]$.

Proof of Lemma G.2. The proof can be found in Lemma B.3 in Cao et al. [2022a], Lemma B.5 in Kou et al. [2023b], Lemma B.4 in Meng et al. [2023] or Lemma A.4 and Lemma C.1 in Lu et al. [2023].
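The feature inequality of Lemma G.2 can be checked empirically. The following is a minimal sketch, assuming Gaussian initialization $\mathbf{w}_{j,r}^{(0)}\sim\mathcal{N}(0,\sigma_0^2\mathbf{I})$ (the convention of the cited works, taken here as an assumption) and a hypothetical feature vector; the function names and parameter values are illustrative only.

```python
import math
import random

def init_bounds_hold(m, sigma0, mu, delta, rng):
    """Draw w_{j,r}^{(0)} ~ N(0, sigma0^2 I) and test the feature inequality for both j."""
    mu_norm = math.sqrt(sum(v * v for v in mu))
    lower = sigma0 * mu_norm / 2
    upper = math.sqrt(2 * math.log(16 * m / delta)) * sigma0 * mu_norm
    for j in (+1, -1):
        # max over r in [m] of j * <w_{j,r}^{(0)}, mu>, one fresh weight vector per r
        best = max(
            j * sum(rng.gauss(0.0, sigma0) * v for v in mu)
            for _ in range(m)
        )
        if not (lower <= best <= upper):
            return False
    return True

def success_rate(trials=200, m=50, d=20, sigma0=0.01, delta=0.1, seed=0):
    """Fraction of independent initializations for which both bounds hold."""
    rng = random.Random(seed)
    mu = [1.0] + [0.0] * (d - 1)  # hypothetical feature vector (assumption)
    hits = sum(init_bounds_hold(m, sigma0, mu, delta, rng) for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    print("empirical success rate:", success_rate())
```

With these illustrative values the empirical success rate should sit well above $1-\delta$, since the lemma's bounds are not tight for moderate $m$.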

Next, we utilize the property of binomial tails to examine the proportion of hard-to-learn data within the subsets generated from the data distribution $\mathcal{D}$ (i.e., the initial labeled set $\mathcal{D}_{n_{0}} := \{\mathbf{x}^{(i)}\}_{i=1}^{n_{0}} \subseteq \mathcal{D}$, the sampling pool $\mathcal{P}\subseteq\mathcal{D}$, and the final labeled set $\mathcal{D}_{n_{1}}^{(random)} := \{{\mathbf{x}^{(random)}}^{(i)}\}_{i=1}^{n_{1}} \subseteq \mathcal{D}$ obtained through Random Sampling).

Lemma G.3.

Suppose that $\delta>0$ and $n_{0}, \tilde{n}, |P| = \Omega\left(\dfrac{1-p}{p}\log\left(\dfrac{1}{\delta}\right)\right)$. For $n\in\left\{n_{0}, |P|, n_{1}\right\}$, denote by $n_{p}\leq n$ the number of hard-to-learn data among the $n$ samples. Then with probability at least $1-\delta$, we have

$$\frac{1}{2}p\cdot n \leqslant n_{p} \leqslant \frac{3}{2}p\cdot n \tag{17}$$

Proof of Lemma G.3. We can view $n_{p}$ as a binomial random variable with success probability $p$ and $n$ trials. By Exercise 2.9(a) in Wainwright [2019], we have

$$P\left(\frac{pn}{2}\leq n_{p}\leq\frac{3pn}{2}\right) \geqslant 1-2e^{-nD\left(\frac{p}{2}\|p\right)}$$

where the quantity $D(\delta\|\alpha)$, for all $\delta,\alpha\in\left(0,\frac{1}{2}\right]$, is defined as

$$D(\delta\|\alpha) := \delta\log\left(\frac{\delta}{\alpha}\right)+(1-\delta)\log\left(\frac{1-\delta}{1-\alpha}\right).$$

Since $\dfrac{p}{2}<p$, by Exercise 2.9(b) in Wainwright [2019] we obtain $P\left(\dfrac{pn}{2}\leq n_{p}\leq\dfrac{3pn}{2}\right)\geq 1-\delta$ directly from Hoeffding's inequality.
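As a numerical sanity check of this concentration statement, the following sketch compares the exact binomial probability of the band $[pn/2,\,3pn/2]$ with a two-sided Chernoff bound in KL form. The function names and the values $n=400$, $p=0.25$ are illustrative assumptions. The bound used here keeps separate exponents $D(p/2\|p)$ and $D(3p/2\|p)$ for the two tails, which is a standard form whose constants may differ from the expression quoted in the proof.

```python
import math

def kl_bernoulli(a, b):
    """D(a || b) between Bernoulli(a) and Bernoulli(b), for 0 < a, b < 1."""
    return a * math.log(a / b) + (1 - a) * math.log((1 - a) / (1 - b))

def prob_in_band(n, p):
    """Exact P(p*n/2 <= X <= 3*p*n/2) for X ~ Binomial(n, p)."""
    lo = math.ceil(p * n / 2)
    hi = math.floor(3 * p * n / 2)
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(lo, hi + 1)
    )

def chernoff_lower_bound(n, p):
    """1 - exp(-n D(p/2 || p)) - exp(-n D(3p/2 || p)); requires 3p/2 < 1."""
    lower_tail = math.exp(-n * kl_bernoulli(p / 2, p))
    upper_tail = math.exp(-n * kl_bernoulli(1.5 * p, p))
    return 1 - lower_tail - upper_tail

if __name__ == "__main__":
    n, p = 400, 0.25  # illustrative values, not taken from the paper
    print("exact probability:", prob_in_band(n, p))
    print("Chernoff bound   :", chernoff_lower_bound(n, p))
```

Both exponents scale linearly in $n$, which is why a sample size of order $\frac{1}{p}\log\frac{1}{\delta}$ suffices to push the failure probability below $\delta$.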

Remark G.4.

It is important to note that $\mathcal{D}_{n_{0}}$ and $\mathcal{P}$ are generated by independent sampling from $\mathcal{D}$. However, the generation of $\mathcal{D}_{n_{1}}^{(random)}$ is based on $\mathcal{D}_{n_{0}}$ and $\mathcal{P}$. In our analysis, instead of a martingale argument based on conditional probabilities, we consider the overall process of the labeled set obtained by Random Sampling, where $\mathcal{D}_{n_{1}}^{(random)}$ is directly sampled from $\mathcal{D}$.

G.2 Coefficient Ratio and Scale Analysis

In this section, we provide lemmas that characterize the behavior of coefficients under gradient descent. Subsequently, we establish the scale of the coefficients in the training dynamics. It’s worth noting that in this section we assume the results in Appendix G.1 all hold with high probability.

Definition G.5.

(Equivalent techniques to Definition 4.1 in Cao et al. [2022a], Definition 5.1 in Kou et al. [2023b]) Denote $\mathbf{w}_{j,r}^{(t)}$ for $j\in\{\pm 1\}$, $r\in[m]$ as the convolution neurons/filters at the $t^{th}$ timestep of gradient descent, then there exist unique coefficients $\gamma_{j,r,l}^{(t)}$ and $\rho_{j,r,i}^{(t)}$ such that

$$\mathbf{w}_{j,r}^{(t)}=\mathbf{w}_{j,r}^{(0)}+j\cdot\sum_{l=1}^{2}\gamma_{j,r,l}^{(t)}\cdot\frac{\bm{\mu}_{l}}{\|\bm{\mu}_{l}\|_{2}^{2}}+\sum_{i=1}^{n}\rho_{j,r,i}^{(t)}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}$$

Further denote $\bar{\rho}_{j,r,i}^{(t)}$ as $\rho_{j,r,i}^{(t)}\mathbb{1}\left(\rho_{j,r,i}^{(t)}\geq 0\right)$ and $\underline{\rho}_{j,r,i}^{(t)}$ as $\rho_{j,r,i}^{(t)}\mathbb{1}\left(\rho_{j,r,i}^{(t)}\leq 0\right)$. Then:

$$\mathbf{w}_{j,r}^{(t)}=\mathbf{w}_{j,r}^{(0)}+j\cdot\sum_{l=1}^{2}\gamma_{j,r,l}^{(t)}\cdot\frac{\bm{\mu}_{l}}{\|\bm{\mu}_{l}\|_{2}^{2}}+\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(t)}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}+\sum_{i=1}^{n}\underline{\rho}_{j,r,i}^{(t)}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}. \tag{18}$$

We denote $U_{l}=\left\{i\in[n]:\mathbf{x}^{(i)}=[y_{i}\cdot\bm{\mu}_{l},\bm{\xi}_{i}]\right\}$, for $l\in\{1,2\}$. The following lemma presents the update rule of coefficients.

Lemma G.6.

The coefficients $\gamma_{j,r,l}^{(t)},\bar{\rho}_{j,r,i}^{(t)},\underline{\rho}_{j,r,i}^{(t)}$ defined in Definition G.5 satisfy the following iterative equations:

$$
\begin{aligned}
&\gamma_{j,r,l}^{(0)},\ \bar{\rho}_{j,r,i}^{(0)},\ \underline{\rho}_{j,r,i}^{(0)}=0,\\
&\gamma_{j,r,l}^{(t+1)}=\gamma_{j,r,l}^{(t)}-\frac{\eta}{nm}\cdot\sum_{i\in U_{l}}{\ell_{i}^{\prime}}^{(t)}\sigma^{\prime}\left(\left\langle\mathbf{w}_{j,r}^{(t)},y_{i}\cdot\bm{\mu}_{l}\right\rangle\right)\cdot\|\bm{\mu}_{l}\|_{2}^{2},\\
&\bar{\rho}_{j,r,i}^{(t+1)}=\bar{\rho}_{j,r,i}^{(t)}-\frac{\eta}{nm}\cdot{\ell_{i}^{\prime}}^{(t)}\cdot\sigma^{\prime}\left(\left\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\right\rangle\right)\cdot\left\|\bm{\xi}_{i}\right\|_{2}^{2}\cdot\mathbb{1}\left(y_{i}=j\right),\\
&\underline{\rho}_{j,r,i}^{(t+1)}=\underline{\rho}_{j,r,i}^{(t)}+\frac{\eta}{nm}\cdot{\ell_{i}^{\prime}}^{(t)}\cdot\sigma^{\prime}\left(\left\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\right\rangle\right)\cdot\left\|\bm{\xi}_{i}\right\|_{2}^{2}\cdot\mathbb{1}\left(y_{i}=-j\right),
\end{aligned}
$$

for all $r\in[m]$, $j\in\{\pm 1\}$, $l\in\{1,2\}$ and $i\in[n]$.

Remark G.7.

This lemma serves as a cornerstone in our analysis of dynamics. Originally, the study of neural network dynamics under gradient descent required us to track the variations in weights. However, this lemma enables us to view these dynamics from a new perspective, focusing on two distinct elements: feature learning (represented by $\gamma_{j,r,l}^{(t+1)}$) and noise memorization (represented by $\rho_{j,r,i}^{(t+1)}$). We can easily observe that $\gamma_{j,r,l}^{(t)}$ is strictly increasing since ${\ell_{i}^{\prime}}^{(t)}$ is strictly negative.
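The coefficient view of Lemma G.6 can be illustrated on a toy instance. The sketch below runs gradient descent directly on the weights while tracking $\gamma_{j,r,l}^{(t)}$ and $\rho_{j,r,i}^{(t)}$ with the lemma's updates, then checks that the decomposition of Definition G.5 reproduces the weights. It rests on several assumptions not fixed by this section: ReLU activation, logistic loss, a network output of the form $F_{+1}-F_{-1}$ averaged over $m$ filters, and hypothetical toy data; all names are illustrative.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def relu(z):
    return z if z > 0 else 0.0

def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def simulate(steps=30, eta=0.5, d=8, n=6, m=3, seed=0):
    rng = random.Random(seed)
    # Hypothetical toy data: two orthogonal features and Gaussian noise patches.
    mus = [[2.0] + [0.0] * (d - 1), [0.0, 1.0] + [0.0] * (d - 2)]
    ys = [rng.choice([-1, 1]) for _ in range(n)]
    ls = [i % 2 for i in range(n)]  # feature index carried by sample i
    xis = [[rng.gauss(0.0, 0.5) for _ in range(d)] for _ in range(n)]
    keys = [(j, r) for j in (1, -1) for r in range(m)]
    w0 = {k: [rng.gauss(0.0, 0.1) for _ in range(d)] for k in keys}
    w = {k: v[:] for k, v in w0.items()}
    gamma = {(j, r, l): 0.0 for (j, r) in keys for l in range(2)}
    rho = {(j, r, i): 0.0 for (j, r) in keys for i in range(n)}

    for _ in range(steps):
        # Loss derivatives l'_i at the current weights (logistic loss, assumed).
        lprime = []
        for i in range(n):
            mu = mus[ls[i]]
            f = 0.0
            for (j, r) in keys:
                act = relu(ys[i] * dot(w[j, r], mu)) + relu(dot(w[j, r], xis[i]))
                f += j * act / m
            lprime.append(-1.0 / (1.0 + math.exp(ys[i] * f)))
        new_w = {}
        for (j, r) in keys:
            old = w[j, r]
            upd = old[:]
            for i in range(n):
                mu, xi, l = mus[ls[i]], xis[i], ls[i]
                c = eta / (n * m) * lprime[i]
                s_mu = relu_grad(ys[i] * dot(old, mu))
                s_xi = relu_grad(dot(old, xi))
                # Direct gradient-descent step on the weight vector.
                for k in range(d):
                    upd[k] -= c * (s_xi * j * ys[i] * xi[k] + s_mu * j * mu[k])
                # Coefficient updates in the form given by the lemma.
                gamma[j, r, l] -= c * s_mu * dot(mu, mu)
                rho[j, r, i] -= c * s_xi * dot(xi, xi) * j * ys[i]
            new_w[j, r] = upd
        w = new_w

    # Rebuild each weight from the coefficients and compare with direct GD.
    err = 0.0
    for (j, r) in keys:
        for k in range(d):
            rec = w0[j, r][k]
            rec += sum(j * gamma[j, r, l] * mus[l][k] / dot(mus[l], mus[l]) for l in range(2))
            rec += sum(rho[j, r, i] * xis[i][k] / dot(xis[i], xis[i]) for i in range(n))
            err = max(err, abs(rec - w[j, r][k]))
    return err, gamma, rho, ys

if __name__ == "__main__":
    err, gamma, rho, ys = simulate()
    print("max reconstruction error:", err)
```

In this sketch the sign of each $\rho_{j,r,i}^{(t)}$ matches $j\cdot y_i$, which is what separates the coefficients into the nonnegative part $\bar{\rho}$ (when $y_i=j$) and the nonpositive part $\underline{\rho}$ (when $y_i=-j$).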

Proof of Lemma G.6. Applying the gradient descent rule in (2), we get

$$
\begin{aligned}
\mathbf{w}_{j,r}^{(t+1)}=\mathbf{w}_{j,r}^{(0)}&-\frac{\eta}{nm}\sum_{s=0}^{t}\sum_{i=1}^{n}{\ell_{i}^{\prime}}^{(s)}\cdot\sigma^{\prime}\left(\left\langle\mathbf{w}_{j,r}^{(s)},\bm{\xi}_{i}\right\rangle\right)\cdot jy_{i}\bm{\xi}_{i}\\
&-\frac{\eta}{nm}\sum_{s=0}^{t}\sum_{i=1}^{n}{\ell_{i}^{\prime}}^{(s)}\cdot\sigma^{\prime}\left(\left\langle\mathbf{w}_{j,r}^{(s)},y_{i}\bm{\mu}_{l}\right\rangle\right)\cdot j\bm{\mu}_{l}.
\end{aligned}
$$

Based on the definition of $\gamma_{j,r,l}^{(t)}$ and $\rho_{j,r,i}^{(t)}$, we consider $\gamma_{j,r,l}^{(0)},\bar{\rho}_{j,r,i}^{(0)},\underline{\rho}_{j,r,i}^{(0)}=0$ and

$$\mathbf{w}_{j,r}^{(t)}=\mathbf{w}_{j,r}^{(0)}+j\cdot\sum_{l=1}^{2}\gamma_{j,r,l}^{(t)}\cdot\|\bm{\mu}_{l}\|_{2}^{-2}\cdot\bm{\mu}_{l}+\sum_{i=1}^{n}\rho_{j,r,i}^{(t)}\cdot\left\|\bm{\xi}_{i}\right\|_{2}^{-2}\cdot\bm{\xi}_{i}.$$

Note that $\bm{\mu}_{1}$, $\bm{\mu}_{2}$ and $\bm{\xi}_{i}$ are linearly independent with probability 1, thus we have the following unique representation

$$
\begin{aligned}
&\gamma_{j,r,l}^{(t)}=-\frac{\eta}{nm}\sum_{s=0}^{t}\sum_{i\in U_{l}}{\ell_{i}^{\prime}}^{(s)}\cdot\sigma^{\prime}\left(\left\langle\mathbf{w}_{j,r}^{(s)},y_{i}\bm{\mu}_{l}\right\rangle\right)\cdot\|\bm{\mu}_{l}\|_{2}^{2},\\
&\rho_{j,r,i}^{(t)}=-\frac{\eta}{nm}\sum_{s=0}^{t}{\ell_{i}^{\prime}}^{(s)}\cdot\sigma^{\prime}\left(\left\langle\mathbf{w}_{j,r}^{(s)},\bm{\xi}_{i}\right\rangle\right)\cdot\left\|\bm{\xi}_{i}\right\|_{2}^{2}\cdot jy_{i}.
\end{aligned}
$$

Recalling that $U_{l}=\left\{i\in[n]:\mathbf{x}^{(i)}=[y_{i}\cdot\bm{\mu}_{l},\bm{\xi}_{i}]\right\}$, we have

$$\gamma_{j,r,l}^{(t)}=-\frac{\eta}{nm}\sum_{s=0}^{t}\sum_{i\in U_{l}}{\ell_{i}^{\prime}}^{(s)}\cdot\sigma^{\prime}\left(\left\langle\mathbf{w}_{j,r}^{(s)},y_{i}\bm{\mu}_{l}\right\rangle\right)\cdot\|\bm{\mu}_{l}\|_{2}^{2}. \tag{19}$$

Now with the notation $\bar{\rho}_{j,r,i}^{(t)}:=\rho_{j,r,i}^{(t)}\mathbb{1}\left(\rho_{j,r,i}^{(t)}\geq 0\right)$, $\underline{\rho}_{j,r,i}^{(t)}:=\rho_{j,r,i}^{(t)}\mathbb{1}\left(\rho_{j,r,i}^{(t)}\leq 0\right)$ and the fact ${\ell_{i}^{\prime}}^{(s)}<0$, we get

$$\bar{\rho}_{j,r,i}^{(t)}=-\frac{\eta}{nm}\sum_{s=0}^{t}{\ell_{i}^{\prime}}^{(s)}\cdot\sigma^{\prime}\left(\left\langle\mathbf{w}_{j,r}^{(s)},\bm{\xi}_{i}\right\rangle\right)\cdot\left\|\bm{\xi}_{i}\right\|_{2}^{2}\cdot\mathbb{1}\left(y_{i}=j\right), \tag{20}$$
$$\underline{\rho}_{j,r,i}^{(t)}=\frac{\eta}{nm}\sum_{s=0}^{t}{\ell_{i}^{\prime}}^{(s)}\cdot\sigma^{\prime}\left(\left\langle\mathbf{w}_{j,r}^{(s)},\bm{\xi}_{i}\right\rangle\right)\cdot\left\|\bm{\xi}_{i}\right\|_{2}^{2}\cdot\mathbb{1}\left(y_{i}=-j\right). \tag{21}$$

The proof is completed.
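
The sign split behind (20) and (21) can be illustrated numerically. The following is a minimal sketch and not part of the proof: the helper name `split_by_sign` and the sample values are hypothetical. It applies the two indicator definitions of $\bar{\rho}$ and $\underline{\rho}$ to a list of coefficients and recombines the two parts.

```python
def split_by_sign(rho):
    """Split coefficients into non-negative and non-positive parts.

    Mirrors rho_bar = rho * 1(rho >= 0) and rho_under = rho * 1(rho <= 0).
    """
    rho_bar = [r if r >= 0 else 0.0 for r in rho]
    rho_under = [r if r <= 0 else 0.0 for r in rho]
    return rho_bar, rho_under


# Hypothetical coefficient values, chosen only for illustration.
rho = [0.7, -0.2, 0.0, 1.3, -0.05]
rho_bar, rho_under = split_by_sign(rho)

# Each coefficient is recovered as the sum of its two parts.
recombined = [a + b for a, b in zip(rho_bar, rho_under)]
```

A zero coefficient satisfies both indicators, but it contributes zero to each part, so the recombination is unaffected.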

Remark G.8.

The proof strategy follows the feature learning analysis techniques of Cao et al. [2022a], Kou et al. [2023b], and Meng et al. [2023]. However, our decomposition considers two task-specific features that appear in different proportions. This disparity ultimately leads to distinct learning efficiency among samples, as well as different generalization ability.

Next, we study how the scale of the coefficients in the signal-noise decomposition evolves during training. Let $T^{*}=\eta^{-1}\operatorname{poly}\left(\varepsilon^{-1},d,n,m\right)$ be the maximum admissible number of iterations. Denote

$$\alpha:=4\log\left(T^{*}\right), \tag{22}$$
$$\beta:=2\max_{l,i,j,r}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}_{l}\right\rangle\right|,\left|\left\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\right\rangle\right|\right\},$$
$$\operatorname{SNR}_{l}:=\frac{\|\bm{\mu}_{l}\|_{2}}{\sigma_{p}\sqrt{d}}.$$

By Lemma G.2, $\beta$ can be bounded by $4\sigma_{0}\cdot\max\left\{\sqrt{\log\frac{16mn}{\delta}}\cdot\sigma_{p}\sqrt{d},\ \sqrt{\log\frac{16m}{\delta}}\cdot\|\bm{\mu}_{l}\|_{2}\right\}$. Under Condition 3.1, it is straightforward to verify the following inequality with a large constant $C$:

$$\max_{l}\left\{\beta,\ \operatorname{SNR}_{l}\sqrt{\frac{32\log\left(\frac{12n}{\delta}\right)}{d}}\,n\alpha,\ 5\sqrt{\frac{\log\left(\frac{6n^{2}}{\delta}\right)}{d}}\,n\alpha\right\}\leq\frac{1}{12}. \tag{23}$$
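
To get a feel for how demanding (23) is, the three quantities can be evaluated for concrete parameter values. The sketch below is an illustration under stated assumptions: the function name `check_condition_23` and every numeric value are hypothetical, and $\beta$ is replaced by its upper bound from Lemma G.2, so a passing check is sufficient for the $\beta$ term but not necessary.

```python
import math


def check_condition_23(n, m, d, delta, sigma_0, sigma_p, mu_norms, t_star):
    """Evaluate the three quantities in (23) and compare them with 1/12.

    beta is replaced by its upper bound from Lemma G.2, maximized over
    the feature norms in mu_norms (one entry per feature l).
    """
    alpha = 4.0 * math.log(t_star)
    noise_scale = sigma_p * math.sqrt(d)

    beta_bound = 4.0 * sigma_0 * max(
        math.sqrt(math.log(16.0 * m * n / delta)) * noise_scale,
        max(math.sqrt(math.log(16.0 * m / delta)) * mu for mu in mu_norms),
    )
    snr_term = max(
        (mu / noise_scale)
        * math.sqrt(32.0 * math.log(12.0 * n / delta) / d)
        * n
        * alpha
        for mu in mu_norms
    )
    cross_term = 5.0 * math.sqrt(math.log(6.0 * n ** 2 / delta) / d) * n * alpha

    terms = {"beta_bound": beta_bound, "snr_term": snr_term, "cross_term": cross_term}
    return terms, max(terms.values()) <= 1.0 / 12.0


# Hypothetical setting: small labeled set, very high dimension, tiny initialization.
terms, ok = check_condition_23(
    n=20, m=10, d=1e11, delta=0.1,
    sigma_0=1e-8, sigma_p=1.0, mu_norms=[5.0, 1.0], t_star=1e6,
)
```

In this hypothetical setting the third term is the binding one: it scales as $n\alpha/\sqrt{d}$, which is why the check only passes when $d$ is very large relative to $n$.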

We then assert that the following proposition holds for the entire training period. It bounds the scale of the coefficients as training proceeds.

Proposition G.9.

Under Condition 3.1, for $0\leq t\leq T^{*}$, there exists a positive constant $C^{\prime}$ such that

$$0\leq\gamma_{j,r,l}^{(t)}\leq C^{\prime}\cdot\tau_{l}n\cdot\operatorname{SNR}_{l}^{2}\cdot\alpha, \tag{24}$$
$$0\leq\bar{\rho}_{j,r,i}^{(t)}\leq\alpha,$$
$$0\geq\underline{\rho}_{j,r,i}^{(t)}\geq-\beta-10\sqrt{\frac{\log\left(\frac{6n^{2}}{\delta}\right)}{d}}\,n\alpha\geq-\alpha,$$

for all $j\in\{\pm 1\}$, $r\in[m]$, $l\in\{1,2\}$ and $i\in[n]$.

Remark G.10.

Our results resemble those in the study of feature learning of CNN [Cao et al., 2022a, Kou et al., 2023b, Meng et al., 2023, Lu et al., 2023]. However, the scale of our learning progress coefficient $\gamma_{j,r,l}^{(t)}$ depends on the proportion and strength of its corresponding feature in the labeled data distribution, which significantly impacts the learning process of specific types of data.

Proof of Proposition G.9. See Proposition C.2 and Proposition C.8 in Kou et al. [2023b] or Proposition C.2 and Proposition C.8 in Meng et al. [2023] for a proof. Despite the differences in data settings, the result follows readily from the same inductive techniques.

Based on Proposition G.9, we can analyze the convergence of the training dynamics via identifying the degree of feature learning and noise memorization in the following section.

G.3 Feature Learning and Noise Memorization Analysis

In this section, we adopt a two-stage analysis to evaluate the evolution of the coefficients. In the first stage, the loss function’s derivative remains nearly constant due to the small weight initialization. In the second stage, the derivative is no longer constant, and a more careful analysis is required. We will see that the scale differences established in the first stage persist. Note that the results in this section are conditioned on the results of Appendix G.2, which hold with high probability.

G.3.1 First Stage: Feature Learning versus Noise Memorization

Lemma G.11.

There exist

$$T_{1}=C_{3}\eta^{-1}nm\sigma_{p}^{-2}d^{-1},\quad T_{2}=C_{4}\eta^{-1}nm\sigma_{p}^{-2}d^{-1},$$

where $C_{3}=\Theta(1)$ is a large constant and $C_{4}=\Theta(1)$ is a small constant, such that

  • $\max_{j,r}\gamma_{j,r,l}^{(t)}=O\left(\tau_{l}n\cdot\operatorname{SNR}_{l}^{2}\right)$, for all $0\leq t\leq T_{1}$, $l\in\{1,2\}$.

  • $\min_{j,r}\gamma_{j,r,l}^{(t)}=\Omega\left(\tau_{l}n\cdot\operatorname{SNR}_{l}^{2}\right)$, for all $t\geq T_{2}$, $l\in\{1,2\}$.

  • $\bar{\rho}_{j,r^{*},i}^{(T_{1})}\geq 2$, for any $r^{*}\in S_{i}^{(0)}=\left\{r\in[m]:\left\langle\mathbf{w}_{y_{i},r}^{(0)},\bm{\xi}_{i}\right\rangle>0\right\}$, $j\in\{\pm 1\}$ and $i\in[n]$ with $y_{i}=j$.

  • $\max_{j,r,i}\left|\underline{\rho}_{j,r,i}^{(t)}\right|=\max\left\{O\left(\sqrt{\log\left(\frac{mn}{\delta}\right)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}\right),O\left(n\sqrt{\log\left(\frac{n}{\delta}\right)}\log\left(T^{*}\right)/\sqrt{d}\right)\right\}$, for all $0\leq t\leq T_{1}$.

  • $\max_{j,r}\bar{\rho}_{j,r,i}^{(T_{1})}=O(1)$, for all $i\in[n]$.

Proof of Lemma G.11. See Lemma D.1 in Kou et al. [2023b] or Lemma D.1 and Propositions D.2-D.4 in Meng et al. [2023] for a proof.

G.3.2 Second Stage: Convergence of Training Error

At the end of the first stage, we have the following feature-to-noise decomposition:

$$\mathbf{w}_{j,r}^{(T_{1})}=\mathbf{w}_{j,r}^{(0)}+j\cdot\sum_{l=1}^{2}\gamma_{j,r,l}^{(T_{1})}\cdot\frac{\bm{\mu}_{l}}{\|\bm{\mu}_{l}\|_{2}^{2}}+\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(T_{1})}\cdot\frac{\bm{\xi}_{i}}{\left\|\bm{\xi}_{i}\right\|_{2}^{2}}+\sum_{i=1}^{n}\underline{\rho}_{j,r,i}^{(T_{1})}\cdot\frac{\bm{\xi}_{i}}{\left\|\bm{\xi}_{i}\right\|_{2}^{2}}$$

for $j\in\{\pm 1\}$ and $r\in[m]$. Applying the results obtained in the first stage, the following properties hold at the beginning of this stage:

  • $\gamma_{j,r,l}^{(T_{1})}=\Theta\left(\tau_{l}n\cdot\operatorname{SNR}_{l}^{2}\right)$ for any $j\in\{\pm 1\}$, $r\in[m]$.

  • $\bar{\rho}_{j,r^{*},i}^{(T_{1})}\geq 2$ for any $r^{*}\in S_{i}^{(0)}=\left\{r\in[m]:\left\langle\mathbf{w}_{y_{i},r}^{(0)},\bm{\xi}_{i}\right\rangle>0\right\}$, $j\in\{\pm 1\}$ and $i\in[n]$ with $y_{i}=j$.

  • $\max_{j,r,i}\left|\underline{\rho}_{j,r,i}^{(T_{1})}\right|=\max\left\{O\left(\sqrt{\log\left(\frac{mn}{\delta}\right)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}\right),O\left(n\sqrt{\log\left(\frac{n}{\delta}\right)}\log\left(T^{*}\right)/\sqrt{d}\right)\right\}$.

Following the technique in Cao et al. [2022a], we now choose $\mathbf{W}^{*}$ as follows:

$$\mathbf{w}_{j,r}^{*}=\mathbf{w}_{j,r}^{(0)}+5\log\left(\frac{2}{\varepsilon}\right)\left[\sum_{i=1}^{n}\mathbb{1}\left(j=y_{i}\right)\cdot\frac{\bm{\xi}_{i}}{\left\|\bm{\xi}_{i}\right\|_{2}^{2}}\right].$$

Lemma G.12.

Under Condition 3.1, we have

$$\max_{j,r,i}\left|\underline{\rho}_{j,r,i}^{(t)}\right|=\max\left\{O\left(\sqrt{\log\left(\frac{mn}{\delta}\right)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}\right),O\left(n\sqrt{\log\left(\frac{n}{\delta}\right)}\log\left(T^{*}\right)/\sqrt{d}\right)\right\},$$

for all $T_{1}\leq t\leq T^{*}$. Besides,

$$\frac{1}{t-T_{1}+1}\sum_{s=T_{1}}^{t}L_{S}\left(\mathbf{W}^{(s)}\right)\leq\frac{\left\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\right\|_{F}^{2}}{\eta\left(t-T_{1}+1\right)}+\varepsilon$$

for all $T_{1}\leq t\leq T^{*}$. Therefore, we can find an iterate with training loss smaller than $2\varepsilon$ within $T=T_{1}+\left|\left\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\right\|_{F}^{2}/(\eta\varepsilon)\right|=T_{1}+\widetilde{O}\left(\eta^{-1}\varepsilon^{-1}mnd^{-1}\sigma_{p}^{-2}\right)$ iterations.

Proof of Lemma G.12. See Lemma D.5 in Cao et al. [2022a] or Lemma D.6 in Kou et al. [2023b] for a proof.
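
The iteration count in Lemma G.12 follows from simple arithmetic on the averaged-loss bound: once $t-T_{1}+1\geq\|\mathbf{W}^{(T_{1})}-\mathbf{W}^{*}\|_{F}^{2}/(\eta\varepsilon)$, the right-hand side is at most $2\varepsilon$, and an average that is at most $2\varepsilon$ implies that at least one iterate in the window has loss at most $2\varepsilon$. The sketch below works through this step. The function names and numeric values are hypothetical, and rounding up with `math.ceil` is a choice made here for illustration.

```python
import math


def averaged_loss_bound(dist_sq, eta, eps, t1, t):
    """Right-hand side of the averaged-loss inequality in Lemma G.12."""
    return dist_sq / (eta * (t - t1 + 1)) + eps


def iteration_for_2eps(dist_sq, eta, eps, t1):
    """An iteration t >= t1 at which the averaged-loss bound is at most 2 * eps.

    dist_sq stands for the squared Frobenius distance between W^(T1) and W^*.
    """
    steps = max(1, math.ceil(dist_sq / (eta * eps)))
    return t1 + steps - 1


# Hypothetical values, chosen to be exactly representable in floating point.
dist_sq, eta, eps, t1 = 64.0, 0.25, 0.5, 100
t = iteration_for_2eps(dist_sq, eta, eps, t1)
bound_at_t = averaged_loss_bound(dist_sq, eta, eps, t1, t)
```

The number of extra iterations is proportional to the squared distance and inversely proportional to both $\eta$ and $\varepsilon$, matching the $\eta^{-1}\varepsilon^{-1}$ dependence in the stated bound.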

Note that since $n$ can be $n_{0}$ or $n_{1}$, and $\tau_{l}$ can be any real number denoting the proportion of a specific type of data in the labeled set, this concludes the proof of training loss convergence for all three querying algorithms. The following lemma characterizes the feature-to-noise ratio throughout training.

Lemma G.13.

Under Condition 3.1, we have

$$\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(t)}/\gamma_{j^{\prime},r^{\prime},l}^{(t)}=\Theta\left(\tau_{l}^{-1}\cdot\operatorname{SNR}_{l}^{-2}\right)$$

for all $j,j^{\prime}\in\{\pm 1\}$, $r,r^{\prime}\in[m]$, $l\in\{1,2\}$ and $0\leq t\leq T^{*}$.

Proof of Lemma G.13. See Lemma D.7 in Kou et al. [2023b] or Proposition C.8 in Meng et al. [2023] for a proof.
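
To see how the ratio in Lemma G.13 separates the two features, one can evaluate its $\Theta$-scale $\tau_{l}^{-1}\operatorname{SNR}_{l}^{-2}$ for a frequent, strong feature and a rare, weak one. The following is a rough sketch: the function names and all numeric values are hypothetical, and the hidden constants are ignored, so only the relative comparison between features is meaningful.

```python
import math


def snr(mu_norm, sigma_p, d):
    """SNR_l = ||mu_l||_2 / (sigma_p * sqrt(d))."""
    return mu_norm / (sigma_p * math.sqrt(d))


def noise_to_feature_scale(tau, mu_norm, sigma_p, d):
    """Theta-scale of sum_i rho_bar / gamma from Lemma G.13: 1 / (tau * SNR^2)."""
    return 1.0 / (tau * snr(mu_norm, sigma_p, d) ** 2)


# Hypothetical data model: shared noise level and dimension, two features.
sigma_p, d = 1.0, 1024
frequent_strong = noise_to_feature_scale(tau=0.9, mu_norm=8.0, sigma_p=sigma_p, d=d)
rare_weak = noise_to_feature_scale(tau=0.1, mu_norm=2.0, sigma_p=sigma_p, d=d)

# How much more noise memorization dominates for the rare, weak feature.
relative_gap = rare_weak / frequent_strong
```

A larger value means noise memorization outweighs the learning of that feature by a wider margin. Both a smaller proportion $\tau_{l}$ and a smaller norm $\|\bm{\mu}_{l}\|_{2}$ push the scale up.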

We now summarize the results so far in the following lemma.

Lemma G.14.

(Formal restatement of Lemma 4.1) Under Condition 3.1, there exists $T_{1}=\Theta\left(\eta^{-1}nm\sigma_{p}^{-2}d^{-1}\right)$ such that, for $t\in\left[T_{1},T^{*}\right]$, the following hold:

  • γj,r,l(t)=Θ(τl𝝁l22dσp2)i=1nρ¯j,r,i(t)superscriptsubscript𝛾𝑗𝑟𝑙𝑡Θsubscript𝜏𝑙superscriptsubscriptnormsubscript𝝁𝑙22𝑑superscriptsubscript𝜎𝑝2superscriptsubscript𝑖1𝑛superscriptsubscript¯𝜌𝑗𝑟𝑖𝑡\gamma_{j,r,l}^{(t)}=\Theta\left(\dfrac{\tau_{l}\|\bm{\mu}_{l}\|_{2}^{2}}{d% \sigma_{p}^{2}}\right)\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(t)}italic_γ start_POSTSUBSCRIPT italic_j , italic_r , italic_l end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT = roman_Θ ( divide start_ARG italic_τ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ∥ bold_italic_μ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG start_ARG italic_d italic_σ start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG ) ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT over¯ start_ARG italic_ρ end_ARG start_POSTSUBSCRIPT italic_j , italic_r , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT, for all j{±1},r[m]formulae-sequence𝑗plus-or-minus1𝑟delimited-[]𝑚j\in\{\pm 1\},r\in[m]italic_j ∈ { ± 1 } , italic_r ∈ [ italic_m ] and l{1,2}𝑙12l\in\{1,2\}italic_l ∈ { 1 , 2 } (from Lemma G.13).

  • i=1nρ¯j,r,i(t)=Ω(n)=O(nlog(T))=Θ~(n)superscriptsubscript𝑖1𝑛superscriptsubscript¯𝜌𝑗𝑟𝑖𝑡Ω𝑛𝑂𝑛superscript𝑇~Θ𝑛\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(t)}=\Omega(n)=O(n\log(T^{*}))=\widetilde{% \Theta}(n)∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT over¯ start_ARG italic_ρ end_ARG start_POSTSUBSCRIPT italic_j , italic_r , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT = roman_Ω ( italic_n ) = italic_O ( italic_n roman_log ( italic_T start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) ) = over~ start_ARG roman_Θ end_ARG ( italic_n ), for all j{±1},r[m]formulae-sequence𝑗plus-or-minus1𝑟delimited-[]𝑚j\in\{\pm 1\},r\in[m]italic_j ∈ { ± 1 } , italic_r ∈ [ italic_m ] and l{1,2}𝑙12l\in\{1,2\}italic_l ∈ { 1 , 2 } (from Proposition G.9 and Lemma G.11).

  • maxj,r,i|ρ¯j,r,i(t)|=max{O(σ0σpdlog(mnδ)),O(log(nδ)log(T)n/d)}subscript𝑗𝑟𝑖superscriptsubscript¯𝜌𝑗𝑟𝑖𝑡𝑂subscript𝜎0subscript𝜎𝑝𝑑𝑚𝑛𝛿𝑂𝑛𝛿superscript𝑇𝑛𝑑\max_{j,r,i}\lvert\underline{\rho}_{j,r,i}^{(t)}\rvert=\max\{O(\sigma_{0}% \sigma_{p}\sqrt{d}\cdot\sqrt{\log(\dfrac{mn}{\delta})}),O(\sqrt{\log(\dfrac{n}% {\delta})}\log(T^{*})\cdot n/\sqrt{d})\}roman_max start_POSTSUBSCRIPT italic_j , italic_r , italic_i end_POSTSUBSCRIPT | under¯ start_ARG italic_ρ end_ARG start_POSTSUBSCRIPT italic_j , italic_r , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT | = roman_max { italic_O ( italic_σ start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT italic_σ start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT square-root start_ARG italic_d end_ARG ⋅ square-root start_ARG roman_log ( divide start_ARG italic_m italic_n end_ARG start_ARG italic_δ end_ARG ) end_ARG ) , italic_O ( square-root start_ARG roman_log ( divide start_ARG italic_n end_ARG start_ARG italic_δ end_ARG ) end_ARG roman_log ( italic_T start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ) ⋅ italic_n / square-root start_ARG italic_d end_ARG ) }, for all j{±1},r[m]formulae-sequence𝑗plus-or-minus1𝑟delimited-[]𝑚j\in\{\pm 1\},r\in[m]italic_j ∈ { ± 1 } , italic_r ∈ [ italic_m ] and l{1,2}𝑙12l\in\{1,2\}italic_l ∈ { 1 , 2 } (from Lemma G.12).
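
Combining the first two items gives the scale of the feature coefficients directly. As a brief sketch, write $\operatorname{SNR}_{l}=\|\bm{\mu}_{l}\|_{2}/(\sigma_{p}\sqrt{d})$ for the feature-to-noise ratio, so that $\|\bm{\mu}_{l}\|_{2}^{2}/(d\sigma_{p}^{2})=\operatorname{SNR}_{l}^{2}$. Then

$$\gamma_{j,r,l}^{(t)}=\Theta\left(\tau_{l}\operatorname{SNR}_{l}^{2}\right)\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(t)}=\Omega\left(\tau_{l}n\cdot\operatorname{SNR}_{l}^{2}\right),\qquad\gamma_{j,r,l}^{(t)}=O\left(\tau_{l}n\log(T^{*})\cdot\operatorname{SNR}_{l}^{2}\right).$$

The lower bound is the form of the estimate invoked in the proof of Lemma G.15.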

Lemma G.15.

Under Condition 3.1, there exists $t=\widetilde{O}\left(\eta^{-1}\varepsilon^{-1}mnd^{-1}\sigma_{p}^{-2}\right)$ such that

$$\begin{aligned}
\left\|\mathbf{w}_{j,r}^{(t)}\right\|_{2}&\leq\Theta\left(\sigma_{p}^{-1}d^{-\frac{1}{2}}n^{\frac{1}{2}}\right),\\
\left\langle\mathbf{w}_{y,r}^{(t)},y\bm{\mu}_{l}\right\rangle&=\Theta\left(\gamma_{y,r,l}^{(t)}\right),\\
\left\langle\mathbf{w}_{-y,r}^{(t)},y\bm{\mu}_{l}\right\rangle&=-\Theta\left(\gamma_{-y,r,l}^{(t)}\right)<0,
\end{aligned}\tag{25}$$

for all $j\in\{\pm 1\}$, $r\in[m]$ and $l\in\{1,2\}$.

Proof of Lemma G.15. Recall the signal-noise decomposition of $\mathbf{w}_{j,r}^{(t)}$:

$$\mathbf{w}_{j,r}^{(t)}=\mathbf{w}_{j,r}^{(0)}+j\cdot\sum_{l=1}^{2}\gamma_{j,r,l}^{(t)}\cdot\frac{\bm{\mu}_{l}}{\|\bm{\mu}_{l}\|_{2}^{2}}+\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(t)}\cdot\frac{\bm{\xi}_{i}}{\left\|\bm{\xi}_{i}\right\|_{2}^{2}}+\sum_{i=1}^{n}\underline{\rho}_{j,r,i}^{(t)}\cdot\frac{\bm{\xi}_{i}}{\left\|\bm{\xi}_{i}\right\|_{2}^{2}}.$$

For $l\in\{1,2\}$, we can bound the inner product with $j=y$:

$$\begin{aligned}
\left\langle\mathbf{w}_{y,r}^{(t)},y\bm{\mu}_{l}\right\rangle={}&\left\langle\mathbf{w}_{y,r}^{(0)},y\bm{\mu}_{l}\right\rangle+\gamma_{y,r,l}^{(t)}+\sum_{i=1}^{n}\bar{\rho}_{y,r,i}^{(t)}\cdot\left\|\bm{\xi}_{i}\right\|_{2}^{-2}\cdot\left\langle\bm{\xi}_{i},y\bm{\mu}_{l}\right\rangle+\sum_{i=1}^{n}\underline{\rho}_{y,r,i}^{(t)}\cdot\left\|\bm{\xi}_{i}\right\|_{2}^{-2}\cdot\left\langle\bm{\xi}_{i},y\bm{\mu}_{l}\right\rangle\\
\geq{}&\gamma_{y,r,l}^{(t)}-\sqrt{2\log\left(\dfrac{16m}{\delta}\right)}\cdot\sigma_{0}\|\bm{\mu}_{l}\|_{2}-\sqrt{2\log\left(\dfrac{12n}{\delta}\right)}\cdot\sigma_{p}\|\bm{\mu}_{l}\|_{2}\cdot\left(\dfrac{\sigma_{p}^{2}d}{2}\right)^{-1}\left[\sum_{i=1}^{n}\bar{\rho}_{y,r,i}^{(t)}+\sum_{i=1}^{n}\left|\underline{\rho}_{y,r,i}^{(t)}\right|\right]\\
={}&\gamma_{y,r,l}^{(t)}-\Theta\left(\sqrt{\log\left(\dfrac{m}{\delta}\right)}\sigma_{0}\|\bm{\mu}_{l}\|_{2}\right)-\Theta\left(\sqrt{\log\left(\dfrac{n}{\delta}\right)}\cdot\left(\sigma_{p}d\right)^{-1}\|\bm{\mu}_{l}\|_{2}\right)\cdot\Theta\left(\operatorname{SNR}_{l}^{-2}\right)\cdot\gamma_{y,r,l}^{(t)}\\
={}&\left[1-\Theta\left(\sqrt{\log\left(\dfrac{n}{\delta}\right)}\cdot\sigma_{p}/\|\bm{\mu}_{l}\|_{2}\right)\right]\gamma_{y,r,l}^{(t)}-\Theta\left(\sqrt{\log\left(\dfrac{m}{\delta}\right)}\left(\sigma_{p}d\right)^{-1}\sqrt{n}\|\bm{\mu}_{l}\|_{2}\right)\\
={}&\Theta\left(\gamma_{y,r,l}^{(t)}\right),
\end{aligned}\tag{26}$$

where the inequality is justified by Lemma G.1 and Lemma G.2. The second equality is obtained by substituting the coefficient scales from Lemma G.14. The third equality follows from the condition $\sigma_{0}\leq C^{-1}\left(\sigma_{p}d\right)^{-1}\sqrt{n}$ in Condition 3.1 and the definition of the feature-to-noise ratio $\operatorname{SNR}_{l}=\dfrac{\|\bm{\mu}_{l}\|_{2}}{\sigma_{p}\sqrt{d}}$.
For the fourth equality, note that $\gamma_{j,r,l}^{(t)}=\Omega(\tau_{l}n\cdot\operatorname{SNR}_{l}^{2})$, that $\sqrt{\log\left(\dfrac{n}{\delta}\right)}\cdot\dfrac{\sigma_{p}}{\|\bm{\mu}_{l}\|_{2}}\leq 1/\sqrt{C}$, and that

$$\sqrt{\log\left(\dfrac{m}{\delta}\right)}\left(\sigma_{p}d\right)^{-1}\dfrac{\sqrt{n}\|\bm{\mu}_{l}\|_{2}}{\tau_{l}n\cdot\operatorname{SNR}_{l}^{2}}=\sqrt{\log\left(\dfrac{m}{\delta}\right)}\dfrac{\sigma_{p}}{\tau_{l}\sqrt{n}\|\bm{\mu}_{l}\|_{2}}\leq\sqrt{\log\left(\dfrac{m}{\delta}\right)/n}\cdot\dfrac{1}{\sqrt{C\log\left(\dfrac{n}{\delta}\right)}}\leq\dfrac{1}{C\sqrt{\log\left(\dfrac{n}{\delta}\right)}},$$

which holds due to $\|\bm{\mu}_{l}\|_{2}^{2}\geq C\cdot\sigma_{p}^{2}\log\left(\dfrac{n}{\delta}\right)$ and $n\geq C\log\left(\dfrac{m}{\delta}\right)$ in Condition 3.1. Therefore, for a sufficiently large constant $C$, the equality holds. Moreover, we can deduce in a similar manner that

$$\begin{aligned}
\left\langle\mathbf{w}_{-y,r}^{(t)},y\bm{\mu}_{l}\right\rangle={}&\left\langle\mathbf{w}_{-y,r}^{(0)},y\bm{\mu}_{l}\right\rangle-\gamma_{-y,r,l}^{(t)}+\sum_{i=1}^{n}\bar{\rho}_{-y,r,i}^{(t)}\cdot\left\|\bm{\xi}_{i}\right\|_{2}^{-2}\cdot\left\langle\bm{\xi}_{i},y\bm{\mu}_{l}\right\rangle+\sum_{i=1}^{n}\underline{\rho}_{-y,r,i}^{(t)}\cdot\left\|\bm{\xi}_{i}\right\|_{2}^{-2}\cdot\left\langle\bm{\xi}_{i},y\bm{\mu}_{l}\right\rangle\\
\leq{}&-\gamma_{-y,r,l}^{(t)}+\sqrt{2\log\left(\dfrac{16m}{\delta}\right)}\cdot\sigma_{0}\|\bm{\mu}_{l}\|_{2}+\sqrt{2\log\left(\dfrac{12n}{\delta}\right)}\cdot\sigma_{p}\|\bm{\mu}_{l}\|_{2}\cdot\left(\dfrac{\sigma_{p}^{2}d}{2}\right)^{-1}\left[\sum_{i=1}^{n}\bar{\rho}_{-y,r,i}^{(t)}+\sum_{i=1}^{n}\left|\underline{\rho}_{-y,r,i}^{(t)}\right|\right]\\
={}&-\Theta\left(\gamma_{-y,r,l}^{(t)}\right)<0.
\end{aligned}\tag{27}$$
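
The first equalities in (26) and (27) are purely algebraic consequences of the signal-noise decomposition, so they can be sanity-checked numerically. The following Python sketch does this under assumptions that are not taken from the paper: the two feature vectors are orthogonal, the noise vectors and the initialization are i.i.d. Gaussian, and every numeric value (`d`, `n`, `sigma_0`, `sigma_p`, the coefficients) is an arbitrary illustrative choice. The helper names `dot`, `build_weight` and `expansion` are hypothetical.

```python
import random


def dot(u, v):
    """Euclidean inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def build_weight(w0, j, gammas, mus, rhos, xis):
    """Assemble w = w0 + j * sum_l gamma_l * mu_l / ||mu_l||^2
                      + sum_i rho_i * xi_i / ||xi_i||^2,
    where rho_i stands for the combined coefficient rho_bar_i + rho_under_i."""
    w = list(w0)
    for gamma, mu in zip(gammas, mus):
        scale = j * gamma / dot(mu, mu)
        w = [wk + scale * mk for wk, mk in zip(w, mu)]
    for rho, xi in zip(rhos, xis):
        scale = rho / dot(xi, xi)
        w = [wk + scale * xk for wk, xk in zip(w, xi)]
    return w


def expansion(w0, j, y, l, gammas, mus, rhos, xis):
    """Right-hand side of the first equality in (26) (j = y) or (27) (j = -y).
    Relies on the assumption that mu_1 and mu_2 are orthogonal."""
    target = [y * m for m in mus[l]]
    noise = sum(rho * dot(xi, target) / dot(xi, xi) for rho, xi in zip(rhos, xis))
    return dot(w0, target) + j * y * gammas[l] + noise


random.seed(0)
d, n, y = 500, 10, 1                  # toy dimensions and label (assumptions)
sigma_0, sigma_p = 1e-3, 1.0          # toy initialization and noise scales
# Two orthogonal feature vectors with disjoint support (assumption).
mus = [[5.0] + [0.0] * (d - 1), [0.0, 2.0] + [0.0] * (d - 2)]
xis = [[random.gauss(0.0, sigma_p) for _ in range(d)] for _ in range(n)]
w0 = [random.gauss(0.0, sigma_0) for _ in range(d)]
gammas = [3.0, 1.0]                   # toy feature coefficients
# Combined noise coefficients: a positive part plus a small negative part.
rhos = [random.uniform(0.5, 1.0) - random.uniform(0.0, 0.01) for _ in range(n)]

checks = []
for j in (y, -y):
    w = build_weight(w0, j, gammas, mus, rhos, xis)
    for l in range(2):
        lhs = dot(w, [y * m for m in mus[l]])
        rhs = expansion(w0, j, y, l, gammas, mus, rhos, xis)
        checks.append((j, l, lhs, rhs))
        print(f"j={j:+d}, l={l + 1}: <w, y*mu_l> = {lhs:+.4f}, expansion = {rhs:+.4f}")
```

With these toy values the feature coefficient dominates both the initialization term and the noise cross terms, so the inner product is positive for $j=y$ and negative for $j=-y$, in line with the $\Theta(\gamma)$ and $-\Theta(\gamma)$ conclusions. This is only an illustration and does not replace the high-probability bounds from Lemma G.1 and Lemma G.2.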

Next, we upper bound $\|\mathbf{w}_{j,r}^{(t)}\|_{2}$. The techniques are similar to those of Proposition D.5 in Meng et al. [2023]. We first handle the noise term in the decomposition, namely:

$$\begin{aligned}
&\left\|\sum_{i=1}^{n}\rho_{j,r,i}^{(t)}\cdot\dfrac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}\right\|_{2}^{2}\\
={}&\sum_{i=1}^{n}\left(\rho_{j,r,i}^{(t)}\right)^{2}\cdot\|\bm{\xi}_{i}\|_{2}^{-2}+2\sum_{1\leq i_{1}<i_{2}\leq n}\rho_{j,r,i_{1}}^{(t)}\rho_{j,r,i_{2}}^{(t)}\cdot\dfrac{\left\langle\bm{\xi}_{i_{1}},\bm{\xi}_{i_{2}}\right\rangle}{\|\bm{\xi}_{i_{1}}\|_{2}^{2}\cdot\|\bm{\xi}_{i_{2}}\|_{2}^{2}}\\
\leq{}&4\sigma_{p}^{-2}d^{-1}\sum_{i=1}^{n}\left(\rho_{j,r,i}^{(t)}\right)^{2}+2\sum_{1\leq i_{1}<i_{2}\leq n}\rho_{j,r,i_{1}}^{(t)}\rho_{j,r,i_{2}}^{(t)}\cdot\left(16\sigma_{p}^{-4}d^{-2}\right)\cdot\left(2\sigma_{p}^{2}\sqrt{d\log\left(\dfrac{6n^{2}}{\delta}\right)}\right)\\
={}&4\sigma_{p}^{-2}d^{-1}\sum_{i=1}^{n}\left(\rho_{j,r,i}^{(t)}\right)^{2}+32\sigma_{p}^{-2}d^{-3/2}\sqrt{\log\left(\dfrac{6n^{2}}{\delta}\right)}\left[\left(\sum_{i=1}^{n}\rho_{j,r,i}^{(t)}\right)^{2}-\sum_{i=1}^{n}\left(\rho_{j,r,i}^{(t)}\right)^{2}\right]\\
={}&\Theta\left(\sigma_{p}^{-2}d^{-1}\right)\sum_{i=1}^{n}\left(\rho_{j,r,i}^{(t)}\right)^{2}+\widetilde{\Theta}\left(\sigma_{p}^{-2}d^{-3/2}\right)\left(\sum_{i=1}^{n}\rho_{j,r,i}^{(t)}\right)^{2}\\
\leq{}&\left[\Theta\left(\sigma_{p}^{-2}d^{-1}n^{-1}\right)+\widetilde{\Theta}\left(\sigma_{p}^{-2}d^{-3/2}\right)\right]\left(\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(t)}+\sum_{i=1}^{n}\rho_{j,r,i}^{(t)}\right)^{2}
\end{aligned}\tag{28}$$
=\displaystyle== Θ(σp2d1n1)(i=1nρ¯j,r,i(t))2,Θsuperscriptsubscript𝜎𝑝2superscript𝑑1superscript𝑛1superscriptsuperscriptsubscript𝑖1𝑛superscriptsubscript¯𝜌𝑗𝑟𝑖𝑡2\displaystyle\Theta\left(\sigma_{p}^{-2}d^{-1}n^{-1}\right)\left(\sum_{i=1}^{n% }\bar{\rho}_{j,r,i}^{(t)}\right)^{2},roman_Θ ( italic_σ start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT italic_d start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_n start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ) ( ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT over¯ start_ARG italic_ρ end_ARG start_POSTSUBSCRIPT italic_j , italic_r , italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ( italic_t ) end_POSTSUPERSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ,

where the first inequality is by Lemma G.1 and the second inequality is by the Cauchy-Schwarz inequality applied to $\left(\sum_{i=1}^{n}\rho_{j,r,i}^{(t)}\right)^{2}$. We can then upper bound $\|\mathbf{w}_{j,r}^{(t)}\|_{2}$ as:

\begin{align*}
\left\|\mathbf{w}_{j,r}^{(t)}\right\|_{2} &\leq \left\|\mathbf{w}_{j,r}^{(0)}\right\|_{2}+\sum_{l=1}^{2}\frac{\gamma_{j,r,l}^{(t)}}{\|\bm{\mu}_{l}\|_{2}}+\left\|\sum_{i=1}^{n}\rho_{j,r,i}^{(t)}\cdot\frac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}\right\|_{2} \tag{29}\\
&\leq \left\|\mathbf{w}_{j,r}^{(0)}\right\|_{2}+\sum_{l=1}^{2}\frac{\gamma_{j,r,l}^{(t)}}{\|\bm{\mu}_{l}\|_{2}}+\Theta\left(\sigma_{p}^{-1}d^{-1/2}n^{-1/2}\right)\cdot\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(t)}\\
&= \Theta\left(\sigma_{p}^{-1}d^{-1/2}n^{-1/2}\right)\cdot\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(t)},
\end{align*}

where the first inequality is due to the triangle inequality, the second inequality is by (28), and the third equality is due to the following comparisons:

\[
\frac{\dfrac{\gamma_{j,r,l}^{(t)}}{\|\bm{\mu}_{l}\|_{2}}}{\Theta\left(\sigma_{p}^{-1}d^{-1/2}n^{-1/2}\right)\cdot\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(t)}}=\Theta\left(\sigma_{p}d^{1/2}n^{1/2}\|\bm{\mu}_{l}\|_{2}^{-1}\operatorname{SNR}_{l}^{2}\right)=\Theta\left(\sigma_{p}^{-1}d^{-1/2}n^{1/2}\|\bm{\mu}_{l}\|_{2}\right)=O(1),
\]

which is by the coefficient scales in Lemma G.14, the coefficient order $\dfrac{\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(t)}}{\gamma_{j,r,l}^{(t)}}=\Theta\left(\operatorname{SNR}_{l}^{-2}\right)$, and the condition on $d$ in Condition 3.1. We also have:

\[
\frac{\left\|\mathbf{w}_{j,r}^{(0)}\right\|_{2}}{\Theta\left(\sigma_{p}^{-1}d^{-1/2}n^{-1/2}\right)\cdot\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(t)}}=\frac{\Theta\left(\sigma_{0}\sqrt{d}\right)}{\Theta\left(\sigma_{p}^{-1}d^{-1/2}n^{-1/2}\right)\cdot\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(t)}}=O\left(\sigma_{0}\sigma_{p}dn^{-1/2}\right)=O(1),
\]

which is by the coefficient scales in Lemma G.14 and the condition on $\sigma_{0}$ in Condition 3.1. Applying the coefficient order $\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(t)}=\Omega(n)$ to (29), we directly have $\left\|\mathbf{w}_{j,r}^{(t)}\right\|_{2}\leq\Theta\left(\sigma_{p}^{-1}d^{-\frac{1}{2}}n^{\frac{1}{2}}\right)$.
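As a numerical sanity check on the noise term in (29), the following Python sketch compares the exact squared norm of $\sum_{i}\rho_{i}\,\bm{\xi}_{i}/\|\bm{\xi}_{i}\|_{2}^{2}$ with its diagonal approximation $\sum_{i}\rho_{i}^{2}/(\sigma_{p}^{2}d)$ for independent Gaussian noise vectors. The function name, the equal coefficients, and the dimensions are illustrative assumptions and are not part of the proof; the sketch only illustrates that the cross terms are small when $d$ is large relative to $n$.

```python
import random
from typing import List, Tuple


def noise_combination_sq_norm(
    rho: List[float], sigma_p: float, d: int, rng: random.Random
) -> Tuple[float, float]:
    """Return (exact, diagonal) for v = sum_i rho_i * xi_i / ||xi_i||^2.

    exact    : ||v||_2^2 with xi_i drawn i.i.d. from N(0, sigma_p^2 I_d)
    diagonal : sum_i rho_i^2 / (sigma_p^2 d), i.e. the value obtained when
               all cross terms <xi_i, xi_j> are ignored and ||xi_i||^2 is
               replaced by its expectation sigma_p^2 d
    """
    acc = [0.0] * d
    for r in rho:
        xi = [rng.gauss(0.0, sigma_p) for _ in range(d)]
        sq_norm = sum(v * v for v in xi)
        scale = r / sq_norm
        for k in range(d):
            acc[k] += scale * xi[k]
    exact = sum(v * v for v in acc)
    diagonal = sum(r * r for r in rho) / (sigma_p ** 2 * d)
    return exact, diagonal


if __name__ == "__main__":
    # Hypothetical setting: n = 10 samples with equal coefficients, d = 2000.
    exact, diagonal = noise_combination_sq_norm([1.0] * 10, 1.0, 2000, random.Random(0))
    print(f"exact = {exact:.6f}, diagonal = {diagonal:.6f}, ratio = {exact / diagonal:.3f}")
```

In this regime the ratio is expected to be close to one, which matches the role of the $\widetilde{\Theta}\left(\sigma_{p}^{-2}d^{-3/2}\right)$ cross-term factor as a lower-order correction.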

G.4 Order-dependent Sampling (Querying) Analysis

Based on the scale of $\mathbf{w}_{j,r}^{(t)}$ and its inner products with the features, we can now characterize the querying behavior of the two query criteria-based NAL methods. First, to avoid the $\Theta(\lvert\mathcal{P}\rvert^{2})$ pairwise comparisons in $\mathcal{P}$, we employ a full-order-based technique. We introduce the concepts of Uncertainty Order and Diversity Order in Appendix F.2. Subsequently, we characterize the order of the samples in $\mathcal{P}$ in the following proposition.

Proposition G.16.

Under the same conditions as Proposition 3, there exists $t=\widetilde{O}\left(\eta^{-1}\varepsilon^{-1}mnd^{-1}\sigma_{p}^{-2}\right)$ such that for all $\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{P}\subsetneq\mathcal{D}$, where $\mathbf{x}$ contains a weak feature patch while $\mathbf{x}^{\prime}$ contains a strong feature patch, with probability at least $1-\delta^{\prime}$ we have $\mathbf{x}^{\prime}\preceq^{(t)}\mathbf{x}$.

Proof of Proposition G.16. First, suppose $\mathbf{x}=[y\cdot\bm{\mu}_{2},\mathbf{z}_{2}]$ and $\mathbf{x}^{\prime}=[y^{\prime}\cdot\bm{\mu}_{1},\mathbf{z}_{1}]$, where $\mathbf{z}_{1},\mathbf{z}_{2}\sim N(\mathbf{0},\sigma_{p}^{2}\cdot\mathbf{I})$. Then:

\begin{align*}
f\left(\mathbf{W}^{(t)},\mathbf{x}\right) &= \sum_{j,r}\frac{j}{m}\left[\sigma\left(\left\langle\mathbf{w}_{j,r}^{(t)},y\bm{\mu}_{2}\right\rangle\right)+\sigma\left(\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{2}\right\rangle\right)\right],\\
f\left(\mathbf{W}^{(t)},\mathbf{x}^{\prime}\right) &= \sum_{j,r}\frac{j}{m}\left[\sigma\left(\left\langle\mathbf{w}_{j,r}^{(t)},y^{\prime}\bm{\mu}_{1}\right\rangle\right)+\sigma\left(\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{1}\right\rangle\right)\right].
\end{align*}

By (11) in Lemma 11 and (16) in Definition 16, we have the following equivalences:

\begin{align*}
\mathbf{x}^{\prime}\preceq_{C}^{(t)}\mathbf{x} &\Leftrightarrow \underbrace{\left|f\left(\mathbf{W}^{(t)},\mathbf{x}\right)\right|<\left|f\left(\mathbf{W}^{(t)},\mathbf{x}^{\prime}\right)\right|}_{\Omega_{C}},\\
\mathbf{x}^{\prime}\preceq_{D}^{(t)}\mathbf{x} &\Leftrightarrow \underbrace{D\left(\mathbf{W}^{(t)},\mathbf{x},p\mid\mathcal{D}_{n_{0}}\right)>D\left(\mathbf{W}^{(t)},\mathbf{x}^{\prime},p\mid\mathcal{D}_{n_{0}}\right)}_{\Omega_{D}},\\
\mathbf{x}^{\prime}\preceq^{(t)}\mathbf{x} &\Leftrightarrow \underbrace{\left\{\Omega_{C}\cap\Omega_{D},\ \forall p\in[1,\infty)\right\}}_{\Omega}
\end{align*}
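To make the uncertainty event $\Omega_{C}$ concrete, the following Python sketch evaluates the two-layer network output on a sample with a signal patch and a noise patch, and then checks whether $|f(\mathbf{W}^{(t)},\mathbf{x})|<|f(\mathbf{W}^{(t)},\mathbf{x}^{\prime})|$. It assumes that $\sigma$ is the ReLU activation, that $j\in\{+1,-1\}$, and that each sign has $m$ filters; the function names and the toy weights are hypothetical and chosen only for illustration.

```python
from typing import Sequence

Vector = Sequence[float]


def relu(v: float) -> float:
    """ReLU activation (assumed form of sigma)."""
    return v if v > 0.0 else 0.0


def dot(a: Vector, b: Vector) -> float:
    return sum(p * q for p, q in zip(a, b))


def network_output(
    w_pos: Sequence[Vector], w_neg: Sequence[Vector], signal: Vector, noise: Vector
) -> float:
    """f(W, x) = sum_{j in {+1,-1}} sum_r (j/m) [relu(<w_jr, signal>) + relu(<w_jr, noise>)].

    `signal` is the signal patch y * mu_l and `noise` is the noise patch z.
    w_pos and w_neg hold the m filters for j = +1 and j = -1 respectively.
    """
    m = len(w_pos)
    total = 0.0
    for j, filters in ((1.0, w_pos), (-1.0, w_neg)):
        for w in filters:
            total += (j / m) * (relu(dot(w, signal)) + relu(dot(w, noise)))
    return total


def precedes_in_uncertainty(f_x_prime: float, f_x: float) -> bool:
    """Event Omega_C: |f(W, x)| < |f(W, x')|, i.e. x' precedes x."""
    return abs(f_x) < abs(f_x_prime)


if __name__ == "__main__":
    mu_strong, mu_weak, zero = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]
    # Toy filters that have learned the strong feature well and the weak one poorly.
    w_pos, w_neg = [[2.0, 0.1, 0.0]], [[-2.0, -0.1, 0.0]]
    f_weak = network_output(w_pos, w_neg, mu_weak, zero)
    f_strong = network_output(w_pos, w_neg, mu_strong, zero)
    print(f_weak, f_strong, precedes_in_uncertainty(f_strong, f_weak))
```

In the toy setting the weak-feature sample has the smaller output magnitude, so it is the one an uncertainty-based query would prioritize.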

Denote $\sum_{j}j\cdot\sigma\left(\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{1}\right\rangle\right)$ and $\sum_{j}j\cdot\sigma\left(\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{2}\right\rangle\right)$ by $g_{r}(\mathbf{z}_{1})$ and $g_{r}(\mathbf{z}_{2})$, respectively. Notice that for $\mathbf{z}\sim N(\mathbf{0},\sigma_{p}^{2}\cdot\mathbf{I})$:

\begin{align*}
\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}\right\rangle &\sim \mathcal{N}\left(0,\left\|\mathbf{w}_{j,r}^{(t)}\right\|_{2}^{2}\sigma_{p}^{2}\right), \tag{30}\\
\sigma\left(\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}\right\rangle\right) &\sim \mathcal{N}^{R}\left(0,\left\|\mathbf{w}_{j,r}^{(t)}\right\|_{2}^{2}\sigma_{p}^{2}\right).
\end{align*}
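The rectified Gaussian above can be illustrated numerically. The Python sketch below assumes that $\sigma$ is the ReLU activation, so that the pre-activation is a scalar Gaussian with standard deviation $\|\mathbf{w}_{j,r}^{(t)}\|_{2}\sigma_{p}$; it compares the empirical mean of the rectified samples with the closed form $s/\sqrt{2\pi}$ for a centered Gaussian with standard deviation $s$, and reports the fraction of samples mapped to zero. The function names and the numerical values are illustrative assumptions.

```python
import math
import random
from typing import Tuple


def rectified_gaussian_mean(std: float) -> float:
    """E[max(0, X)] for X ~ N(0, std^2), which equals std / sqrt(2 * pi)."""
    return std / math.sqrt(2.0 * math.pi)


def simulate_relu_of_inner_product(
    w_norm: float, sigma_p: float, num_samples: int, seed: int = 0
) -> Tuple[float, float]:
    """Sample relu(<w, z>) with z ~ N(0, sigma_p^2 I) via its scalar law.

    <w, z> is Gaussian with mean 0 and standard deviation ||w||_2 * sigma_p,
    so only the norm of w is needed. Returns (empirical mean, fraction of zeros).
    """
    rng = random.Random(seed)
    std = w_norm * sigma_p
    samples = [max(0.0, rng.gauss(0.0, std)) for _ in range(num_samples)]
    empirical_mean = sum(samples) / num_samples
    zero_fraction = sum(1 for s in samples if s == 0.0) / num_samples
    return empirical_mean, zero_fraction


if __name__ == "__main__":
    mean, zeros = simulate_relu_of_inner_product(w_norm=2.0, sigma_p=0.5, num_samples=100_000)
    print(f"empirical mean = {mean:.4f}, closed form = {rectified_gaussian_mean(1.0):.4f}")
    print(f"fraction of zeros = {zeros:.4f}")
```

About half of the samples are clipped to zero, which is the point mass at the origin that distinguishes $\mathcal{N}^{R}$ from the underlying Gaussian.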

Then:

\begin{align*}
P(\Omega_{C}) &= P\left(\left|f\left(\mathbf{W}^{(t)},\mathbf{x}\right)\right|<\left|f\left(\mathbf{W}^{(t)},\mathbf{x}^{\prime}\right)\right|\right) \tag{31}\\
&\geq P\left(\sum_{l}\Big(\sum_{r}\lvert g_{r}(\mathbf{z}_{l})\rvert\Big)<\sum_{r}\big(\Theta(\gamma_{y^{\prime},r,1})-\Theta(\gamma_{y,r,2})\big)\right)\\
&\geq P\left(m\cdot\max_{j,r,l}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\right\rangle\right|\right\}<m\Big(\Theta\big(\underset{r}{\mathbb{E}}(\gamma_{y^{\prime},r,1})\big)-\Theta\big(\underset{r}{\mathbb{E}}(\gamma_{y,r,2})\big)\Big)\right)\\
&= P\Big(\underbrace{\max_{j,r,l}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\right\rangle\right|\right\}<\Theta\big(\underset{r}{\mathbb{E}}(\gamma_{y^{\prime},r,1})-\underset{r}{\mathbb{E}}(\gamma_{y,r,2})\big)}_{\Omega_{\gamma}}\Big).
\end{align*}

The second inequality is by the triangle inequality and (25) in Lemma G.15; the third inequality is by Lemma G.14.
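The event $\Omega_{\gamma}$ requires the largest noise inner product to stay below a gap between feature coefficients. One standard way to control such an event, given here as an illustrative assumption rather than as the argument used in this proof, combines the Gaussian tail bound $P(|X|\geq u)\leq 2\exp\left(-u^{2}/(2s^{2})\right)$ for $X\sim\mathcal{N}(0,s^{2})$ with a union bound over all $(j,r,l)$. The Python sketch below computes the resulting upper bound on the failure probability; the function name and arguments are hypothetical.

```python
import math


def max_abs_gaussian_tail_bound(num_terms: int, std: float, threshold: float) -> float:
    """Upper bound on P(max_k |X_k| >= threshold) for centered Gaussians X_k.

    Each X_k is assumed to have standard deviation at most `std`; the X_k need
    not be independent, since only the union bound is used:
        P(max_k |X_k| >= u) <= 2 * K * exp(-u^2 / (2 * std^2)).
    The result is clipped to 1 because it bounds a probability.
    """
    if num_terms <= 0 or std <= 0.0 or threshold < 0.0:
        raise ValueError("need num_terms > 0, std > 0 and threshold >= 0")
    bound = 2.0 * num_terms * math.exp(-(threshold ** 2) / (2.0 * std ** 2))
    return min(1.0, bound)


if __name__ == "__main__":
    # Hypothetical numbers: 4 inner products, each with standard deviation
    # at most 1, compared against a coefficient gap of 3.
    print(max_abs_gaussian_tail_bound(num_terms=4, std=1.0, threshold=3.0))
```

In the setting of (31), `std` would play the role of the largest $\|\mathbf{w}_{j,r}^{(t)}\|_{2}\sigma_{p}$ and `threshold` the coefficient gap, so a larger gap relative to the weight norm makes $\Omega_{\gamma}$ more likely.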

For $\Omega_{D}$, denote by $U_{0}^{l}=\{\mathbf{x}\in\mathcal{D}_{0}\mid\mathbf{x}_{\text{signal part}}=\bm{\mu}_{l}\}$ the set of indices of $\mathcal{D}_{0}$ where the data's feature patch is $\bm{\mu}_{l}$. We then take a look at the $r$-th row of the Feature Distance $\mathbf{Z}(\mathbf{x},t)$, which we denote by $\mathbf{Z}_{r}(\mathbf{x},t)$:

$$
\begin{aligned}
\mathbf{Z}_r(\mathbf{x},t) &= \sum_j \left( \sigma\left(\left\langle \mathbf{w}_{j,r}, y\cdot\bm{\mu}_2 \right\rangle\right) + \sigma\left(\left\langle \mathbf{w}_{j,r}, \mathbf{z}_r \right\rangle\right) \right) \\
&= \Theta\left(\gamma_{y,r,2}\right) + g_r(\mathbf{z}_2)
\end{aligned}
\tag{32}
$$

$$
\begin{aligned}
\sum_i \frac{\mathbf{Z}_r(\mathbf{x}^{(i)},t)}{n_0} &= \sum_{i,j} \frac{\sigma\left(\left\langle \mathbf{w}_{j,r}, y_i\cdot\bm{\mu}^{(i)} \right\rangle\right) + \sigma\left(\left\langle \mathbf{w}_{j,r}, \bm{\xi}_i \right\rangle\right)}{n_0} \\
&= \frac{\left[ \sum_l \tau_l \cdot n_0 \cdot \underset{i_l \in U_0^l}{\mathbb{E}} \Theta\left(\gamma_{y_{i_l},r,l}\right) + \sum_i \sum_j \Theta\left(\bar{\rho}_{j,r,i}\right) \right]}{n_0}
\end{aligned}
\tag{33}
$$

Subtracting (33) from (32), we have:

$$
\mathbf{Z}_r(\mathbf{x},t) - \sum_i \frac{\mathbf{Z}_r(\mathbf{x}^{(i)},t)}{n_0} = \Theta(\gamma_{y,r,2}) + g_r(\mathbf{z}_2) - \sum_i \frac{\mathbf{Z}_r(\mathbf{x}^{(i)},t)}{n_0}
\tag{34}
$$

Now we can estimate $D\left(\mathbf{W}^{(t)}, \mathbf{x}, p \mid \mathcal{D}_{n_0}\right)$:

$$
\begin{aligned}
D\left(\mathbf{W}^{(t)}, \mathbf{x}, p \mid \mathcal{D}_{n_0}\right) &= \left\| \mathbf{Z}(\mathbf{x},t) - \sum_{i=1}^{n_0} \frac{\mathbf{Z}(\mathbf{x}^{(i)},t)}{n_0} \right\|_p \\
&= \left( \sum_r \left| \mathbf{Z}_r(\mathbf{x},t) - \sum_i \frac{\mathbf{Z}_r(\mathbf{x}^{(i)},t)}{n_0} \right|^p \right)^{\frac{1}{p}} \\
&= \left( \sum_r \left| \Theta(\gamma_{y,r,2}) + g_r(\mathbf{z}_2) - \sum_i \frac{\mathbf{Z}_r(\mathbf{x}^{(i)},t)}{n_0} \right|^p \right)^{\frac{1}{p}}
\end{aligned}
\tag{35}
$$

Similarly, $D\left(\mathbf{W}^{(t)}, \mathbf{x}^{\prime}, p \mid \mathcal{D}_{n_0}\right)$ can be written as:

$$
D\left(\mathbf{W}^{(t)}, \mathbf{x}^{\prime}, p \mid \mathcal{D}_{n_0}\right) = \left( \sum_r \left| \Theta(\gamma_{y,r,1}) + g_r(\mathbf{z}_1) - \sum_i \frac{\mathbf{Z}_r(\mathbf{x}^{(i)},t)}{n_0} \right|^p \right)^{\frac{1}{p}}
\tag{36}
$$
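To make the quantity in (35) and (36) concrete, the following is a minimal Python sketch of the diversity score $D\left(\mathbf{W}^{(t)}, \mathbf{x}, p \mid \mathcal{D}_{n_0}\right)$: it computes one feature value per filter and takes the $\ell_p$ distance to the labeled-set average. The ReLU activation, the dictionary layout of the weights, the `(signal, noise, y)` sample tuple, and the names `feature_row` and `diversity_score` are illustrative assumptions, not part of the analysis.

```python
def relu(v):
    """Assumed activation; the sketch works with any fixed scalar activation."""
    return max(v, 0.0)


def dot(a, b):
    """Plain inner product of two equal-length lists."""
    return sum(ai * bi for ai, bi in zip(a, b))


def feature_row(W, signal, noise, y, act=relu):
    """Return [Z_r(x, t) for r in range(m)].

    Z_r = sum over j of act(<w_{j,r}, y * signal>) + act(<w_{j,r}, noise>).
    W maps j in {+1, -1} to a list of m filter vectors (hypothetical layout).
    """
    m = len(W[1])
    ys = [y * s for s in signal]
    return [
        sum(act(dot(W[j][r], ys)) + act(dot(W[j][r], noise)) for j in (1, -1))
        for r in range(m)
    ]


def diversity_score(W, x, labeled, p=2.0, act=relu):
    """l_p distance (finite p) between Z(x) and the mean of Z over the labeled set."""
    z = feature_row(W, *x, act=act)
    n0 = len(labeled)
    rows = [feature_row(W, *xi, act=act) for xi in labeled]
    mean = [sum(row[r] for row in rows) / n0 for r in range(len(z))]
    return sum(abs(zr - mr) ** p for zr, mr in zip(z, mean)) ** (1.0 / p)


if __name__ == "__main__":
    # Toy data: one filter per sign, two-dimensional patches, no noise.
    W = {1: [[1.0, 0.0]], -1: [[0.0, 1.0]]}
    labeled = [([1.0, 0.0], [0.0, 0.0], 1)]
    candidate = ([2.0, 0.0], [0.0, 0.0], 1)
    print(diversity_score(W, candidate, labeled, p=2.0))
```

In this toy setup a candidate whose signal is further from what the labeled set already covers receives a larger score, which mirrors the comparison between $\mathbf{x}$ and $\mathbf{x}^{\prime}$ carried out next.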

To compare $D\left(\mathbf{W}^{(t)}, \mathbf{x}, p \mid \mathcal{D}_{n_0}\right)$ and $D\left(\mathbf{W}^{(t)}, \mathbf{x}^{\prime}, p \mid \mathcal{D}_{n_0}\right)$, we first observe that, in the $r$-th filter, both expressions contain the term

$$
-\sum_i \frac{\mathbf{Z}_r(\mathbf{x}^{(i)},t)}{n_0} = -\sum_l \tau_l \cdot \Theta\left(\underset{i_l \in U_0^l}{\mathbb{E}}\left(\gamma_{y_{i_l},r,l}\right)\right) - n_0^{-1} \sum_i \sum_j \Theta\left(\bar{\rho}_{j,r,i}\right).
$$

By Condition 3.1, we see that $\sigma_p^2 d / (n_0 \|\bm{\mu}_1\|_2^2) = \Omega(\log(T^*))$. Since $T^*$ is the substantially large maximum number of admissible iterations, combining this with (25), (33) and (30) shows that the order of $n_0^{-1} \sum_{i,j} \sigma\left(\left\langle \mathbf{w}_{j,r}, \bm{\xi}_i \right\rangle\right) = n_0^{-1} \sum_i \sum_j \Theta\left(\bar{\rho}_{j,r,i}\right)$ in $\sum_i \frac{\mathbf{Z}_r(\mathbf{x}^{(i)},t)}{n_0}$ dominates $n_0^{-1} \sum_{i,j} \sigma\left(\left\langle \mathbf{w}_{j,r}, y_i \cdot \bm{\mu}^{(i)} \right\rangle\right) = \sum_l \tau_l \cdot \Theta\left(\underset{i_l \in U_0^l}{\mathbb{E}}\left(\gamma_{y_{i_l},r,l}\right)\right)$, $\Theta(\gamma_{y,r,1})$ and $g_r(\mathbf{z}_1)$.
As $\sum_i \frac{\mathbf{Z}_r(\mathbf{x}^{(i)},t)}{n_0}$ is shared by both $D\left(\mathbf{W}^{(t)}, \mathbf{x}, p \mid \mathcal{D}_{n_0}\right)$ and $D\left(\mathbf{W}^{(t)}, \mathbf{x}^{\prime}, p \mid \mathcal{D}_{n_0}\right)$ in the $r$-th filter, a sufficient event for $D\left(\mathbf{W}^{(t)}, \mathbf{x}, p \mid \mathcal{D}_{n_0}\right) > D\left(\mathbf{W}^{(t)}, \mathbf{x}^{\prime}, p \mid \mathcal{D}_{n_0}\right)$ is that, for all $r \in [m]$, we have

$$
\left| \sum_l \tau_l \cdot \Theta\left(\underset{i_l \in U_0^l}{\mathbb{E}}\left(\gamma_{y_{i_l},r,l}\right)\right) - \Theta(\gamma_{y,r,2}) - g_r(\mathbf{z}_2) \right| > \left| \max\left\{ \sum_l \tau_l \cdot \Theta\left(\underset{i_l \in U_0^l}{\mathbb{E}}\left(\gamma_{y_{i_l},r,l}\right)\right) - \Theta(\gamma_{y,r,1}) - g_r(\mathbf{z}_1),\ 0 \right\} \right|.
$$

Using these results, we can now estimate the probability of the event $\Omega_D$:

$$
\begin{aligned}
P(\Omega_D) &= P\left( D\left(\mathbf{W}^{(t)}, \mathbf{x}, p \mid \mathcal{D}_{n_0}\right) > D\left(\mathbf{W}^{(t)}, \mathbf{x}^{\prime}, p \mid \mathcal{D}_{n_0}\right) \right) \\
&\geq P\Bigg( m^{\frac{1}{p}} \sum_l \Big( \max_r \left| g_r(\mathbf{z}_l) \right| \Big) < m^{\frac{1}{p}} \bigg( \Big| \Theta\big(\underset{r}{\mathbb{E}}(\gamma_{y,r,2})\big) - \sum_l \tau_l \cdot \Theta\big(\underset{i_l \in U_0^l,\, r}{\mathbb{E}}(\gamma_{y_{i_l},r,l})\big) \Big| \\
&\qquad\qquad - \Big| \Theta\big(\underset{r}{\mathbb{E}}(\gamma_{y,r,1})\big) - \sum_l \tau_l \cdot \Theta\big(\underset{i_l \in U_0^l,\, r}{\mathbb{E}}(\gamma_{y_{i_l},r,l})\big) \Big| \bigg) \Bigg) \\
&\geq P\left( m^{\frac{1}{p}} \max_{j,r,l} \left\{ \left| \left\langle \mathbf{w}_{j,r}^{(t)}, \mathbf{z}_l \right\rangle \right| \right\} < m^{\frac{1}{p}} \left( (\tau_1 - \tau_2)\, \Theta\big(\underset{j,r}{\mathbb{E}}(\gamma_{j,r,1})\big) - (\tau_1 - \tau_2)\, \Theta\big(\underset{j,r}{\mathbb{E}}(\gamma_{j,r,2})\big) \right) \right) \\
&= P\left( m^{\frac{1}{p}} \max_{j,r,l} \left\{ \left| \left\langle \mathbf{w}_{j,r}^{(t)}, \mathbf{z}_l \right\rangle \right| \right\} < m^{\frac{1}{p}}\, \Theta\left( \frac{\tau_1 (\tau_1 - \tau_2) \|\bm{\mu}_1\|_2^2 - \tau_2 (\tau_1 - \tau_2) \|\bm{\mu}_2\|_2^2}{\sigma_p^2 d / n_0} \right) \right) \\
&= P\left( m^{\frac{1}{p}} \max_{j,r,l} \left\{ \left| \left\langle \mathbf{w}_{j,r}^{(t)}, \mathbf{z}_l \right\rangle \right| \right\} < m^{\frac{1}{p}}\, \Theta\left( \underset{r}{\mathbb{E}}(\gamma_{y^{\prime},r,1}) - \underset{r}{\mathbb{E}}(\gamma_{y,r,2}) \right) \right) \\
&= P\Big( \underbrace{ \max_{j,r,l} \left\{ \left| \left\langle \mathbf{w}_{j,r}^{(t)}, \mathbf{z}_l \right\rangle \right| \right\} < \Theta\left( \underset{r}{\mathbb{E}}(\gamma_{y^{\prime},r,1}) - \underset{r}{\mathbb{E}}(\gamma_{y,r,2}) \right) }_{\Omega_\gamma} \Big),
\end{aligned}
\tag{37}
$$

where the first inequality is by Lemma G.14, the triangle inequality, (25), (35) and (36); the fourth equality is by (30). It is easy to see that if $p = \infty$, the third equality would be zero; our condition $p < \infty$ avoids this case. We now take a look at the event $\Omega_\gamma$:

$$
\begin{aligned}
P(\Omega_\gamma) &= P\left( \max_{j,r,l} \left\{ \left| \left\langle \mathbf{w}_{j,r}^{(t)}, \mathbf{z}_l \right\rangle \right| \right\} < \Theta\left( \underset{r}{\mathbb{E}}(\gamma_{y^{\prime},r,1}) - \underset{r}{\mathbb{E}}(\gamma_{y,r,2}) \right) \right) \\
&= P\left( \max_{j,r,l} \left\{ \left| \left\langle \mathbf{w}_{j,r}^{(t)}, \mathbf{z}_l \right\rangle \right| \right\} < \Theta\left( \frac{\left[ \tau_1 \|\bm{\mu}_1\|_2^2 - \tau_2 \|\bm{\mu}_2\|_2^2 \right]}{\sigma_p^2 d / n_0} \right) \right) \\
&\geq P\Bigg( \bigcup_{j,r,l} \underbrace{ \left\{ \left| \left\langle \mathbf{w}_{j,r}^{(t)}, \mathbf{z}_l \right\rangle - 0 \right| < \Theta\left( \frac{\left[ \tau_1 \|\bm{\mu}_1\|_2^2 - \tau_2 \|\bm{\mu}_2\|_2^2 \right]}{\sigma_p^2 d / n_0} \right) \right\} }_{\hat{\Omega}_{j,r,l}} \Bigg) \\
&= \sum_{j,r,l} P(\hat{\Omega}_{j,r,l}),
\end{aligned}
\tag{38}
$$

where the second equality follows from the first inference statement of Lemma G.14, the third inequality from rewriting the event as a union of events, and the last equality from the Union Rule. Then, by the Gaussian tail bound, we have:

$$P(\hat{\Omega}_{j,r,l})\geq 1-2\exp\left\{-\Theta\left(\dfrac{\left[\tau_{1}\left\|\bm{\mu}_{1}\right\|_{2}^{2}-\tau_{2}\left\|\bm{\mu}_{2}\right\|_{2}^{2}\right]^{2}}{\sigma_{p}^{6}d^{2}/n_{0}^{2}\left\|\mathbf{w}_{j,r}^{(t)}\right\|_{2}^{2}}\right)\right\}$$
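As a numerical sanity check of the Gaussian tail bound invoked here, the following minimal sketch compares the exact probability $P(|X|<a)$ for $X\sim\mathcal{N}(0,s^{2})$ with the lower bound $1-2\exp(-a^{2}/(2s^{2}))$. The function names and the test values are illustrative assumptions, and the explicit constant $1/2$ in the exponent is the standard textbook form, which the $\Theta(\cdot)$ notation above absorbs.

```python
import math

def exact_prob_abs_below(a: float, s: float) -> float:
    """Exact P(|X| < a) for X ~ N(0, s^2), computed with the error function."""
    return math.erf(a / (s * math.sqrt(2.0)))

def tail_lower_bound(a: float, s: float) -> float:
    """Standard Gaussian tail lower bound 1 - 2 exp(-a^2 / (2 s^2))."""
    return 1.0 - 2.0 * math.exp(-a * a / (2.0 * s * s))

# Illustrative thresholds (assumed values, not taken from the paper).
for a in (0.5, 1.0, 2.0, 4.0):
    exact = exact_prob_abs_below(a, 1.0)
    bound = tail_lower_bound(a, 1.0)
    print(f"a={a}: exact={exact:.6f}, lower bound={bound:.6f}")
```

The bound is vacuous (negative) for small $a/s$ and becomes tight only once the threshold is large relative to the standard deviation, which is the regime the conditions on $\tau_{1}\|\bm{\mu}_{1}\|_{2}^{2}-\tau_{2}\|\bm{\mu}_{2}\|_{2}^{2}$ are meant to ensure.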

Finally, with conditions on $\|\bm{\mu}_{1}\|_{2}^{2}-\|\bm{\mu}_{2}\|_{2}^{2}$ in Proposition 3, Lemma 17, (25) in Lemma G.15 and union bound, we have the conclusion for event $\Omega$:

$$\begin{aligned}
\Rightarrow P(\Omega)\geq P(\Omega_{\gamma}) &\geqslant 1-8m\exp\left\{-\Theta\left(\frac{\left[\tau_{1}\left\|\bm{\mu}_{1}\right\|_{2}^{2}-\tau_{2}\left\|\bm{\mu}_{2}\right\|_{2}^{2}\right]^{2}}{\sigma_{p}^{4}d/n_{0}}\right)\right\}\\
&\geqslant 1-\delta^{\prime},
\end{aligned}\tag{39}$$

for all $p\in\left[1,\infty\right)$.

Remark G.17.

We can observe that the Uncertainty Order and Diversity Order of samples rely heavily on the model’s learning progress on them. By Lemma G.14, the learning progress on samples depends heavily on the feature strength $\|\bm{\mu}_{l}\|_{2}$ and the data proportion $\tau_{l}$. That is to say, in our case, the perplexing samples are the samples containing the weak feature $\bm{\mu}_{2}$. In the next section, we show that the number of those perplexing samples in the labeled set after querying determines the algorithm’s generalization ability.

From the proof above, we can deduce some important findings, which are summarized in the following lemmas.

The following lemma shows that Uncertainty Sampling and Diversity Sampling correspond to different comparisons on the model’s learning progress over samples in $\mathcal{P}$.

Lemma G.18.

(Restatement of Lemma 4.2) Under the same conditions as in Proposition 3, with the same notation as in Proposition G.16, there exist constants $c_{1},c_{2},c_{3},c_{4},c_{5},c_{6}>0$ such that

  • $\mathbf{x}\preceq_{C}^{(t)}\mathbf{x}^{\prime}$ has a sufficient event that

    $$\left\{c_{1}\underset{r}{\mathbb{E}}(\gamma_{y^{\prime},r,1})-c_{2}\underset{r}{\mathbb{E}}(\gamma_{y,r,2})>\max_{j,r,l}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\right\rangle\right|\right\}\right\},\tag{40}$$

    where the left-hand side of the inequality compares the learning progress on samples with different types of feature patch.

  • $\mathbf{x}\preceq_{D}^{(t)}\mathbf{x}^{\prime},\ \forall p\in[1,\infty)$ has a sufficient event that

    $$\left\{\left\lvert c_{3}\underset{r}{\mathbb{E}}(\gamma_{y,r,2})-c_{4}\sum_{l}\tau_{l}\cdot\underset{i_{l}\in U_{0}^{l},r}{\mathbb{E}}(\gamma_{y_{i_{l}},r,l})\right\rvert-\left\lvert c_{5}\underset{r}{\mathbb{E}}(\gamma_{y^{\prime},r,1})-c_{6}\sum_{l}\tau_{l}\cdot\underset{i_{l}\in U_{0}^{l},r}{\mathbb{E}}(\gamma_{y_{i_{l}},r,l})\right\rvert>\max_{j,r,l}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\right\rangle\right|\right\}\right\},\tag{41}$$

    where the left-hand side of the inequality compares, across the two samples, the disparity between the learning progress on the sample and that on the labeled training set.

Proof of Lemma G.18. The first bullet point can be easily derived from (31), while the second bullet point is readily apparent from (35), (36), and (37).

In the proof of Proposition G.16, we observe that for any $p\in[1,\infty)$ there exists a shared sufficient event for (40) and (41). This implies that it is also a shared sufficient event for the events $\Omega_{C}$ and $\Omega_{D}$, denoted as $\Omega_{\gamma}$:

$$\Omega_{\gamma}\mathrel{\mathop{:}}=\left\{\max_{j,r,l}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\right\rangle\right|\right\}<\Theta\left(\underset{r}{\mathbb{E}}(\gamma_{y^{\prime},r,1})-\underset{r}{\mathbb{E}}(\gamma_{y,r,2})\right)\right\}.$$

By the first inference statement of Lemma G.14, we have

$$\Omega_{\gamma}=\left\{\max_{j,r,l}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\right\rangle\right|\right\}<\Theta\left(\underset{j,r}{\mathbb{E}}(\gamma_{j,r,1})-\underset{j,r}{\mathbb{E}}(\gamma_{j,r,2})\right)\right\}.\tag{42}$$

Therefore, we can conclude that the significant difference in the model’s learning of the features $\bm{\mu}_{1}$ and $\bm{\mu}_{2}$ is what causes the sufficient event for both events $\Omega_{C}$ and $\Omega_{D}$. By (39), we have:

$$P(\Omega_{\gamma})\geq 1-8m\exp\left\{-\Theta\left(\underset{j,r}{\mathbb{E}}(\gamma_{j,r,1})-\underset{j,r}{\mathbb{E}}(\gamma_{j,r,2})\right)\right\}.\tag{43}$$

Based on Lemma G.14, we see that $\underset{j,r}{\mathbb{E}}(\gamma_{j,r,1})$ is significantly larger than $\underset{j,r}{\mathbb{E}}(\gamma_{j,r,2})$ under our conditions, which causes the sufficient event $\Omega_{\gamma}$.

Based on the above results, we can examine the overall ordering of the sampling pool $\mathcal{P}$.

Lemma G.19.

(Restatement of Lemma 4.4) Under Condition 3.1, when the results of Proposition 3.2 and Proposition G.16 hold at the initial stage and querying stage at a certain $t\leq T^{*}$, denoting $\mathbf{X}_{\mathcal{P}}^{1}\subsetneqq\mathcal{P}$ as the collection of all the data points with strong feature $\bm{\mu}_{1}$ in $\mathcal{P}$, and $\mathbf{X}_{\mathcal{P}}^{2}\subsetneqq\mathcal{P}$ as the collection of data points with weak feature $\bm{\mu}_{2}$, we have the conclusion that with probability more than $1-\Theta(\delta^{\prime})$, $\mathbf{X}_{\mathcal{P}}^{1}\prec^{(t)}\mathbf{X}_{\mathcal{P}}^{2}$ holds.

Proof of Lemma G.19. By Proposition G.16, for all $\mathbf{x}^{\prime}\in\mathbf{X}_{\mathcal{P}}^{1}$ and all $\mathbf{x}\in\mathbf{X}_{\mathcal{P}}^{2}$, $\mathbf{x}^{\prime}\prec^{(t)}\mathbf{x}$ holds with probability at least $1-\delta^{\prime}$. It is natural to treat the comparisons of all pairs from $\mathbf{X}_{\mathcal{P}}^{1}$ and $\mathbf{X}_{\mathcal{P}}^{2}$ as independent random events. Then, given a certain $\mathbf{x}^{\prime}\in\mathbf{X}_{\mathcal{P}}^{1}$, the chance that every $\mathbf{x}\in\mathbf{X}_{\mathcal{P}}^{2}$ satisfies $\mathbf{x}^{\prime}\prec^{(t)}\mathbf{x}$ is $\Theta((1-\delta^{\prime})^{|\mathbf{X}_{\mathcal{P}}^{2}|})$. Therefore, for all $\mathbf{x}^{\prime}\in\mathbf{X}_{\mathcal{P}}^{1}$ simultaneously, the chance is $\Theta((1-\delta^{\prime})^{|\mathbf{X}_{\mathcal{P}}^{2}|\cdot|\mathbf{X}_{\mathcal{P}}^{1}|})=\Theta((1-\delta^{\prime})^{p(1-p)\lvert\mathcal{P}\rvert^{2}})=1-\Theta(\delta^{\prime})$ as $\delta^{\prime}\ll 1$.
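The final step of this proof can be made explicit with Bernoulli's inequality and its second-order counterpart. The following is only a sketch, under the assumption that the number of compared pairs, written here as $N=|\mathbf{X}_{\mathcal{P}}^{1}|\cdot|\mathbf{X}_{\mathcal{P}}^{2}|$ (a symbol introduced for illustration), is treated as a constant that $\Theta(\cdot)$ absorbs:

```latex
1 - N\delta' \;\le\; (1-\delta')^{N} \;\le\; 1 - N\delta' + \binom{N}{2}\delta'^{2},
\qquad 0 \le \delta' \le 1,\quad N \ge 1 .
```

Hence, when $N\delta^{\prime}\ll 1$, the quadratic term is negligible and $(1-\delta^{\prime})^{N}=1-N\delta^{\prime}+O(N^{2}\delta^{\prime 2})$, which matches the stated $1-\Theta(\delta^{\prime})$ up to the factor $N$.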

Based on Lemma G.19 and (42), we directly obtain the following lemma, which shows that both NAL algorithms prioritize the poorly learned samples.

Lemma G.20.

(Restatement of Proposition 3) Under the same conditions in Proposition 3.2, the Uncertainty Order and Diversity Order of the samples $[(y\cdot\bm{\mu}_{l})^{T},\bm{\xi}^{T}]^{T}$ in the sampling pool $\mathcal{P}$ follow the order of $\underset{j,r}{\mathbb{E}}\gamma_{j,r,l}^{(t)}$.

G.5 Label Complexity-based Test Error Analysis

In this section, we suppose that the results in the previous sections all hold with high probability. With the results on the final scale of the coefficients as well as the ordering of the data in the sampling pool $\mathcal{P}$, we can now examine upper and lower bounds on the test error under distinct conditions before and after querying.

Lemma G.21.

(Partial restatement of Lemma 4.5) Under Condition 3.1, for a test set $\mathcal{D}^{*}\subseteq\mathcal{D}^{*}$ with occurrence probability $p^{*}$ of the $\bm{\mu}_{2}$-equipped data, then $\exists\ t=\widetilde{O}\left(\eta^{-1}\varepsilon^{-1}mn_{0}d^{-1}\sigma_{p}^{-2}\right)$, we have the following two situations before and after querying (i.e., $\forall s\in\{0,1\}$):

  • If $\forall l\in\{1,2\},\ n_{s,l}\geq\dfrac{C_{1}\sigma_{p}^{4}d}{\|\bm{\mu}_{l}\|_{2}^{4}}$ holds, we have the test error:

    $$L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)\leq(1-p^{*})\cdot\exp\left(\dfrac{-n_{s,1}\|\bm{\mu}_{1}\|_{2}^{4}}{C_{3}\sigma_{p}^{4}d}\right)+p^{*}\cdot\exp\left(\dfrac{-n_{s,2}\|\bm{\mu}_{2}\|_{2}^{4}}{C_{4}\sigma_{p}^{4}d}\right).\tag{44}$$

  • If $\exists l^{\prime}\in\{1,2\},\ n_{s,l^{\prime}}\leq\dfrac{C_{2}\sigma_{p}^{4}d}{\|\bm{\mu}_{l^{\prime}}\|_{2}^{4}}$ holds, where $C_{1}$ is from Condition 3.1, we have the test error

    $$L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)\geq 0.12\cdot p^{*}_{l^{\prime}}.\tag{45}$$

Here $p^{*}_{l^{\prime}}$ denotes the occurrence probability of feature $\bm{\mu}_{l^{\prime}}$, and $C_{1}$, $C_{2}$, $C_{3}$ and $C_{4}$ are some positive constants.
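To illustrate how the upper bound (44) reacts to the composition of the labeled set, the sketch below evaluates its right-hand side for two allocations of the same label budget. The function name `upper_bound_44`, the choice $C_{3}=C_{4}=1$, and all numerical values are assumptions made for illustration only; the lemma does not specify these constants.

```python
import math

def upper_bound_44(n1, n2, mu1_norm, mu2_norm, sigma_p, d, p_star,
                   C3=1.0, C4=1.0):
    """Right-hand side of (44). C3 and C4 are placeholder constants."""
    strong = (1.0 - p_star) * math.exp(-n1 * mu1_norm ** 4 / (C3 * sigma_p ** 4 * d))
    weak = p_star * math.exp(-n2 * mu2_norm ** 4 / (C4 * sigma_p ** 4 * d))
    return strong + weak

# Assumed toy setting: strong feature norm 4.0, weak feature norm 1.5.
params = dict(mu1_norm=4.0, mu2_norm=1.5, sigma_p=1.0, d=100, p_star=0.1)

# Same budget of 200 labels, split differently between the two feature types.
passive = upper_bound_44(n1=180, n2=20, **params)    # labels follow the data proportion
active = upper_bound_44(n1=100, n2=100, **params)    # more weak-feature samples queried

print(f"passive-style allocation: {passive:.3e}")
print(f"active-style allocation:  {active:.3e}")
```

In this toy setting the strong-feature term is negligible under both allocations, so the bound is driven by $n_{s,2}$, the number of labeled samples carrying the weak feature $\bm{\mu}_{2}$.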

Proof of Lemma G.21. Recalling the test error definition and considering the proportions of the different types of data in the test set $\mathcal{D}^{*}$, we have:

$$\begin{aligned}
L_{\mathcal{D}^{*}}^{0-1}(\mathbf{W}) &=\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}^{*}}[y\cdot f(\mathbf{W},\mathbf{x})<0]\\
&=(1-p^{*})\cdot\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}_{\bm{\mu}_{1}}^{*}}[y\cdot f(\mathbf{W},\mathbf{x})<0]+p^{*}\cdot\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}_{\bm{\mu}_{2}}^{*}}[y\cdot f(\mathbf{W},\mathbf{x})<0],
\end{aligned}\tag{46}$$

where $\mathcal{D}_{\bm{\mu}_{1}}^{*}$ and $\mathcal{D}_{\bm{\mu}_{2}}^{*}$ denote the collections of data points in $\mathcal{D}$ containing features $\bm{\mu}_{1}$ and $\bm{\mu}_{2}$, respectively.

First, we prove the first bullet point. We use techniques similar to the proofs of Theorem 1 in Chatterji and Long [2021], Lemma 3 in Frei et al. [2022], Theorem E.1 in Kou et al. [2023b] and Theorem 3.2 in Meng et al. [2023]. Denoting the feature patch in $\mathbf{x}$ as $\bm{\mu}_{l_{x}}$ ($l_{x}\in\{1,2\}$), we first take a look at the product

$$\begin{aligned}
y\cdot f\left(\mathbf{W}^{(t)},\mathbf{x}\right) &=\frac{1}{m}\sum_{j,r}y j\left[\sigma\left(\left\langle\mathbf{w}_{j,r}^{(t)},y\bm{\mu}_{l_{x}}\right\rangle\right)+\sigma\left(\left\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\right\rangle\right)\right]\\
&=\frac{1}{m}\sum_{r}\left[\sigma\left(\left\langle\mathbf{w}_{y,r}^{(t)},y\bm{\mu}_{l_{x}}\right\rangle\right)+\sigma\left(\left\langle\mathbf{w}_{y,r}^{(t)},\bm{\xi}\right\rangle\right)\right]-\frac{1}{m}\sum_{r}\left[\sigma\left(\left\langle\mathbf{w}_{-y,r}^{(t)},y\bm{\mu}_{l_{x}}\right\rangle\right)+\sigma\left(\left\langle\mathbf{w}_{-y,r}^{(t)},\bm{\xi}\right\rangle\right)\right]\\
&\leq\frac{1}{m}\left[\sum_{r}\sigma\left(\left\langle\mathbf{w}_{y,r}^{(t)},y\bm{\mu}_{l_{x}}\right\rangle\right)-\sum_{r}\sigma\left(\left\langle\mathbf{w}_{-y,r}^{(t)},\bm{\xi}\right\rangle\right)\right].
\end{aligned}\tag{47}$$

Denote $g(\bm{\xi})$ as $\sum_{r}\sigma\left(\left\langle\mathbf{w}_{-y,r}^{(t)},\bm{\xi}\right\rangle\right)$. Since $\left\langle\mathbf{w}_{-y,r}^{(t)},\bm{\xi}\right\rangle\sim\mathcal{N}\left(0,\left\|\mathbf{w}_{-y,r}^{(t)}\right\|_{2}^{2}\sigma_{p}^{2}\right)$, we can get

$$
\mathbb{E}g(\bm{\xi})=\sum_{r=1}^{m}\mathbb{E}\sigma\left(\left\langle\mathbf{w}_{-y,r}^{(t)},\bm{\xi}\right\rangle\right)=\sum_{r=1}^{m}\frac{\left\|\mathbf{w}_{-y,r}^{(t)}\right\|_{2}\sigma_{p}}{\sqrt{2\pi}}=\frac{\sigma_{p}}{\sqrt{2\pi}}\sum_{r=1}^{m}\left\|\mathbf{w}_{-y,r}^{(t)}\right\|_{2}.
\tag{48}
$$
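As a numerical sanity check of (48), the following minimal sketch compares the closed form with a Monte Carlo average. It assumes, for illustration only, isotropic noise $\bm{\xi}\sim\mathcal{N}(0,\sigma_p^2\mathbf{I})$ and ReLU activation; the weight vectors, `sigma_p`, and all function names are hypothetical and not taken from the paper.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def expected_g_closed_form(weights, sigma_p):
    """Right-hand side of (48): sigma_p / sqrt(2*pi) * sum_r ||w_r||_2."""
    total_norm = sum(math.sqrt(sum(c * c for c in w)) for w in weights)
    return sigma_p / math.sqrt(2.0 * math.pi) * total_norm


def expected_g_monte_carlo(weights, sigma_p, n_samples, rng):
    """Average of g(xi) = sum_r relu(<w_r, xi>) over xi ~ N(0, sigma_p^2 I)."""
    dim = len(weights[0])
    acc = 0.0
    for _ in range(n_samples):
        xi = [rng.gauss(0.0, sigma_p) for _ in range(dim)]
        acc += sum(relu(sum(wc * xc for wc, xc in zip(w, xi))) for w in weights)
    return acc / n_samples


# Hypothetical weights standing in for w_{-y,r}^{(t)}, r = 1..3.
rng = random.Random(0)
weights = [[0.3, -0.1, 0.2, 0.0], [0.0, 0.5, -0.4, 0.1], [-0.2, 0.2, 0.1, 0.6]]
sigma_p = 1.5
closed = expected_g_closed_form(weights, sigma_p)
estimate = expected_g_monte_carlo(weights, sigma_p, 100_000, rng)
print(f"closed form: {closed:.4f}, Monte Carlo: {estimate:.4f}")
```

The two printed values should agree up to sampling error, since each summand is the mean of a ReLU applied to a centered Gaussian with standard deviation $\|\mathbf{w}_r\|_2\sigma_p$ and linearity of expectation handles the sum even though the summands are correlated.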

Then we can obtain the following test error upper bound on $\mathcal{D}_{\bm{\mu}_{l_{x}}}^{*}$ by subtracting $\mathbb{E}g(\bm{\xi})$ from the left side and its value $\dfrac{\sigma_{p}}{\sqrt{2\pi}}\sum_{r=1}^{m}\left\|\mathbf{w}_{-y,r}^{(t)}\right\|_{2}$ from the right side of the inequality inside the probability:

$$
\begin{aligned}
\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}_{\bm{\mu}_{l_{x}}}^{*}}\left(yf\left(\mathbf{W}^{(t)},\mathbf{x}\right)\leq 0\right)
&\leq\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}}\left(\sum_{r}\sigma\left(\left\langle\mathbf{w}_{-y,r}^{(t)},\bm{\xi}\right\rangle\right)\geq\sum_{r}\sigma\left(\left\langle\mathbf{w}_{y,r}^{(t)},y\bm{\mu}_{l_{x}}\right\rangle\right)\right)\\
&=\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}}\left(g(\bm{\xi})-\mathbb{E}g(\bm{\xi})\geq\sum_{r}\sigma\left(\left\langle\mathbf{w}_{y,r}^{(t)},y\bm{\mu}_{l_{x}}\right\rangle\right)-\frac{\sigma_{p}}{\sqrt{2\pi}}\sum_{r=1}^{m}\left\|\mathbf{w}_{-y,r}^{(t)}\right\|_{2}\right).
\end{aligned}
\tag{49}
$$

Using the results in Lemma G.14 and Lemma G.15, we compare the two terms on the right side of the inequality:

$$
\frac{\sum_{r}\sigma\left(\left\langle\mathbf{w}_{y,r}^{(t)},y\bm{\mu}_{l_{x}}\right\rangle\right)}{\sigma_{p}\sum_{r=1}^{m}\left\|\mathbf{w}_{-y,r}^{(t)}\right\|_{2}}\geq\frac{\Theta\left(\sum_{r}\gamma_{y,r,l_{x}}^{(t)}\right)}{\Theta\left(d^{-1/2}n_{s}^{-1/2}\right)\cdot\sum_{r,i}\bar{\rho}_{-y,r,i}^{(t)}}=\Theta\left(\tau_{l_{x}}d^{1/2}n_{s}^{1/2}\operatorname{SNR}_{l_{x}}^{2}\right)=\Theta\left(\tau_{l_{x}}n_{s}^{1/2}\|\bm{\mu}_{l_{x}}\|_{2}^{2}/(\sigma_{p}^{2}d^{1/2})\right),
\tag{50}
$$

where $\tau_{l_{x}}$ denotes the proportion of feature $\bm{\mu}_{l_{x}}$ in the current training data set (before or after querying). Note that the first bullet assumes $\forall l\in\{1,2\},\ n_{s,l}\geq\dfrac{C_{1}\sigma_{p}^{4}d}{\|\bm{\mu}_{l}\|_{2}^{4}}$, which means $n_{1,l_{x}}\|\bm{\mu}_{1}\|_{2}^{4}\geq 2C_{1}\sigma_{p}^{4}d,\ \forall l_{x}\in\{1,2\}$. Since $C_{1}$ is a sufficiently large constant, it directly follows that

$$
\sum_{r}\sigma\left(\left\langle\mathbf{w}_{y,r}^{(t)},y\bm{\mu}_{l_{x}}\right\rangle\right)-\frac{\sigma_{p}}{\sqrt{2\pi}}\sum_{r=1}^{m}\left\|\mathbf{w}_{-y,r}^{(t)}\right\|_{2}>0.
$$

By Theorem 5.2.2 in Vershynin [2018], we know that for any $x\geq 0$, the following holds

$$
P(g(\bm{\xi})-\mathbb{E}g(\bm{\xi})\geq x)\leq\exp\left(-\frac{cx^{2}}{\sigma_{p}^{2}\|g\|_{\text{Lip}}^{2}}\right),
\tag{51}
$$

where $c$ is a constant. To calculate the Lipschitz norm, we have

$$
\begin{aligned}
\left|g(\bm{\xi})-g\left(\bm{\xi}^{\prime}\right)\right|
&=\left|\sum_{r=1}^{m}\sigma\left(\left\langle\mathbf{w}_{-y,r}^{(t)},\bm{\xi}\right\rangle\right)-\sum_{r=1}^{m}\sigma\left(\left\langle\mathbf{w}_{-y,r}^{(t)},\bm{\xi}^{\prime}\right\rangle\right)\right|\\
&\leq\sum_{r=1}^{m}\left|\sigma\left(\left\langle\mathbf{w}_{-y,r}^{(t)},\bm{\xi}\right\rangle\right)-\sigma\left(\left\langle\mathbf{w}_{-y,r}^{(t)},\bm{\xi}^{\prime}\right\rangle\right)\right|\\
&\leq\sum_{r=1}^{m}\left|\left\langle\mathbf{w}_{-y,r}^{(t)},\bm{\xi}-\bm{\xi}^{\prime}\right\rangle\right|\\
&\leq\sum_{r=1}^{m}\left\|\mathbf{w}_{-y,r}^{(t)}\right\|_{2}\cdot\left\|\bm{\xi}-\bm{\xi}^{\prime}\right\|_{2},
\end{aligned}
$$

where the first inequality is by the triangle inequality; the second inequality is by the 1-Lipschitz property of ReLU; the last inequality is by the Cauchy-Schwarz inequality. Therefore, we have

$$
\|g\|_{\text{Lip}}\leq\sum_{r=1}^{m}\left\|\mathbf{w}_{-y,r}^{(t)}\right\|_{2}.
\tag{52}
$$
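The bound (52) can be probed empirically. The sketch below, under illustrative assumptions (random hypothetical weights, standard Gaussian test points, ReLU activation), measures the largest observed slope $|g(\bm{\xi})-g(\bm{\xi}^{\prime})|/\|\bm{\xi}-\bm{\xi}^{\prime}\|_2$ over random pairs and compares it with $\sum_r\|\mathbf{w}_r\|_2$. All names are ours and not part of the paper.

```python
import math
import random


def relu(z):
    return max(0.0, z)


def l2_norm(v):
    return math.sqrt(sum(c * c for c in v))


def g(weights, xi):
    """g(xi) = sum_r relu(<w_r, xi>)."""
    return sum(relu(sum(wc * xc for wc, xc in zip(w, xi))) for w in weights)


def lipschitz_upper_bound(weights):
    """Right-hand side of (52): sum_r ||w_r||_2."""
    return sum(l2_norm(w) for w in weights)


def largest_observed_slope(weights, n_pairs, rng):
    """Largest |g(xi) - g(xi')| / ||xi - xi'||_2 over random pairs."""
    dim = len(weights[0])
    worst = 0.0
    for _ in range(n_pairs):
        xi = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        xi_prime = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        dist = l2_norm([a - b for a, b in zip(xi, xi_prime)])
        if dist > 0.0:
            slope = abs(g(weights, xi) - g(weights, xi_prime)) / dist
            worst = max(worst, slope)
    return worst


# Hypothetical weights: 5 neurons in dimension 8.
rng = random.Random(1)
weights = [[rng.gauss(0.0, 0.1) for _ in range(8)] for _ in range(5)]
bound = lipschitz_upper_bound(weights)
worst = largest_observed_slope(weights, 20_000, rng)
print(f"largest observed slope: {worst:.4f}, bound from (52): {bound:.4f}")
```

The observed slope never exceeds the bound, and it is typically strictly smaller: (52) sums the norms of all neurons, which is loose when the weight vectors are not aligned.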

Applying (51) and (52) to (49), we have

$$
\begin{aligned}
\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}_{\bm{\mu}_{l_{x}}}^{*}}\left(yf\left(\mathbf{W}^{(t)},\mathbf{x}\right)\leq 0\right)
&\leq\exp\left[-\frac{c\left(\sum_{r}\sigma\left(\left\langle\mathbf{w}_{y,r}^{(t)},y\bm{\mu}_{l_{x}}\right\rangle\right)-\left(\dfrac{\sigma_{p}}{\sqrt{2\pi}}\right)\sum_{r=1}^{m}\left\|\mathbf{w}_{-y,r}^{(t)}\right\|_{2}\right)^{2}}{\sigma_{p}^{2}\left(\sum_{r=1}^{m}\left\|\mathbf{w}_{-y,r}^{(t)}\right\|_{2}\right)^{2}}\right]\\
&=\exp\left[-c\left(\frac{\sum_{r}\sigma\left(\left\langle\mathbf{w}_{y,r}^{(t)},y\bm{\mu}_{l_{x}}\right\rangle\right)}{\sigma_{p}\sum_{r=1}^{m}\left\|\mathbf{w}_{-y,r}^{(t)}\right\|_{2}}-\frac{1}{\sqrt{2\pi}}\right)^{2}\right]\\
&\leq\exp(c/2\pi)\exp\left(-0.5c\left(\frac{\sum_{r}\sigma\left(\left\langle\mathbf{w}_{y,r}^{(t)},y\bm{\mu}_{l_{x}}\right\rangle\right)}{\sigma_{p}\sum_{r=1}^{m}\left\|\mathbf{w}_{-y,r}^{(t)}\right\|_{2}}\right)^{2}\right),
\end{aligned}
\tag{53}
$$

where the last inequality is by $(s-t)^{2}\geq s^{2}/2-t^{2},\ \forall s,t\geq 0$, which is equivalent to $(s/\sqrt{2}-\sqrt{2}t)^{2}\geq 0$. Then, by (50) and (53), we have

$$
\begin{aligned}
\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}_{\bm{\mu}_{l_{x}}}^{*}}\left(yf\left(\mathbf{W}^{(t)},\mathbf{x}\right)\leq 0\right)
&\leq\exp(c/2\pi)\exp\left(-0.5c\left(\frac{\sum_{r}\sigma\left(\left\langle\mathbf{w}_{y,r}^{(t)},y\bm{\mu}_{l_{x}}\right\rangle\right)}{\sigma_{p}\sum_{r=1}^{m}\left\|\mathbf{w}_{-y,r}^{(t)}\right\|_{2}}\right)^{2}\right)\\
&=\exp\left(\frac{c}{2\pi}-\frac{\tau_{l_{x}}n_{s,l_{x}}\|\bm{\mu}_{l_{x}}\|_{2}^{4}}{C\sigma_{p}^{4}d}\right)\\
&=\exp\left(\frac{c}{2\pi}-\frac{n_{s,l_{x}}\|\bm{\mu}_{l_{x}}\|_{2}^{4}}{C_{l_{x}}\sigma_{p}^{4}d}\right)\\
&\leq\exp\left(-\frac{n_{s,l_{x}}\|\bm{\mu}_{l_{x}}\|_{2}^{4}}{2C_{l_{x}}\sigma_{p}^{4}d}\right),
\end{aligned}
\tag{54}
$$

where $C_{l_{x}}=C/\tau_{l_{x}}=O(1)$; the last inequality holds if we choose $C_{1}\geq cC_{l_{x}}/\pi$, for any $l_{x}\in\{1,2\}$. If we choose $C_{3}$ as $2C_{l_{1}}$ and $C_{4}$ as $2C_{l_{2}}$, by (46) and (54) we have

$$
L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)\leq(1-p^{*})\cdot\exp\left(\dfrac{-n_{s,1}\|\bm{\mu}_{1}\|_{2}^{4}}{C_{3}\sigma_{p}^{4}d}\right)+p^{*}\cdot\exp\left(\dfrac{-n_{s,2}\|\bm{\mu}_{2}\|_{2}^{4}}{C_{4}\sigma_{p}^{4}d}\right).
$$
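To see how the two exponential terms of this bound trade off, the following sketch evaluates its right-hand side for given inputs. The numerical values, including the choice $C_3=C_4=1$, are purely hypothetical (the proof only specifies these as constants), and the function name is ours.

```python
import math


def zero_one_error_bound(p_star, n_s1, n_s2, mu1_norm, mu2_norm,
                         sigma_p, d, c3, c4):
    """Right-hand side of the test error bound derived from (46) and (54)."""
    term_mu1 = (1.0 - p_star) * math.exp(
        -n_s1 * mu1_norm ** 4 / (c3 * sigma_p ** 4 * d))
    term_mu2 = p_star * math.exp(
        -n_s2 * mu2_norm ** 4 / (c4 * sigma_p ** 4 * d))
    return term_mu1 + term_mu2


# Hypothetical setting: many samples carry mu_1, few carry mu_2.
few_mu2 = zero_one_error_bound(p_star=0.1, n_s1=900, n_s2=100,
                               mu1_norm=2.0, mu2_norm=1.0,
                               sigma_p=1.0, d=100, c3=1.0, c4=1.0)
# Same setting, but with more labeled samples carrying mu_2.
more_mu2 = zero_one_error_bound(p_star=0.1, n_s1=900, n_s2=400,
                                mu1_norm=2.0, mu2_norm=1.0,
                                sigma_p=1.0, d=100, c3=1.0, c4=1.0)
print(f"bound with n_s2=100: {few_mu2:.6f}")
print(f"bound with n_s2=400: {more_mu2:.6f}")
```

In this toy setting the $\bm{\mu}_1$ term is already negligible, so the bound is dominated by the $\bm{\mu}_2$ term and shrinks only when $n_{s,2}$ grows.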

Next, we prove the second bullet point. We use the pigeonhole-principle technique of Kou et al. [2023b] and Meng et al. [2023], which is based on the following two lemmas.

Lemma G.22.

For $t\in\left[T_{1},T^{*}\right]$, denote $g(\bm{\xi})=\sum_{j,r}\sigma\left(\left\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\right\rangle\right)$. There exists a fixed vector $\mathbf{v}_{l}$ with $\|\mathbf{v}_{l}\|_{2}\leq 0.02\sigma_{p}$ and a constant $C_{6}$ such that

$$\sum_{j^{\prime}\in\{\pm 1\}}\left[g\left(j^{\prime}\bm{\xi}+\mathbf{v}_{l}\right)-g\left(j^{\prime}\bm{\xi}\right)\right]\geq 4C_{6}\max_{j,l}\left\{\sum_{r}\gamma_{j,r,l}^{(t)}\right\},$$

for all $\bm{\xi}\in\mathbb{R}^{d}$.

Proof of Lemma G.22. See Lemma 5.8 in Kou et al. [2023b] or Theorem 3.2 in Meng et al. [2023] for a proof, where we utilize a large enough $C_{2}$ in the condition given in the second bullet point ($n_{s,l^{\prime}}\leq\dfrac{C_{2}\sigma_{p}^{4}d}{\|\bm{\mu}_{l^{\prime}}\|_{2}^{4}}$) to control the norm of $\mathbf{v}_{l}$.

Lemma G.23.

(Proposition 2.1 in Devroye et al. [2023]). The TV distance between $\mathcal{N}\left(0,\sigma_{p}^{2}\mathbf{I}_{d}\right)$ and $\mathcal{N}\left(\mathbf{v}_{l},\sigma_{p}^{2}\mathbf{I}_{d}\right)$ is smaller than $\|\mathbf{v}_{l}\|_{2}/(2\sigma_{p})$.

Proof of Lemma G.23. See Proposition 2.1 in Devroye et al. [2023] for a proof.
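As a quick sanity check of this bound, the sketch below compares it with the standard closed form for the TV distance between two Gaussians that share an isotropic covariance, $2\Phi\left(\|\mathbf{v}_l\|_2/(2\sigma_p)\right)-1$. The helper name `gaussian_shift_tv` and the value of `sigma_p` are assumptions made for illustration only.

```python
import math

def gaussian_shift_tv(shift_norm, sigma_p):
    """Exact TV distance between N(0, sigma_p^2 I) and N(v, sigma_p^2 I), ||v||_2 = shift_norm.

    Uses 2*Phi(||v|| / (2*sigma_p)) - 1 = erf(||v|| / (2*sqrt(2)*sigma_p)).
    """
    return math.erf(shift_norm / (2.0 * math.sqrt(2.0) * sigma_p))

sigma_p = 1.5                             # illustrative noise level (assumption)
shift_norm = 0.02 * sigma_p               # largest shift allowed by Lemma G.22
tv_exact = gaussian_shift_tv(shift_norm, sigma_p)
tv_bound = shift_norm / (2.0 * sigma_p)   # bound stated in Lemma G.23
```

Under these assumptions the exact distance sits below the linear bound, which in turn is below the value $0.02$ used later in the argument.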

We now turn to $L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)$. By (46), we have:

$$
\begin{aligned}
L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)
&=\tau^{*}_{1}\cdot\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}_{\bm{\mu}_{1}}^{*}}\left[y\cdot f(\mathbf{W},\mathbf{x})<0\right]+\tau^{*}_{2}\cdot\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}_{\bm{\mu}_{2}}^{*}}\left[y\cdot f(\mathbf{W},\mathbf{x})<0\right]\\
&\geq\tau^{*}_{l^{\prime}}\cdot\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}_{\bm{\mu}_{l^{\prime}}}^{*}}\left[y\cdot f(\mathbf{W},\mathbf{x})<0\right]\\
&=\tau^{*}_{l^{\prime}}\cdot\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}_{\bm{\mu}_{l^{\prime}}}^{*}}\Big(\sum_{r}\sigma\left(\left\langle\mathbf{w}_{-y,r}^{(t)},\bm{\xi}\right\rangle\right)-\sum_{r}\sigma\left(\left\langle\mathbf{w}_{y,r}^{(t)},\bm{\xi}\right\rangle\right)\\
&\phantom{=\tau^{*}_{l^{\prime}}\cdot\mathbb{P}\Big(}\geq\sum_{r}\sigma\left(\left\langle\mathbf{w}_{y,r}^{(t)},y\bm{\mu}_{l^{\prime}}\right\rangle\right)-\sum_{r}\sigma\left(\left\langle\mathbf{w}_{-y,r}^{(t)},y\bm{\mu}_{l^{\prime}}\right\rangle\right)\Big)\\
&\geq 0.5\,\tau^{*}_{l^{\prime}}\cdot\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}_{\bm{\mu}_{l^{\prime}}}^{*}}\left(\left|\sum_{r}\sigma\left(\left\langle\mathbf{w}_{1,r}^{(t)},\bm{\xi}\right\rangle\right)-\sum_{r}\sigma\left(\left\langle\mathbf{w}_{-1,r}^{(t)},\bm{\xi}\right\rangle\right)\right|\geq C_{6}\max\left\{\sum_{r}\gamma_{1,r,l^{\prime}}^{(t)},\sum_{r}\gamma_{-1,r,l^{\prime}}^{(t)}\right\}\right)\\
&=0.5\,\tau^{*}_{l^{\prime}}\cdot P(\Omega_{\bm{\xi}}),
\end{aligned}
\tag{55}
$$

where $\Omega_{\bm{\xi}}:=\left\{\bm{\xi}\;\middle|\;|g(\bm{\xi})|\geq C_{6}\max\left\{\sum_{r}\gamma_{1,r,l^{\prime}}^{(t)},\sum_{r}\gamma_{-1,r,l^{\prime}}^{(t)}\right\}\right\}$. The last inequality holds since, for any given $\bm{\xi}$ at which $\left|\sum_{r}\sigma\left(\left\langle\mathbf{w}_{1,r}^{(t)},\bm{\xi}\right\rangle\right)-\sum_{r}\sigma\left(\left\langle\mathbf{w}_{-1,r}^{(t)},\bm{\xi}\right\rangle\right)\right|$ is large enough, there always exists a corresponding label $y$ that yields a wrong prediction.

Next, we seek a lower bound of $P(\Omega_{\bm{\xi}})$. By Lemma G.22, we have $\sum_{j}[g(j\bm{\xi}+\mathbf{v}_{l})-g(j\bm{\xi})]\geq 4C_{6}\max_{j,l}\left\{\sum_{r}\gamma_{j,r,l}^{(t)}\right\}$. Then, by the pigeonhole principle, at least one of $\bm{\xi}$, $\bm{\xi}+\mathbf{v}_{l}$, $-\bm{\xi}$, $-\bm{\xi}+\mathbf{v}_{l}$ belongs to $\Omega_{\bm{\xi}}$. So we have proved that $\Omega_{\bm{\xi}}\cup(-\Omega_{\bm{\xi}})\cup(\Omega_{\bm{\xi}}-\{\mathbf{v}_{l}\})\cup(-\Omega_{\bm{\xi}}-\{\mathbf{v}_{l}\})=\mathbb{R}^{d}$.
Therefore, at least one of $P(\Omega_{\bm{\xi}})$, $P(-\Omega_{\bm{\xi}})$, $P(\Omega_{\bm{\xi}}-\{\mathbf{v}_{l}\})$, $P(-\Omega_{\bm{\xi}}-\{\mathbf{v}_{l}\})$ is at least 0.25. By the definition of TV distance, we have:

$$
\begin{aligned}
|P(\Omega_{\bm{\xi}})-P(\Omega_{\bm{\xi}}-\mathbf{v}_{l})|
&=\left|\mathbb{P}_{\bm{\xi}\sim\mathcal{N}\left(0,\sigma_{p}^{2}\mathbf{I}_{d}\right)}(\bm{\xi}\in\Omega_{\bm{\xi}})-\mathbb{P}_{\bm{\xi}\sim\mathcal{N}\left(\mathbf{v}_{l},\sigma_{p}^{2}\mathbf{I}_{d}\right)}(\bm{\xi}\in\Omega_{\bm{\xi}})\right|\\
&\leq\operatorname{TV}\left(\mathcal{N}\left(0,\sigma_{p}^{2}\mathbf{I}_{d}\right),\mathcal{N}\left(\mathbf{v}_{l},\sigma_{p}^{2}\mathbf{I}_{d}\right)\right)\\
&\leq\frac{\|\mathbf{v}_{l}\|_{2}}{2\sigma_{p}}\\
&\leq 0.02.
\end{aligned}
$$

Also, since $P(-\Omega_{\bm{\xi}})=P(\Omega_{\bm{\xi}})$, we have $4P(\Omega_{\bm{\xi}})\geq 1-2\cdot 0.02$, that is, $P(\Omega_{\bm{\xi}})\geq 0.24$. Thus $L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)\geq 0.5\,\tau^{*}_{l^{\prime}}\cdot 0.24=0.12\cdot\tau^{*}_{l^{\prime}}$. This completes the proof.
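The counting step behind the constant $0.24$ can be spelled out as follows. This is a sketch of the union-bound arithmetic, assuming the TV estimate above is applied to both $\Omega_{\bm{\xi}}$ and $-\Omega_{\bm{\xi}}$, together with the symmetry $P(-\Omega_{\bm{\xi}})=P(\Omega_{\bm{\xi}})$.

```latex
\begin{aligned}
1 &\le P(\Omega_{\bm{\xi}}) + P(-\Omega_{\bm{\xi}})
      + P(\Omega_{\bm{\xi}}-\{\mathbf{v}_l\}) + P(-\Omega_{\bm{\xi}}-\{\mathbf{v}_l\}) \\
  &\le 2P(\Omega_{\bm{\xi}}) + 2\bigl(P(\Omega_{\bm{\xi}}) + 0.02\bigr) \\
  &= 4P(\Omega_{\bm{\xi}}) + 2\cdot 0.02,
\end{aligned}
\qquad\Longrightarrow\qquad
P(\Omega_{\bm{\xi}}) \ge \frac{1-0.04}{4} = 0.24 .
```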

Based on Lemma G.21, our focus is to verify whether the NAL algorithms satisfy the condition stated in the first bullet point. On the other hand, it is highly likely that Random Sampling fulfills the condition stated in the second bullet point. The following proposition validates this intuition.

Proposition G.24.

When Lemma G.19 holds, and the sampling size of the algorithm satisfies $\dfrac{C_{1}\sigma_{p}^{4}d}{\|\bm{\mu}_{2}\|_{2}^{4}}-\dfrac{pn_{0}}{2}\leq n^{*}=\Theta(\widetilde{n}-n_{0})\leq\widetilde{n}-n_{0}$, we have the following:

  • The number of data with strong feature patch $n_{s,1}$ satisfies $n_{s,1}\geq\dfrac{C_{1}\sigma_{p}^{4}d}{\|\bm{\mu}_{1}\|_{2}^{4}},\ \forall s\in\{0,1\}$.

  • The number of data with weak feature patch $n_{s,2}$ before querying and after Random Sampling satisfies $n_{s,2}\leq\dfrac{C_{2}\sigma_{p}^{4}d}{\|\bm{\mu}_{2}\|_{2}^{4}},\ \forall s\in\{0,1\}$.

  • The total number of data with weak feature patch $n_{1,2}$ after Uncertainty Sampling and Diversity Sampling satisfies $n_{1,2}\geq\dfrac{C_{1}\sigma_{p}^{4}d}{\|\bm{\mu}_{2}\|_{2}^{4}}$.

For the sake of coherence, here $C_{1}$ and $C_{2}$ are some constants shared with Theorem 3.4 and Lemma 4.5.

Proof of Proposition G.24. By the conditions in Definition 1, we have $(1-\dfrac{3}{2}p)n_{0}\geq\dfrac{C_{1}\sigma_{p}^{4}d}{\|\bm{\mu}_{1}\|_{2}^{4}}$ for a large constant $C_{1}$. Then, by plugging in the results of $n_{p}$ for $n_{0}$ in Lemma 17, as well as the definition of $n_{s,l}$, we have

$$n_{1,1}\geq n_{0,1}\geq(1-\dfrac{3}{2}p)n_{0}\geq\dfrac{C_{1}\sigma_{p}^{4}d}{\|\bm{\mu}_{1}\|_{2}^{4}}.$$

For the third bullet, by Lemma 17, Lemma G.19 and the condition $n^{*}\geq\dfrac{C_{1}\sigma_{p}^{4}d}{\|\bm{\mu}_{2}\|_{2}^{4}}-\dfrac{pn_{0}}{2}$, we have:

$$n_{1,2}\geq\dfrac{pn_{0}}{2}+n^{*}\geq\dfrac{C_{1}\sigma_{p}^{4}d}{\|\bm{\mu}_{2}\|_{2}^{4}}.$$

Besides, by Lemma 17 and the condition $\widetilde{n}\leq\dfrac{2C_{2}\sigma_{p}^{4}d}{3p\|\bm{\mu}_{2}\|_{2}^{4}}$, the second bullet holds straightforwardly.

By Lemma G.21 and Proposition G.24, the results of Proposition 3.2 and Theorem 3.4 hold directly.

Lemma G.25.

(Restatement of Corollary 3.5) Under the same conditions as stated in Theorem 3.4, with a probability of at least $1-\Theta(\delta+\delta^{\prime})$, we observe distinct label complexities for a traditional 2-layer ReLU CNN and for NAL algorithms in achieving Bayes-optimal generalization ability:

  • For a fully trained neural model, the label complexity $n_{CNN}$ requires $\Omega(p^{-1}\sigma_{p}^{2}d\|\bm{\mu}_{2}\|_{2}^{-4})$.

  • For two NAL algorithms, the maximum label complexity $\widetilde{n}$ only requires $\Omega(\sigma_{p}^{2}d\|\bm{\mu}_{2}\|_{2}^{-4})$.

Proof of Lemma G.25. According to Lemma G.21, to adequately learn the signal $\bm{\mu}_{l}$ for any $l\in\{1,2\}$, one needs at least $\hat{C}_{1}\sigma_{p}^{4}d\|\bm{\mu}_{l}\|_{2}^{-4}$ labeled samples containing $\bm{\mu}_{l}$. Since the occurrence probability of $\bm{\mu}_{2}$ is low ($p$), Random Sampling without any strategy requires a label complexity of at least $\Omega(p^{-1}\sigma_{p}^{2}d\|\bm{\mu}_{2}\|_{2}^{-4})$ to capture sufficient instances of $\bm{\mu}_{2}$ from the training distribution. On the other hand, by Lemma G.19 and Lemma G.20, both Uncertainty Sampling and Diversity Sampling can effectively query yet-to-be-learned perplexing samples, which are typically samples associated with $\bm{\mu}_{2}$ by Lemma G.14.
Hence, both querying algorithms only require a label complexity of $\Omega(\sigma_{p}^{2}d\|\bm{\mu}_{2}\|_{2}^{-4})$.
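The $p^{-1}$ gap between the two label complexities can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: each randomly drawn sample carries the weak feature independently with probability `p`, and active querying is idealized so that every queried sample carries the weak feature. The names `random_draws_until`, `target_weak`, and the numerical values are hypothetical.

```python
import random

def random_draws_until(target_weak, p, rng):
    """Count i.i.d. draws needed until `target_weak` weak-feature samples are seen.

    Each draw carries the weak feature with probability p (passive sampling).
    """
    draws = 0
    weak_seen = 0
    while weak_seen < target_weak:
        draws += 1
        if rng.random() < p:
            weak_seen += 1
    return draws

rng = random.Random(0)     # fixed seed for reproducibility
p = 0.05                   # hypothetical occurrence probability of the weak feature
target_weak = 200          # hypothetical number of weak-feature samples required
passive_labels = random_draws_until(target_weak, p, rng)
# Idealized active querying: every queried sample carries the weak feature.
active_labels = target_weak
```

In expectation the passive count is `target_weak / p`, which mirrors the extra $p^{-1}$ factor in the bound for Random Sampling.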

Appendix H Proofs of Main Results: XOR data version

In this section, we first introduce some notation. We denote by $n$ the number of training data in the current labeled training set, which is initially $n_0$ and becomes $n_1$ after sampling (querying). We define $\mathbf{u}_l=\mathbf{a}_l+\mathbf{b}_l$ and $\mathbf{v}_l=\mathbf{a}_l-\mathbf{b}_l$. The proportion of easy-to-learn data $\bm{\mu}_1=\pm(\mathbf{a}_1\pm\mathbf{b}_1)$ in the current labeled set is denoted by $\tau_1$, while $\tau_2$ represents the proportion of hard-to-learn data $\bm{\mu}_2=\pm(\mathbf{a}_2\pm\mathbf{b}_2)$. As in the proofs provided in Appendix G, in this section we utilize the techniques employed in Kou et al. [2023b] and Meng et al. [2023] to obtain results that are not directly related to our main contribution. For the sake of brevity, we omit most of the proof details of those results, as our setting aligns with the one considered in Meng et al. [2023], except that we examine multiple task-oriented features. Instead, we focus on providing comprehensive proofs of our primary contributions.

First, we claim that all preliminary lemmas in Appendix G.1 hold with high probability. It is evident from Definition 8 that $F_{+1}(\mathbf{W}_{+1},\mathbf{x})$ always contributes to the prediction of class $+1$, while $F_{-1}(\mathbf{W}_{-1},\mathbf{x})$ always contributes to the prediction of class $-1$. Therefore, the roles of $\mathbf{w}_{+1,r}$ and $\mathbf{w}_{-1,r}$ are to learn $\pm\mathbf{u}$ and $\pm\mathbf{v}$, respectively. Then, similar to (18), we examine the coefficient updates via the signal-noise decomposition, specified as follows.

$$\mathbf{w}_{j,r}^{(t)}=\mathbf{w}_{j,r}^{(0)}+\sum_{l=1}^{2}\gamma_{j,r,\mathbf{u}_{l}}^{(t)}\cdot\dfrac{j\cdot\mathbf{u}_{l}}{\|\mathbf{u}_{l}\|_{2}^{2}}-\sum_{l=1}^{2}\gamma_{j,r,\mathbf{v}_{l}}^{(t)}\cdot\dfrac{j\cdot\mathbf{v}_{l}}{\|\mathbf{v}_{l}\|_{2}^{2}}+\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(t)}\cdot\dfrac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}+\sum_{i=1}^{n}\underline{\rho}_{j,r,i}^{(t)}\cdot\dfrac{\bm{\xi}_{i}}{\|\bm{\xi}_{i}\|_{2}^{2}}, \tag{56}$$

where we denote $\bar{\rho}_{j,r,i}^{(t)}$ as $\rho_{j,r,i}^{(t)}\mathbb{1}\left(\rho_{j,r,i}^{(t)}\geq 0\right)$ and $\underline{\rho}_{j,r,i}^{(t)}$ as $\rho_{j,r,i}^{(t)}\mathbb{1}\left(\rho_{j,r,i}^{(t)}\leq 0\right)$. Here $\gamma_{j,r,\mathbf{u}_{l}}^{(t)}$ are mainly contributed by $F_{+1}\left(\mathbf{W}_{+1},\mathbf{x}\right)$, and $\gamma_{\pm 1,r,\mathbf{u}_{l}}^{(t)}\approx\left\langle\mathbf{w}_{j,r}^{(t)},\pm\mathbf{u}\right\rangle$. Similarly, $\gamma_{j,r,\mathbf{v}_{l}}^{(t)}$ are mainly contributed by $F_{-1}\left(\mathbf{W}_{-1},\mathbf{x}\right)$, and $\gamma_{\pm 1,r,\mathbf{v}_{l}}^{(t)}\approx\left\langle\mathbf{w}_{j,r}^{(t)},\pm\mathbf{v}\right\rangle$. It is worth noting that $j\in\{\pm 1\}$ here denotes the sign of $\mathbf{u}_{l}$ and $\mathbf{v}_{l}$, rather than the sign $j^{\prime}$ of $F_{j^{\prime}}\left(\mathbf{W}_{j^{\prime}},\mathbf{x}\right)$, $j^{\prime}\in\{\pm 1\}$.
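To make the bookkeeping in (56) concrete, the following sketch assembles a weight vector from its coefficients and reads the coefficients back through inner products. It is an illustration under simplifying assumptions that are not made in the paper: $\mathbf{w}_{j,r}^{(0)}=\mathbf{0}$, and the vectors $\mathbf{u}_l$, $\mathbf{v}_l$, $\bm{\xi}_i$ are exactly mutually orthogonal (in the analysis they are only nearly orthogonal, which is why the relations hold with $\approx$). The helper names `compose` and `axis` and all numerical values are hypothetical.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def add_scaled(w, scale, x):
    """Return w + scale * x."""
    return [wi + scale * xi for wi, xi in zip(w, x)]


def compose(w0, j, gamma_u, gamma_v, rho_bar, rho_under, us, vs, xis):
    """Assemble w from its signal and noise coefficients as in (56)."""
    w = list(w0)
    for g, u in zip(gamma_u, us):      # + gamma_u * j * u / ||u||^2
        w = add_scaled(w, g * j / dot(u, u), u)
    for g, v in zip(gamma_v, vs):      # - gamma_v * j * v / ||v||^2
        w = add_scaled(w, -g * j / dot(v, v), v)
    for rb, ru, xi in zip(rho_bar, rho_under, xis):  # noise coefficients
        w = add_scaled(w, (rb + ru) / dot(xi, xi), xi)
    return w


def axis(k, scale, dim=6):
    """Scaled coordinate vector; gives exactly orthogonal toy directions."""
    return [scale if i == k else 0.0 for i in range(dim)]


# Toy, exactly orthogonal directions (assumption for illustration only).
us = [axis(0, 2.0), axis(1, 1.0)]
vs = [axis(2, 2.0), axis(3, 1.0)]
xis = [axis(4, 3.0), axis(5, 3.0)]
j = -1
gamma_u, gamma_v = [0.5, 0.25], [0.125, 0.0]
rho_bar, rho_under = [1.0, 0.0], [0.0, -0.5]

w = compose([0.0] * 6, j, gamma_u, gamma_v, rho_bar, rho_under, us, vs, xis)
signal_u = [dot(w, [j * c for c in u]) for u in us]  # recovers gamma_u
signal_v = [dot(w, [j * c for c in v]) for v in vs]  # recovers -gamma_v
noise = [dot(w, xi) for xi in xis]                   # recovers rho_bar + rho_under
```

With exact orthogonality and zero initialization, $\langle\mathbf{w},j\mathbf{u}_l\rangle$ returns $\gamma_{j,r,\mathbf{u}_l}$, $\langle\mathbf{w},j\mathbf{v}_l\rangle$ returns $-\gamma_{j,r,\mathbf{v}_l}$, and $\langle\mathbf{w},\bm{\xi}_i\rangle$ returns $\bar{\rho}_{j,r,i}+\underline{\rho}_{j,r,i}$.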

Specifically, the update rule can be written as:

$$\begin{aligned}
\mathbf{w}_{j,r}^{(t+1)}=\mathbf{w}_{j,r}^{(t)}
&-\frac{\eta j}{nm}\sum_{i\in S_{+\mathbf{u}_{l},+1}\cup S_{-\mathbf{u}_{l},-1}}\ell_{i}^{(t)}\mathbb{1}\left\{\left\langle\mathbf{w}_{j,r}^{(t)},\bm{\mu}_{i}\right\rangle>0\right\}\mathbf{u}_{l}+\frac{\eta j}{nm}\sum_{i\in S_{-\mathbf{u}_{l},+1}\cup S_{+\mathbf{u}_{l},-1}}{\ell_{i}^{\prime}}^{(t)}\mathbb{1}\left\{\left\langle\mathbf{w}_{j,r}^{(t)},\bm{\mu}_{i}\right\rangle>0\right\}\mathbf{u}_{l}\\
&+\frac{\eta j}{nm}\sum_{i\in S_{+\mathbf{v}_{l},-1}\cup S_{-\mathbf{v}_{l},+1}}{\ell_{i}^{\prime}}^{(t)}\mathbb{1}\left\{\left\langle\mathbf{w}_{j,r}^{(t)},\bm{\mu}_{i}\right\rangle>0\right\}\mathbf{v}_{l}-\frac{\eta j}{nm}\sum_{i\in S_{-\mathbf{v}_{l},-1}\cup S_{+\mathbf{v}_{l},+1}}\ell_{i}^{(t)}\mathbb{1}\left\{\left\langle\mathbf{w}_{j,r}^{(t)},\bm{\mu}_{i}\right\rangle>0\right\}\mathbf{v}_{l}\\
&-\frac{\eta}{nm}\sum_{i=1}^{n}{\ell_{i}^{\prime}}^{(t)}\left(jy_{i}\right)\mathbb{1}\left\{\left\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}_{i}\right\rangle>0\right\}\bm{\xi}_{i},
\end{aligned} \tag{57}$$

where $S_{\bm{\mu},j}=\{i\in[n]:\bm{\mu}_{i}=\bm{\mu},\,y_{i}=j\}$. Here $\bm{\mu}\in\{\pm\mathbf{u}_{1},\pm\mathbf{u}_{2},\pm\mathbf{v}_{1},\pm\mathbf{v}_{2}\}$, $j\in\{\pm 1\}$, and we let $\bm{\mu}_{i}$ represent the feature in $\mathbf{x}_{i}$ and $\bm{\xi}_{i}$ represent the noise in $\mathbf{x}_{i}$.

The following lemma shows that a specific discrete process can be bounded by its continuous counterpart, which is useful in bounding the coefficient $\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(t)}$ and the derivative of the loss function.

Lemma H.1.

(Lemma C.1 in Meng et al. [2023]) Suppose that a sequence $a_{t}$, $t\geq 0$, follows the iterative formula

$$a_{t+1}=a_{t}+\frac{c}{1+be^{a_{t}}},$$

for some $1\geq c\geq 0$ and $b\geq 0$. Then it holds that

$$x_{t}\leq a_{t}\leq\frac{c}{1+be^{a_{0}}}+x_{t}$$

for all $t\geq 0$. Here, $x_{t}$ is the unique solution of

$$x_{t}+be^{x_{t}}=ct+a_{0}+be^{a_{0}}.$$
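The sandwich in Lemma H.1 can be checked numerically. The sketch below iterates the discrete recursion, solves the implicit equation for $x_t$ by bisection (the left-hand side $x+be^{x}$ is strictly increasing in $x$), and compares the two. The function names and the values $a_0=0$, $b=1$, $c=0.5$ are hypothetical choices for illustration; the check is a sanity test on one instance, not a proof.

```python
import math


def discrete_path(a0, b, c, steps):
    """Iterate a_{t+1} = a_t + c / (1 + b * exp(a_t)) and return [a_0, ..., a_steps]."""
    path = [a0]
    for _ in range(steps):
        path.append(path[-1] + c / (1.0 + b * math.exp(path[-1])))
    return path


def continuous_x(t, a0, b, c, tol=1e-12):
    """Solve x + b*exp(x) = c*t + a0 + b*exp(a0) for x by bisection."""
    target = c * t + a0 + b * math.exp(a0)
    # The root lies in [a0, a0 + c*t]: the left-hand side is at most the
    # target at x = a0 and at least the target at x = a0 + c*t.
    lo, hi = a0, a0 + c * t
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid + b * math.exp(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


a0, b, c, steps = 0.0, 1.0, 0.5, 200          # hypothetical values with 0 <= c <= 1, b >= 0
a_path = discrete_path(a0, b, c, steps)
x_path = [continuous_x(t, a0, b, c) for t in range(steps + 1)]
gap = c / (1.0 + b * math.exp(a0))            # the additive slack in the upper bound
```

On this instance, every `a_path[t]` lies between `x_path[t]` and `x_path[t] + gap`, as the lemma predicts.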

H.1 Coefficient Ratio and Scale Analysis: XOR data version

Similar to the processes in Appendix G, we assume the results in the previous section hold with high probability. Meanwhile, let $T^{*}=\eta^{-1}\operatorname{poly}\left(\varepsilon^{-1},d,n,m\right)$ be the maximum admissible iteration. We adopt notation similar to that in (22):

$$\begin{aligned}
&\alpha:=4\log\left(T^{*}\right),\\
&\beta:=2\max_{l,i,j,r}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}_{l}\right\rangle\right|,\left|\left\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_{i}\right\rangle\right|\right\},\\
&\operatorname{SNR}_{l}:=\dfrac{\|\bm{\mu}_{l}\|_{2}}{\sigma_{p}\sqrt{d}},\\
&\kappa=56\sqrt{\frac{\log\left(6n^{2}/\delta\right)}{d}}\,n\log\left(T^{*}\right)+10\sqrt{\log(16mn/\delta)}\cdot\sigma_{0}\sigma_{p}\sqrt{d}+\sum_{l=1}^{2}64\tau_{l}n\cdot\operatorname{SNR}_{l}^{2}\log\left(T^{*}\right).
\end{aligned} \tag{58}$$
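For readers who wish to get a feel for the magnitudes in (58), the following sketch evaluates $\alpha$, $\beta$, $\operatorname{SNR}_l$ and $\kappa$ for given problem parameters. The function name `scale_constants`, its argument names, and all numerical values are hypothetical; in particular, the toy values are not claimed to satisfy Condition C.3, and `init_inner_products` stands in for the initial inner products $\langle\mathbf{w}_{j,r}^{(0)},\bm{\mu}_l\rangle$ and $\langle\mathbf{w}_{j,r}^{(0)},\bm{\xi}_i\rangle$.

```python
import math


def scale_constants(T_star, n, m, d, delta, sigma0, sigma_p,
                    mu_norms, taus, init_inner_products):
    """Evaluate alpha, beta, SNR_l and kappa as defined in (58)."""
    log_T = math.log(T_star)
    alpha = 4 * log_T
    beta = 2 * max(abs(v) for v in init_inner_products)
    snr = [mu / (sigma_p * math.sqrt(d)) for mu in mu_norms]
    # The three summands of kappa, in the order they appear in (58).
    cross_noise = 56 * math.sqrt(math.log(6 * n ** 2 / delta) / d) * n * log_T
    init_term = 10 * math.sqrt(math.log(16 * m * n / delta)) * sigma0 * sigma_p * math.sqrt(d)
    signal_term = sum(64 * tau * n * s ** 2 * log_T for tau, s in zip(taus, snr))
    kappa = cross_noise + init_term + signal_term
    return {"alpha": alpha, "beta": beta, "snr": snr,
            "kappa": kappa, "signal_term": signal_term}


consts = scale_constants(
    T_star=1e4, n=100, m=10, d=10_000, delta=0.1,
    sigma0=0.01, sigma_p=1.0,
    mu_norms=[10.0, 5.0],                    # ||mu_1||_2, ||mu_2||_2 (toy values)
    taus=[0.9, 0.1],                         # tau_1, tau_2 (toy values)
    init_inner_products=[0.02, -0.05, 0.01],
)
```

Splitting $\kappa$ into its three summands makes it easy to see which of them dominates for a given choice of $d$, $n$ and $\sigma_0$.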

Then, as in Proposition G.9, we obtain the following bounds on the coefficient scales.

Proposition H.2.

If Condition C.3 holds, then for any $0\leq t\leq T^{*}$, $j\in\{\pm 1\}$, $r\in[m]$ and $i\in[n]$, it holds that

$$\begin{aligned}
&0\leq\lvert\langle\mathbf{w}_{+1,r}^{(t)},\mathbf{u}_{l}\rangle\rvert,\ \lvert\langle\mathbf{w}_{-1,r}^{(t)},\mathbf{v}_{l}\rangle\rvert=\Theta(\gamma_{j,r,\mathbf{u}_{l}}^{(t)}),\ \Theta(\gamma_{j,r,\mathbf{v}_{l}}^{(t)})\leq 32\tau_{l}n\cdot\operatorname{SNR}_{l}^{2}\alpha,\\
&0\leq\bar{\rho}_{j,r,i}^{(t)}\leq 4\alpha,\quad 0\geq\underline{\rho}_{j,r,i}^{(t)}\geq-\beta-32\sqrt{\frac{\log\left(6n^{2}/\delta\right)}{d}}\,n\alpha,\\
&-\frac{\kappa}{2}+\frac{1}{m}\sum_{r=1}^{m}\bar{\rho}_{y_{i},r,i}^{(t)}\leq y_{i}f\left(\mathbf{W}^{(t)},\mathbf{x}_{i}\right)\leq\frac{\kappa}{2}+\frac{1}{m}\sum_{r=1}^{m}\bar{\rho}_{y_{i},r,i}^{(t)}.
\end{aligned}$$

Moreover, define $\bar{c}=\dfrac{2\eta\sigma_{p}^{2}d}{nm}$, $\underline{c}=\dfrac{\eta\sigma_{p}^{2}d}{3nm}$, $\bar{b}=e^{-\kappa}$ and $\underline{b}=e^{\kappa}$, and let $\bar{x}_{t},\underline{x}_{t}$ be the unique solutions of

$$\begin{aligned}
&\bar{x}_{t}+\bar{b}e^{\bar{x}_{t}}=\bar{c}t+\bar{b},\\
&\underline{x}_{t}+\underline{b}e^{\underline{x}_{t}}=\underline{c}t+\underline{b}.
\end{aligned}$$

Then it holds that

$$\underline{x}_{t}\leq\frac{1}{m}\sum_{r=1}^{m}\bar{\rho}_{y_{i},r,i}^{(t)}\leq\bar{x}_{t}+\bar{c}/(1+\bar{b}),\quad\log\left(\frac{\eta\sigma_{p}^{2}d}{8nm}t+2/3\right)\leq\bar{x}_{t},\underline{x}_{t}\leq\log\left(\frac{2\eta\sigma_{p}^{2}d}{nm}t+1\right) \tag{59}$$

for all $r\in[m]$ and $i\in[n]$.

Proof of Proposition H.2. Please refer to Proposition C.2, Proposition C.8 and Lemma C.9 in Meng et al. [2023] for a proof. Regardless of the variations in data settings, it is feasible to obtain the result through inductive techniques [Cao et al., 2022a, Frei et al., 2022, Kou et al., 2023b, Lu et al., 2023].

Building upon Proposition H.2, we can further analyze the convergence of the training dynamics by examining the extent of feature learning and noise memorization in the subsequent section.
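As a purely numerical illustration of the logarithmic envelopes in (59), the following sketch evaluates the lower bound $\log\left(\frac{\eta\sigma_p^2 d}{8nm}t+2/3\right)$ and the upper bound $\log\left(\frac{2\eta\sigma_p^2 d}{nm}t+1\right)$ over a few iterations. The function name `log_envelope` and all hyperparameter values are hypothetical choices made for illustration only; they are not taken from the paper's experiments.

```python
import math


def log_envelope(t, eta, sigma_p, d, n, m):
    """Return the (lower, upper) logarithmic bounds of (59) at iteration t."""
    rate = eta * sigma_p**2 * d / (n * m)
    lower = math.log(rate * t / 8 + 2 / 3)
    upper = math.log(2 * rate * t + 1)
    return lower, upper


# Hypothetical hyperparameters, chosen only to make the growth visible.
params = dict(eta=0.01, sigma_p=1.0, d=1000, n=50, m=10)

for t in (1, 10, 100, 1000, 10000):
    lo, hi = log_envelope(t, **params)
    print(f"t={t:>5d}  lower={lo:+.3f}  upper={hi:+.3f}")
```

Both envelopes grow like $\log t$ once $t$ exceeds the order of $nm/(\eta\sigma_p^2 d)$, which is the regime in which the $\Theta(\log(\cdot))$ scales of the next subsection apply.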

H.2 Feature Learning and Noise Memorization Analysis: XOR data version

Similar to Lemma G.13 and Lemma G.15 for linearly separable data, we can also determine the scale of coefficients and inner products as follows.

Proposition H.3.

Under Condition C.3, the following points hold ($n>n_{0}$) for all $l\in\{1,2\}$:

  1. For any $r\in[m]$, $\left\langle\mathbf{w}_{+1,r}^{(t)},\mathbf{u}_{l}\right\rangle$ (or $\left\langle\mathbf{w}_{-1,r}^{(t)},\mathbf{v}_{l}\right\rangle$) increases if $\left\langle\mathbf{w}_{+1,r}^{(0)},\mathbf{u}_{l}\right\rangle>0$ (or $\left\langle\mathbf{w}_{-1,r}^{(t)},\mathbf{v}_{l}\right\rangle<0$), and $\left\langle\mathbf{w}_{+1,r}^{(t)},\mathbf{u}_{l}\right\rangle$ (or $\left\langle\mathbf{w}_{-1,r}^{(t)},\mathbf{v}_{l}\right\rangle$) decreases if $\left\langle\mathbf{w}_{+1,r}^{(0)},\mathbf{u}_{l}\right\rangle<0$ (or $\left\langle\mathbf{w}_{-1,r}^{(t)},\mathbf{v}_{l}\right\rangle>0$). Moreover, it holds that

    $$\gamma_{j,r,\mathbf{u}_{l}}^{(t)},\ \gamma_{j,r,\mathbf{v}_{l}}^{(t)}=\Theta\left(\frac{\tau_{l}n\|\bm{\mu}_{l}\|_{2}^{2}}{\sigma_{p}^{2}d}\cdot\log\left(\frac{\eta\sigma_{p}^{2}dt}{nm}\right)\right),\quad\left|\left\langle\mathbf{w}_{+1,r}^{(t)},\mathbf{u}_{l}\right\rangle\right|,\ \left|\left\langle\mathbf{w}_{-1,r}^{(t)},\mathbf{v}_{l}\right\rangle\right|=\Theta\left(\frac{\tau_{l}n\|\bm{\mu}_{l}\|_{2}^{2}}{\sigma_{p}^{2}d}\cdot\log\left(\frac{\eta\sigma_{p}^{2}dt}{nm}\right)\right),\tag{60}$$
    $$\left|\left\langle\mathbf{w}_{-1,r}^{(t)},\mathbf{u}_{l}\right\rangle\right|\leq\left|\left\langle\mathbf{w}_{-1,r}^{(0)},\mathbf{u}_{l}\right\rangle\right|+\eta\|\bm{\mu}_{l}\|_{2}^{2}/m,\quad\left|\left\langle\mathbf{w}_{+1,r}^{(t)},\mathbf{v}_{l}\right\rangle\right|\leq\left|\left\langle\mathbf{w}_{+1,r}^{(0)},\mathbf{v}_{l}\right\rangle\right|+\eta\|\bm{\mu}_{l}\|_{2}^{2}/m.$$
  2. With $\bar{x}_{t}$ and $\underline{x}_{t}$ as defined in Proposition H.2, we have

    $$\Omega(n)\leq\frac{n}{5}\cdot\left(\bar{x}_{t-1}-\bar{x}_{1}\right)\leq\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(t)}\leq 3n\underline{x}_{t}\leq 3n\cdot\log\left(\frac{2\eta\sigma_{p}^{2}d}{nm}t+1\right)=\Theta\left(n\cdot\log\left(\frac{\eta\sigma_{p}^{2}dt}{nm}\right)\right),\tag{61}$$

    for all $t\in\left[T^{*}\right]$ and $r\in[m]$. Moreover, we have:

    $$\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(t)}/\gamma_{\bm{\mu}_{l},j^{\prime},r^{\prime},l}^{(t)}=\Theta\left(\tau_{l}^{-1}\cdot\operatorname{SNR}_{l}^{-2}\right)=\sum_{i=1}^{n}\bar{\rho}_{j,r,i}^{(t)}/\left|\left\langle\mathbf{w}_{\pm 1,r^{\prime}}^{(t)},\bm{\mu}_{l}\right\rangle\right|,$$

    for all $j,j^{\prime}\in\{\pm 1\}$ and $r,r^{\prime}\in[m]$.

  3. For $t=\Omega\left(nm/\left(\eta\sigma_{p}^{2}d\right)\right)$, the bound for $\left\|\mathbf{w}_{j,r}^{(t)}\right\|_{2}$ is given by:

    $$\left\|\mathbf{w}_{j,r}^{(t)}\right\|_{2}=\Theta\left(\sigma_{p}^{-1}d^{-1/2}n^{1/2}\cdot\log\left(\frac{\eta\sigma_{p}^{2}dt}{nm}\right)\right).\tag{62}$$

Proof of Proposition H.3. The basic techniques are the same as those of Lemma G.13 and Lemma G.15, despite the variation in data settings. Please refer to Proposition 4.2 and Propositions D.3-5 in Meng et al. [2023] for a comprehensive proof.

H.3 Order-dependent Sampling (Querying) Analysis: XOR data version

Based on the scale of $\mathbf{w}_{j,r}^{(t)}$ and its inner products with the features, we can now characterize the querying behavior of the two NAL methods under their respective query criteria. Similar to the order-dependent analysis techniques utilized in Appendix G.4, we employ a full-order-based technique to tackle the problem of $\Theta(\lvert\mathcal{P}\rvert^{2})$ comparisons in $\mathcal{P}$. The concepts of Uncertainty Order and Diversity Order are introduced in Appendix F.2. We then proceed to examine the order of the samples in $\mathcal{P}$ in the following proposition.

Proposition H.4.

Under the same conditions as Proposition C.5, there exists $t=\widetilde{O}\left(\eta^{-1}\varepsilon^{-1}mnd^{-1}\sigma_{p}^{-2}\right)$ such that for all $\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{P}\subsetneq\mathcal{D}$, where $\mathbf{x}$ contains a hard-to-learn feature patch and $\mathbf{x}^{\prime}$ contains an easy-to-learn feature patch, with probability at least $1-\delta^{\prime}$ we have $\mathbf{x}^{\prime}\preceq^{(t)}\mathbf{x}$.

Proof of Proposition H.4. First, suppose $\mathbf{x}=[y\cdot\bm{\mu}_{2},\mathbf{z}_{2}]$ and $\mathbf{x}^{\prime}=[y^{\prime}\cdot\bm{\mu}_{1},\mathbf{z}_{1}]$, where $\bm{\mu}_{1}\in\{\mathbf{u}_{1},\mathbf{v}_{1}\}$, $\bm{\mu}_{2}\in\{\mathbf{u}_{2},\mathbf{v}_{2}\}$, $y,y^{\prime}\in\{\pm 1\}$, and $\mathbf{z}_{1},\mathbf{z}_{2}\sim N(\mathbf{0},\sigma_{p}^{2}\cdot\mathbf{I})$. Then:

$$\begin{aligned}
f\left(\mathbf{W}^{(t)},\mathbf{x}\right)&=\sum_{j,r}\frac{j}{m}\left[\sigma\left(\left\langle\mathbf{w}_{j,r}^{(t)},y\bm{\mu}_{2}\right\rangle\right)+\sigma\left(\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{2}\right\rangle\right)\right],\\
f\left(\mathbf{W}^{(t)},\mathbf{x}^{\prime}\right)&=\sum_{j,r}\frac{j}{m}\left[\sigma\left(\left\langle\mathbf{w}_{j,r}^{(t)},y^{\prime}\bm{\mu}_{1}\right\rangle\right)+\sigma\left(\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{1}\right\rangle\right)\right].
\end{aligned}$$

By (11) in Lemma 11 and (16) in Definition 16, we have the following

$$\begin{aligned}
\mathbf{x}^{\prime}\preceq_{C}^{(t)}\mathbf{x}&\Leftrightarrow\underbrace{\left|f\left(\mathbf{W}^{(t)},\mathbf{x}\right)\right|<\left|f\left(\mathbf{W}^{(t)},\mathbf{x}^{\prime}\right)\right|}_{\Omega_{C}},\\
\mathbf{x}^{\prime}\preceq_{D}^{(t)}\mathbf{x}&\Leftrightarrow\underbrace{D\left(\mathbf{W}^{(t)},\mathbf{x},p\mid\mathcal{D}_{n_{0}}\right)>D\left(\mathbf{W}^{(t)},\mathbf{x}^{\prime},p\mid\mathcal{D}_{n_{0}}\right)}_{\Omega_{D}},\\
\mathbf{x}^{\prime}\preceq^{(t)}\mathbf{x}&\Leftrightarrow\underbrace{\left\{\Omega_{C}\cap\Omega_{D},\ \forall p\in\left[1,\infty\right)\right\}}_{\Omega}
\end{aligned}$$
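The event $\Omega_{C}$ can be illustrated with a small numerical sketch of the two-layer network $f$ given above. The sketch assumes that $\sigma$ is the ReLU activation and uses hand-picked toy weights in which the $j=+1$ neurons align strongly with $\bm{\mu}_{1}$ and only weakly with $\bm{\mu}_{2}$; the function names `forward` and `less_uncertain`, the dimensions, and all numerical values are hypothetical and serve illustration only. The diversity criterion $\Omega_{D}$ is not covered here.

```python
import numpy as np


def relu(a):
    # Assumed activation sigma; this is an assumption of the sketch.
    return np.maximum(a, 0.0)


def forward(W_pos, W_neg, signal, noise):
    """f(W, x) = sum_{j,r} (j/m) [sigma(<w_{j,r}, signal>) + sigma(<w_{j,r}, noise>)]."""
    m = W_pos.shape[0]
    pos = relu(W_pos @ signal) + relu(W_pos @ noise)  # neurons with j = +1
    neg = relu(W_neg @ signal) + relu(W_neg @ noise)  # neurons with j = -1
    return (pos.sum() - neg.sum()) / m


def less_uncertain(W_pos, W_neg, x_prime, x):
    """Event Omega_C: |f(W, x)| < |f(W, x')|, i.e. the model is less confident on x."""
    return abs(forward(W_pos, W_neg, *x)) < abs(forward(W_pos, W_neg, *x_prime))


rng = np.random.default_rng(0)
d, m, sigma_p = 50, 4, 0.01

# Orthogonal toy features: mu1 plays the easy-to-learn role, mu2 the hard-to-learn role.
mu1, mu2 = np.zeros(d), np.zeros(d)
mu1[0], mu2[1] = 2.0, 2.0

# Toy weights: well-learned mu1 direction, barely-learned mu2 direction.
W_pos = np.tile(1.0 * mu1 / 2 + 0.1 * mu2 / 2, (m, 1))
W_neg = np.zeros((m, d))

# Both samples carry label +1; each is a (signal patch, noise patch) pair.
x_prime = (mu1, sigma_p * rng.standard_normal(d))
x = (mu2, sigma_p * rng.standard_normal(d))

print(less_uncertain(W_pos, W_neg, x_prime, x))
```

In this toy configuration the network output on $\mathbf{x}$ has a much smaller magnitude than on $\mathbf{x}^{\prime}$, so an uncertainty-based query would rank $\mathbf{x}$, the sample carrying the less-learned feature, first.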

Denote $\sum_{j}j\cdot\sigma\left(\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{1}\right\rangle\right)$ and $\sum_{j}j\cdot\sigma\left(\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{2}\right\rangle\right)$ as $g_{r}(\mathbf{z}_{1})$ and $g_{r}(\mathbf{z}_{2})$, respectively. Notice that for $\mathbf{z}\sim N(\mathbf{0},\sigma_{p}^{2}\cdot\mathbf{I})$:

$$\begin{aligned}
\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}\right\rangle&\sim\mathcal{N}\left(0,\left\|\mathbf{w}_{j,r}^{(t)}\right\|_{2}^{2}\sigma_{p}^{2}\cdot\mathbf{I}\right),\\
\sigma\left(\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}\right\rangle\right)&\sim\mathcal{N}^{R}\left(0,\left\|\mathbf{w}_{j,r}^{(t)}\right\|_{2}^{2}\sigma_{p}^{2}\cdot\mathbf{I}\right).
\end{aligned}\tag{63}$$

Then:

$$\begin{aligned}
P(\Omega_{C})&=P\left(\left|f\left(\mathbf{W}^{(t)},\mathbf{x}\right)\right|<\left|f\left(\mathbf{W}^{(t)},\mathbf{x}^{\prime}\right)\right|\right)\\
&\geq P\left(\sum_{l}\Big(\sum_{r}\lvert g_{r}(\mathbf{z}_{l})\rvert\Big)<\sum_{r}\Big(\Theta(\gamma_{y^{\prime},r,\bm{\mu}_{1}})-\Theta(\gamma_{y,r,\bm{\mu}_{2}})\Big)\right)\\
&\geq P\left(m\cdot\max_{j,r,l}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\right\rangle\right|\right\}<m\Big(\Theta\big(\underset{r}{\mathbb{E}}(\gamma_{y^{\prime},r,\bm{\mu}_{1}})\big)-\Theta\big(\underset{r}{\mathbb{E}}(\gamma_{y,r,\bm{\mu}_{2}})\big)\Big)\right)\\
&=P\Big(\underbrace{\max_{j,r,l}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\right\rangle\right|\right\}<\Theta\big(\underset{r}{\mathbb{E}}(\gamma_{y^{\prime},r,\bm{\mu}_{1}})-\underset{r}{\mathbb{E}}(\gamma_{y,r,\bm{\mu}_{2}})\big)}_{\Omega_{\gamma}}\Big).
\end{aligned}\tag{64}$$

The second line follows from the triangle inequality and (60) in Proposition H.3; the third line follows from (63).
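The Gaussian scale used in (63) can be sanity-checked by simulation. The sketch below draws $\mathbf{z}\sim N(\mathbf{0},\sigma_{p}^{2}\mathbf{I})$ for a fixed, hypothetical weight vector `w` and compares the empirical standard deviation of $\langle\mathbf{w},\mathbf{z}\rangle$ with $\|\mathbf{w}\|_{2}\sigma_{p}$. It also compares the empirical mean of the rectified projection with $\|\mathbf{w}\|_{2}\sigma_{p}/\sqrt{2\pi}$, the mean of a rectified zero-mean Gaussian, assuming that $\sigma$ is ReLU. All sizes and values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma_p, n_samples = 200, 0.5, 20_000

# Hypothetical fixed neuron weight; any fixed vector would do.
w = rng.standard_normal(d) / np.sqrt(d)

# Rows are independent draws z ~ N(0, sigma_p^2 I).
z = sigma_p * rng.standard_normal((n_samples, d))

proj = z @ w                                  # samples of <w, z>
std_theory = np.linalg.norm(w) * sigma_p      # predicted standard deviation
relu_mean_theory = std_theory / np.sqrt(2 * np.pi)  # mean of max(0, N(0, s^2))

print("empirical std      :", proj.std())
print("theoretical std    :", std_theory)
print("empirical ReLU mean:", np.maximum(proj, 0.0).mean())
print("theoretical mean   :", relu_mean_theory)
```

Because each noise projection concentrates at scale $\|\mathbf{w}_{j,r}^{(t)}\|_{2}\sigma_{p}$, the maximum over $j,r,l$ in $\Omega_{\gamma}$ can be compared with the gap between the feature coefficients, which is what the probability bound above relies on.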

For $\Omega_{D}$, denote by $U_{0}^{l}=\{\mathbf{x}\in\mathcal{D}_{0}\mid\mathbf{x}_{\text{signal part}}=\bm{\mu}_{l}\}$ the set of indices of $\mathcal{D}_{0}$ where the data's feature patch is $\bm{\mu}_{l}$. We then look at the $r^{th}$ row of the Feature Distance $\mathbf{Z}(\mathbf{x},t)$, which we denote as $\mathbf{Z}_{r}(\mathbf{x},t)$:

$$\begin{aligned}
\mathbf{Z}_{r}(\mathbf{x},t)&=\sum_{j}\left(\sigma\left(\left\langle\mathbf{w}_{j,r},y\cdot\bm{\mu}_{2}\right\rangle\right)+\sigma\left(\left\langle\mathbf{w}_{j,r},\mathbf{z}_{r}\right\rangle\right)\right)\\
&=\Theta\left(\gamma_{y,r,\bm{\mu}_{2}}\right)+g_{r}(\mathbf{z}_{2})
\end{aligned}\tag{65}$$
$$\begin{aligned}
\sum_{i}\frac{\mathbf{Z}_{r}(\mathbf{x}^{(i)},t)}{n_{0}}&=\sum_{i,j}\frac{\sigma\left(\left\langle\mathbf{w}_{j,r},y_{i}\cdot\bm{\mu}^{(i)}\right\rangle\right)+\sigma\left(\left\langle\mathbf{w}_{j,r},\bm{\xi}_{i}\right\rangle\right)}{n_{0}}\\
&=\frac{\left[\sum_{l}\tau_{l}\cdot n_{0}\cdot\underset{i_{l}\in U_{0}^{l}}{\mathbb{E}}\Theta(\gamma_{y_{i_{l}},r,\bm{\mu}_{l}})+\sum_{i}\sum_{j}\Theta\left(\bar{\rho}_{j,r,i}\right)\right]}{n_{0}}
\end{aligned}\tag{66}$$

Subtracting (66) from (65), we have:

\begin{equation}
\mathbf{Z}_{r}(\mathbf{x},t)-\sum_{i}\dfrac{\mathbf{Z}_{r}(\mathbf{x}^{(i)},t)}{n_{0}}=\Theta(\gamma_{y,r,\bm{\mu}_{2}})+g_{r}(\mathbf{z}_{2})-\sum_{i}\dfrac{\mathbf{Z}_{r}(\mathbf{x}^{(i)},t)}{n_{0}} \tag{67}
\end{equation}

Now we can estimate $D\left(\mathbf{W}^{(t)},\mathbf{x},p\mid\mathcal{D}_{n_{0}}\right)$:

\begin{align}
D\left(\mathbf{W}^{(t)},\mathbf{x},p\mid\mathcal{D}_{n_{0}}\right) &= \left\|\mathbf{Z}(\mathbf{x},t)-\sum_{i=1}^{n_{0}}\dfrac{\mathbf{Z}(\mathbf{x}^{(i)},t)}{n_{0}}\right\|_{p} \tag{68}\\
&= \left(\sum_{r}\left\lvert\mathbf{Z}_{r}(\mathbf{x},t)-\sum_{i}\dfrac{\mathbf{Z}_{r}(\mathbf{x}^{(i)},t)}{n_{0}}\right\rvert^{p}\right)^{\frac{1}{p}} \notag\\
&= \left(\sum_{r}\left\lvert\Theta(\gamma_{y,r,\bm{\mu}_{2}})+g_{r}(\mathbf{z}_{2})-\sum_{i}\dfrac{\mathbf{Z}_{r}(\mathbf{x}^{(i)},t)}{n_{0}}\right\rvert^{p}\right)^{\frac{1}{p}} \notag
\end{align}
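As a numerical illustration of the distance in (68), the following minimal sketch computes the $p$-norm gap between a candidate sample's feature vector and the mean feature vector of the labeled set. It is only a sketch under stated assumptions: $\sigma$ is taken to be ReLU, the filters are stored in a hypothetical array `W` of shape `(J, m, d)`, and each sample is a hypothetical `(signal, noise)` pair. None of these names come from the paper.

```python
import numpy as np

def feature_map(W, signal, noise):
    """Z_r = sum_j [sigma(<w_{j,r}, signal>) + sigma(<w_{j,r}, noise>)], as in (65).

    Assumption: sigma is ReLU. W has shape (J, m, d); signal and noise have shape (d,).
    """
    relu = lambda a: np.maximum(a, 0.0)
    # W @ v has shape (J, m); summing over j leaves one entry per filter r.
    return (relu(W @ signal) + relu(W @ noise)).sum(axis=0)

def diversity_score(W, x, labeled, p=2.0):
    """p-norm distance between Z(x) and the mean of Z over the labeled set, as in (68)."""
    z_x = feature_map(W, *x)
    z_mean = np.mean([feature_map(W, s, n) for s, n in labeled], axis=0)
    return float(np.linalg.norm(z_x - z_mean, ord=p))

# Toy setup (hypothetical values): J = 2 output signs, m = 1 filter, d = 2.
W = np.array([[[1.0, 0.0]], [[-1.0, 0.0]]])
x = (np.array([2.0, 0.0]), np.zeros(2))          # candidate sample
labeled = [(np.array([1.0, 0.0]), np.zeros(2))]  # initial labeled set of size 1
score = diversity_score(W, x, labeled, p=2.0)
```

A sample whose features match the labeled-set average scores zero, while a sample carrying a feature that the labeled set under-represents scores higher, which is the quantity the comparison between $\mathbf{x}$ and $\mathbf{x}^{\prime}$ relies on.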

Similarly, $D\left(\mathbf{W}^{(t)},\mathbf{x}^{\prime},p\mid\mathcal{D}_{n_{0}}\right)$ can be written as:

\begin{equation}
D\left(\mathbf{W}^{(t)},\mathbf{x}^{\prime},p\mid\mathcal{D}_{n_{0}}\right)=\left(\sum_{r}\left\lvert\Theta(\gamma_{y,r,\bm{\mu}_{1}})+g_{r}(\mathbf{z}_{1})-\sum_{i}\dfrac{\mathbf{Z}_{r}(\mathbf{x}^{(i)},t)}{n_{0}}\right\rvert^{p}\right)^{\frac{1}{p}} \tag{69}
\end{equation}

To compare $D\left(\mathbf{W}^{(t)},\mathbf{x},p\mid\mathcal{D}_{n_{0}}\right)$ and $D\left(\mathbf{W}^{(t)},\mathbf{x}^{\prime},p\mid\mathcal{D}_{n_{0}}\right)$, we first observe that, in the $r$-th filter, both expressions contain the common term

\[
-\sum_{i}\dfrac{\mathbf{Z}_{r}(\mathbf{x}^{(i)},t)}{n_{0}}=-\sum_{l}\tau_{l}\cdot\Theta\left(\underset{i_{l}\in U_{0}^{l}}{\mathbb{E}}(\gamma_{y_{i_{l}},r,\bm{\mu}_{l}})\right)-n_{0}^{-1}\sum_{i}\sum_{j}\Theta\left(\bar{\rho}_{j,r,i}\right).
\]

By Condition C.3, it holds that $\sigma_{p}^{2}d/(n_{0}\|\bm{\mu}_{1}\|_{2}^{2})=\Omega(\log(T^{*}))$. Since $T^{*}$ is the substantially large maximum admissible number of iterations, combining this with (60), (66) and (63) shows that the order of the noise term $n_{0}^{-1}\sum_{i,j}\sigma\left(\left\langle\mathbf{w}_{j,r},\bm{\xi}_{i}\right\rangle\right)=n_{0}^{-1}\sum_{i}\sum_{j}\Theta\left(\bar{\rho}_{j,r,i}\right)$ in $\sum_{i}\dfrac{\mathbf{Z}_{r}(\mathbf{x}^{(i)},t)}{n_{0}}$ can indeed dominate the orders of $n_{0}^{-1}\sum_{i,j}\sigma\left(\left\langle\mathbf{w}_{j,r},y_{i}\cdot\bm{\mu}^{(i)}\right\rangle\right)=\sum_{l}\tau_{l}\cdot\Theta\left(\underset{i_{l}\in U_{0}^{l}}{\mathbb{E}}(\gamma_{y_{i_{l}},r,\bm{\mu}_{l}})\right)$, $\Theta(\gamma_{y,r,\bm{\mu}_{1}})$ and $g_{r}(\mathbf{z}_{1})$. As $\sum_{i}\dfrac{\mathbf{Z}_{r}(\mathbf{x}^{(i)},t)}{n_{0}}$ is shared by both $D\left(\mathbf{W}^{(t)},\mathbf{x},p\mid\mathcal{D}_{n_{0}}\right)$ and $D\left(\mathbf{W}^{(t)},\mathbf{x}^{\prime},p\mid\mathcal{D}_{n_{0}}\right)$ in the $r$-th filter, a sufficient event for $D\left(\mathbf{W}^{(t)},\mathbf{x},p\mid\mathcal{D}_{n_{0}}\right)>D\left(\mathbf{W}^{(t)},\mathbf{x}^{\prime},p\mid\mathcal{D}_{n_{0}}\right)$ is that, for all $r\in[m]$, it holds that

\[
\left\lvert\sum_{l}\tau_{l}\cdot\Theta\left(\underset{i_{l}\in U_{0}^{l}}{\mathbb{E}}(\gamma_{y_{i_{l}},r,\bm{\mu}_{l}})\right)-\Theta(\gamma_{y,r,\bm{\mu}_{2}})-g_{r}(\mathbf{z}_{2})\right\rvert>\left\lvert\max\left\{\sum_{l}\tau_{l}\cdot\Theta\left(\underset{i_{l}\in U_{0}^{l}}{\mathbb{E}}(\gamma_{y_{i_{l}},r,\bm{\mu}_{l}})\right)-\Theta(\gamma_{y,r,\bm{\mu}_{1}})-g_{r}(\mathbf{z}_{1}),\,0\right\}\right\rvert.
\]

Using these results, we can now estimate the probability of the event $\Omega_{D}$:

\begin{align}
P(\Omega_{D}) &= P\left(D\left(\mathbf{W}^{(t)},\mathbf{x},p\mid\mathcal{D}_{n_{0}}\right)>D\left(\mathbf{W}^{(t)},\mathbf{x}^{\prime},p\mid\mathcal{D}_{n_{0}}\right)\right) \tag{70}\\
&\geq P\Bigg(m^{\frac{1}{p}}\sum_{l}\Big(\max_{r}\lvert g_{r}(\mathbf{z}_{l})\rvert\Big)<m^{\frac{1}{p}}\Bigg(\bigg\lvert\Theta\Big(\underset{r}{\mathbb{E}}(\gamma_{y,r,\bm{\mu}_{2}})\Big)-\sum_{l}\tau_{l}\cdot\Theta\Big(\underset{i_{l}\in U_{0}^{l},r}{\mathbb{E}}(\gamma_{y_{i_{l}},r,\bm{\mu}_{l}})\Big)\bigg\rvert \notag\\
&\qquad\qquad-\bigg\lvert\Theta\Big(\underset{r}{\mathbb{E}}(\gamma_{y,r,\bm{\mu}_{1}})\Big)-\sum_{l}\tau_{l}\cdot\Theta\Big(\underset{i_{l}\in U_{0}^{l},r}{\mathbb{E}}(\gamma_{y_{i_{l}},r,\bm{\mu}_{l}})\Big)\bigg\rvert\Bigg)\Bigg) \notag\\
&\geq P\left(m^{\frac{1}{p}}\max_{j,r,l}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\right\rangle\right|\right\}<m^{\frac{1}{p}}\left((\tau_{1}-\tau_{2})\Theta\Big(\underset{j,r}{\mathbb{E}}(\gamma_{j,r,\bm{\mu}_{1}})\Big)-(\tau_{1}-\tau_{2})\Theta\Big(\underset{j,r}{\mathbb{E}}(\gamma_{j,r,\bm{\mu}_{2}})\Big)\right)\right) \notag\\
&= P\left(m^{\frac{1}{p}}\max_{j,r,l}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\right\rangle\right|\right\}<m^{\frac{1}{p}}\,\Theta\left(\dfrac{\tau_{1}(\tau_{1}-\tau_{2})\|\bm{\mu}_{1}\|_{2}^{2}-\tau_{2}(\tau_{1}-\tau_{2})\|\bm{\mu}_{2}\|_{2}^{2}}{\sigma_{p}^{2}d/n_{0}}\right)\cdot\log\left(\dfrac{\eta\sigma_{p}^{2}dt}{nm}\right)\right) \notag\\
&= P\left(m^{\frac{1}{p}}\max_{j,r,l}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\right\rangle\right|\right\}<m^{\frac{1}{p}}\,\Theta\Big(\underset{r}{\mathbb{E}}(\gamma_{y^{\prime},r,\bm{\mu}_{1}})-\underset{r}{\mathbb{E}}(\gamma_{y,r,\bm{\mu}_{2}})\Big)\right) \notag\\
&= P\Bigg(\underbrace{\max_{j,r,l}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\right\rangle\right|\right\}<\Theta\Big(\underset{r}{\mathbb{E}}(\gamma_{y^{\prime},r,\bm{\mu}_{1}})-\underset{r}{\mathbb{E}}(\gamma_{y,r,\bm{\mu}_{2}})\Big)}_{\Omega_{\gamma}}\Bigg), \notag
\end{align}

where the first inequality follows from the triangle inequality, (68) and (69); the fourth equality is by (63). It is easy to see that if $p=\infty$, the expression in the third equality would be zero, so our condition $p<\infty$ avoids this case. We now examine the event $\Omega_{\gamma}$:

\begin{align}
P(\Omega_{\gamma}) &= P\left(\max_{j,r,l}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\right\rangle\right|\right\}<\Theta\Big(\underset{r}{\mathbb{E}}(\gamma_{y^{\prime},r,\bm{\mu}_{1}})-\underset{r}{\mathbb{E}}(\gamma_{y,r,\bm{\mu}_{2}})\Big)\right) \tag{71}\\
&= P\left(\max_{j,r,l}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\right\rangle\right|\right\}<\Theta\left(\dfrac{\left[\tau_{1}\left\|\bm{\mu}_{1}\right\|_{2}^{2}-\tau_{2}\left\|\bm{\mu}_{2}\right\|_{2}^{2}\right]}{\sigma_{p}^{2}d/n_{0}}\cdot\log\left(\dfrac{\eta\sigma_{p}^{2}dt}{nm}\right)\right)\right) \notag\\
&\geq P\Bigg(\bigcup_{j,r,l}\underbrace{\left\{\left|\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\right\rangle-0\right|<\Theta\left(\dfrac{\left[\tau_{1}\left\|\bm{\mu}_{1}\right\|_{2}^{2}-\tau_{2}\left\|\bm{\mu}_{2}\right\|_{2}^{2}\right]}{\sigma_{p}^{2}d/n_{0}}\cdot\log\left(\dfrac{\eta\sigma_{p}^{2}dt}{nm}\right)\right)\right\}}_{\hat{\Omega}_{j,r,l}}\Bigg) \notag\\
&= \sum_{j,r,l}P(\hat{\Omega}_{j,r,l}), \notag
\end{align}

where the second equality follows from the first inference statement of Lemma G.14, the subsequent inequality follows from expressing the event as a union of events, and the last equality follows from the union rule. Then, by the Gaussian tail bound, we have:

$$P(\hat{\Omega}_{j,r,l})\geq 1-2\exp\left\{-\Theta\left(\dfrac{\left[\tau_{1}\left\|\bm{\mu}_{1}\right\|_{2}^{2}-\tau_{2}\left\|\bm{\mu}_{2}\right\|_{2}^{2}\right]^{2}}{\sigma_{p}^{6}d^{2}/n_{0}^{2}\left\|w_{j,r}^{(t)}\right\|_{2}^{2}}\cdot\log\left(\dfrac{\eta\sigma_{p}^{2}dt}{nm}\right)\right)\right\}$$
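As a quick numerical sanity check of the kind of Gaussian tail bound invoked here, the following minimal sketch compares a Monte Carlo estimate of $P(|\langle\mathbf{w},\mathbf{z}\rangle|<s)$ with the lower bound $1-2\exp\{-s^{2}/(2\sigma_{p}^{2}\|\mathbf{w}\|_{2}^{2})\}$. It assumes $\mathbf{z}\sim N(0,\sigma_{p}^{2}\mathbf{I})$, so that the inner product is a centered Gaussian with standard deviation $\sigma_{p}\|\mathbf{w}\|_{2}$. The function names, the vector `w`, and all numeric values are illustrative assumptions and do not come from the paper.

```python
import math
import random


def gaussian_tail_lower_bound(s, sigma_p, w_norm):
    """Lower bound 1 - 2*exp(-s^2 / (2*sigma_p^2*||w||^2)) on P(|<w, z>| < s)."""
    return 1.0 - 2.0 * math.exp(-s ** 2 / (2.0 * sigma_p ** 2 * w_norm ** 2))


def empirical_small_inner_product(w, sigma_p, s, trials, seed=0):
    """Monte Carlo estimate of P(|<w, z>| < s), assuming z ~ N(0, sigma_p^2 I)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        inner = sum(w_k * rng.gauss(0.0, sigma_p) for w_k in w)
        if abs(inner) < s:
            hits += 1
    return hits / trials


# Hypothetical values chosen only for illustration.
w = [0.3, -0.1, 0.2, 0.4, -0.25]
sigma_p = 1.5
w_norm = math.sqrt(sum(w_k ** 2 for w_k in w))
s = 2.0 * sigma_p * w_norm  # threshold at two standard deviations

bound = gaussian_tail_lower_bound(s, sigma_p, w_norm)
estimate = empirical_small_inner_product(w, sigma_p, s, trials=20000)
print(f"lower bound = {bound:.4f}, Monte Carlo estimate = {estimate:.4f}")
```

The estimate should sit above the bound, since the tail bound is not tight; the sketch only illustrates the direction of the inequality and says nothing about the constants hidden in $\Theta(\cdot)$.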

Finally, under the conditions in Proposition C.5, Lemma 17, Proposition H.3 and the union bound, we have the conclusion for event $\Omega$:

$$\Rightarrow P(\Omega)\geq P(\Omega_{\gamma})\geq 1-8m\exp\left\{-\Theta\left(\frac{\left[\tau_{1}\left\|\bm{\mu}_{1}\right\|_{2}^{2}-\tau_{2}\left\|\bm{\mu}_{2}\right\|_{2}^{2}\right]^{2}}{\sigma_{p}^{4}d/n_{0}}\right)\right\}\geq 1-\delta^{\prime},\tag{72}$$

for all $p\in\left[1,\infty\right)$.

Remark H.5.

The proof process is nearly identical to that of the linearly separable case (i.e., the proof of Proposition G.16). The only differences lie in the scale of $\|w_{j,r}^{(t)}\|_{2}$ and $\gamma_{\pm 1,r,\bm{\mu}}$, but the conditions required are the same.

Similar to Lemma G.18 in Appendix G.4, we have the following lemma.

Lemma H.6.

Under the same conditions as in Proposition 3, with the same notation as in Proposition H.4, there exist constants $c_{1},c_{2},c_{3},c_{4},c_{5},c_{6}>0$ such that

  • $\mathbf{x}\preceq_{C}^{(t)}\mathbf{x}^{\prime}$ has a sufficient event that

    $$\left\{c_{1}\underset{r}{\mathbb{E}}(\gamma_{y^{\prime},r,\bm{\mu}_{1}})-c_{2}\underset{r}{\mathbb{E}}(\gamma_{y,r,\bm{\mu}_{2}})>\max_{j,r,l}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\right\rangle\right|\right\}\right\},\tag{73}$$

    where the left-hand side of the inequality compares the learning progress on samples with different types of feature patch.

  • $\mathbf{x}\preceq_{D}^{(t)}\mathbf{x}^{\prime},\ \forall p\in[1,\infty)$ has a sufficient event that

    $$\left\{\left\lvert c_{3}\underset{r}{\mathbb{E}}(\gamma_{y,r,\bm{\mu}_{2}})-c_{4}\sum_{l}\tau_{l}\cdot\underset{i_{l}\in U_{0}^{l},r}{\mathbb{E}}(\gamma_{y_{i_{l}},r,\bm{\mu}_{l}})\right\rvert-\left\lvert c_{5}\underset{r}{\mathbb{E}}(\gamma_{y^{\prime},r,\bm{\mu}_{1}})-c_{6}\sum_{l}\tau_{l}\cdot\underset{i_{l}\in U_{0}^{l},r}{\mathbb{E}}(\gamma_{y_{i_{l}},r,\bm{\mu}_{l}})\right\rvert>\max_{j,r,l}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\right\rangle\right|\right\}\right\},\tag{74}$$

    where the left-hand side of the inequality compares, across the two samples, the disparity between the learning progress on each sample and that on the labeled training set.

Proof of Lemma H.6. The first bullet point can be easily derived from (64), while the second bullet point is readily apparent from (68), (69), and (70).

Similar to the discussions in Appendix G.4, it is observed that for any $p\in[1,\infty)$, there exists a shared sufficient event for (73) and (74). This implies that it is also a shared sufficient event for the events $\Omega_{C}$ and $\Omega_{D}$, denoted as $\Omega_{\gamma}$:

$$\Omega_{\gamma}:=\left\{\max_{j,r,l}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\right\rangle\right|\right\}<\Theta\left(\underset{r}{\mathbb{E}}(\gamma_{y^{\prime},r,\bm{\mu}_{1}})-\underset{r}{\mathbb{E}}(\gamma_{y,r,\bm{\mu}_{2}})\right)\right\}.$$

By the first inference statement of Proposition H.3, we have

$$\Omega_{\gamma}=\left\{\max_{j,r,\bm{\mu}_{l}}\left\{\left|\left\langle\mathbf{w}_{j,r}^{(t)},\mathbf{z}_{l}\right\rangle\right|\right\}<\Theta\left(\underset{j,r}{\mathbb{E}}(\gamma_{j,r,\bm{\mu}_{1}})-\underset{j,r}{\mathbb{E}}(\gamma_{j,r,\bm{\mu}_{2}})\right)\right\}.\tag{75}$$

Therefore, we can conclude that the significant difference in the model's learning of the features $\bm{\mu}_{1}$ and $\bm{\mu}_{2}$ is what causes the sufficient event for both events $\Omega_{C}$ and $\Omega_{D}$. By (72), we have:

$$P(\Omega_{\gamma})\geq 1-8m\exp\left\{-\Theta\left(\underset{j,r}{\mathbb{E}}(\gamma_{j,r,\bm{\mu}_{1}})-\underset{j,r}{\mathbb{E}}(\gamma_{j,r,\bm{\mu}_{2}})\right)\right\}.\tag{76}$$

Based on Proposition H.3, we see that $\underset{j,r}{\mathbb{E}}(\gamma_{j,r,\bm{\mu}_{1}})$ is significantly larger than $\underset{j,r}{\mathbb{E}}(\gamma_{j,r,\bm{\mu}_{2}})$ under our conditions, which gives rise to the sufficient event $\Omega_{\gamma}$.

Similar to Lemma 4.4 for linearly separable data, we also have a conclusion regarding the order of the pool for XOR data.

Lemma H.7.

Under Condition C.3, suppose the results of Proposition 3.2 and Proposition H.4 hold at the initial stage and the querying stage at a certain $t\leq T^{*}$. Denote by $\mathbf{X}_{\mathcal{P}}^{1}\subsetneqq\mathcal{P}$ the collection of all data points in $\mathcal{P}$ with the strong feature $\bm{\mu}_{1}$, and by $\mathbf{X}_{\mathcal{P}}^{2}\subsetneqq\mathcal{P}$ the collection of data points with the weak feature $\bm{\mu}_{2}$. Then, with probability more than $1-\Theta(\delta^{\prime})$, $\mathbf{X}_{\mathcal{P}}^{1}\prec^{(t)}\mathbf{X}_{\mathcal{P}}^{2}$ holds.

Proof of Lemma H.7. See the proof of Lemma G.19.

Similar to Lemma G.20, we directly have the following lemma, which demonstrates that both NAL algorithms prioritize the perplexing samples.

Lemma H.8.

(Formal Restatement of Proposition C.5) Under the same conditions as in Proposition C.5, the Uncertainty Order and Diversity Order of the samples $[(y\cdot\bm{\mu}_{l})^{T},\mathbf{\xi}^{T}]^{T}$ in the sampling pool $\mathcal{P}$ follow the order of $\underset{j,k,l}{\mathbb{E}}\gamma_{j,k,\bm{\mu}_{l}}^{(t)}$.

H.4 Label Complexity-based Test Error Analysis: XOR data version

The underlying philosophy in this section is the same as that in Appendix G.5 for the theory regarding linearly separable data. We assume that the results obtained in the previous section hold, which is the case with high probability. By considering the scale of the coefficients, the inner products, and the order of the data in the sampling pool $\mathcal{P}$, we can now examine the upper and lower bounds of the test error under different conditions before and after querying.

Lemma H.9.

Under Condition C.3, for a test set $\mathcal{D}^{*}\subseteq\mathcal{D}^{*}$ in which the $\bm{\mu}_{2}$-equipped data occur with probability $p^{*}$, there exists $t=\widetilde{O}\left(\eta^{-1}\varepsilon^{-1}m\widetilde{n}d^{-1}\sigma_{p}^{-2}\right)$ such that the following hold before and after querying (i.e., $\forall s\in\{0,1\}$):

  • For $t=\Omega\left(\widetilde{n}m/\left(\eta\sigma_{p}^{2}d\varepsilon\right)\right)$, the training loss converges: $L_{S}\left(\mathbf{W}^{(t)}\right)\leq\varepsilon$.

  • If $\forall l\in\{1,2\},\ n_{s,l}\geq\dfrac{\hat{C}_{1}\sigma_{p}^{4}d}{\|\bm{\mu}_{l}\|_{2}^{4}}$ holds, we have the test error:

    $$L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)\leq(1-p^{*})\cdot\exp\left(\dfrac{-n_{s,1}\|\bm{\mu}_{1}\|_{2}^{4}}{\hat{C}_{3}\sigma_{p}^{4}d}\right)+p^{*}\cdot\exp\left(\dfrac{-n_{s,2}\|\bm{\mu}_{2}\|_{2}^{4}}{\hat{C}_{4}\sigma_{p}^{4}d}\right).\tag{77}$$

  • If $\exists l^{\prime}\in\{1,2\},\ n_{s,l^{\prime}}\leq\dfrac{\hat{C}_{2}\sigma_{p}^{4}d}{\|\bm{\mu}_{l^{\prime}}\|_{2}^{4}}$ holds, where $\hat{C}_{1}$ is from Condition 3.1, we have the test error

    $$L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)\geq 0.12\cdot\tau^{*}_{l^{\prime}}.\tag{78}$$

Here $\tau^{*}_{l^{\prime}}$ denotes the occurrence probability of feature $\bm{\mu}_{l^{\prime}}$, and $\hat{C}_{1}$, $\hat{C}_{2}$, $\hat{C}_{3}$ and $\hat{C}_{4}$ are some positive constants.
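To make the role of the per-feature label counts $n_{s,1}$ and $n_{s,2}$ in (77) concrete, the following minimal sketch evaluates the right-hand side of (77) for two hypothetical labeled sets of the same total size: one dominated by $\bm{\mu}_{1}$-equipped samples, and one that contains more $\bm{\mu}_{2}$-equipped samples. The function name, the stand-in constants `c3` and `c4` (for $\hat{C}_{3}$ and $\hat{C}_{4}$), and every numeric value are assumptions made only for illustration. The sketch evaluates the bound only and does not check the precondition on $n_{s,l}$.

```python
import math


def zero_one_error_upper_bound(n_s1, n_s2, mu1_norm, mu2_norm, sigma_p, d,
                               p_star, c3=1.0, c4=1.0):
    """Right-hand side of (77); c3 and c4 stand in for the unspecified constants."""
    scale = sigma_p ** 4 * d
    term_strong = (1.0 - p_star) * math.exp(-n_s1 * mu1_norm ** 4 / (c3 * scale))
    term_weak = p_star * math.exp(-n_s2 * mu2_norm ** 4 / (c4 * scale))
    return term_strong + term_weak


# Hypothetical setting: strong feature norm 2, weak feature norm 1.
common = dict(mu1_norm=2.0, mu2_norm=1.0, sigma_p=1.0, d=10, p_star=0.2)

# Same label budget of 100, split differently between the two features.
mostly_strong = zero_one_error_upper_bound(n_s1=90, n_s2=10, **common)
more_weak = zero_one_error_upper_bound(n_s1=50, n_s2=50, **common)

print(f"bound with 10 weak-feature labels: {mostly_strong:.6f}")
print(f"bound with 50 weak-feature labels: {more_weak:.6f}")
```

In this toy setting the term associated with the weak feature dominates the bound, so shifting labels toward $\bm{\mu}_{2}$-equipped samples reduces it, which is in line with the lemma's message about prioritizing samples with yet-to-be-learned features.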

Proof of Lemma H.9. The proof follows that of Theorem 3.2 in Meng et al. [2023], except that we consider two features. For the training convergence, by Proposition H.2 we have

$$\begin{aligned}
y_{i}f\left(\mathbf{W}^{(t)},\mathbf{x}_{i}\right)&\geq-\frac{\kappa}{2}+\frac{1}{m}\sum_{r=1}^{m}\bar{\rho}_{y_{i},r,i}^{(t)}\\
&\geq-\frac{\kappa}{2}+\underline{x}_{t}\\
&\geq-\kappa+\log\left(\Theta\left(\frac{\eta\sigma_{p}^{2}d}{n_{s}m}\right)t+\frac{2}{3}\right).
\end{aligned}$$

Recall that $\kappa$ is defined in (58). Here, the first inequality follows from the conclusion of Proposition H.2, and the second and last inequalities follow from (59) in Proposition H.2. Then we have

$$L\left(\mathbf{W}^{(t)}\right)\leq\log\left(1+\exp\{\kappa\}/\left(\Theta\left(\frac{\eta\sigma_{p}^{2}d}{n_{s}m}\right)t+\frac{2}{3}\right)\right)\leq\frac{e^{\kappa}}{\Theta\left(\frac{\eta\sigma_{p}^{2}d}{n_{s}m}\right)t+\frac{2}{3}}\leq\frac{e^{\kappa}}{2/\varepsilon+\frac{2}{3}}\leq\varepsilon$$

The last three inequalities follow, respectively, from $\log(1+x)\leq x$, $t\geq\Omega\left(\frac{\widetilde{n}m}{\eta\sigma_{p}^{2}d\varepsilon}\right)$ and $\exp\{\kappa\}\leq 1.5$.
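The chain of inequalities above can be checked numerically. The following minimal sketch computes the four successive quantities for given $\kappa$ and $\varepsilon$, and verifies that they are non-decreasing. The function name `loss_chain` and the parameter `rate_times_t` are hypothetical; `rate_times_t` stands for the product $\Theta\left(\frac{\eta\sigma_{p}^{2}d}{n_{s}m}\right)t$, which is assumed to be at least $2/\varepsilon$, as the third step of the chain requires.

```python
import math


def loss_chain(kappa, eps, rate_times_t=None):
    """Return the four successive quantities in the training-loss chain.

    rate_times_t stands for Theta(eta * sigma_p^2 * d / (n_s * m)) * t. By
    default it takes the smallest value, 2 / eps, used in the third step.
    """
    if rate_times_t is None:
        rate_times_t = 2.0 / eps
    denom = rate_times_t + 2.0 / 3.0
    x = math.exp(kappa) / denom
    worst_case = math.exp(kappa) / (2.0 / eps + 2.0 / 3.0)
    return math.log1p(x), x, worst_case, eps


# Check the chain for the largest admissible kappa, i.e. exp(kappa) = 1.5.
for eps in (1e-3, 0.1, 1.0):
    a, b, c, e = loss_chain(math.log(1.5), eps)
    print(f"eps={eps}: {a:.6f} <= {b:.6f} <= {c:.6f} <= {e:.6f}")
    assert a <= b <= c <= e
```

The final step holds because $1.5/(2/\varepsilon+2/3)=1.5\varepsilon/(2+2\varepsilon/3)\leq 0.75\,\varepsilon$.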

To evaluate the test error, using the same techniques as in Lemma G.21, we have

$$\begin{aligned}
L_{\mathcal{D}^{*}}^{0-1}(\mathbf{W})&=\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}^{*}}[y\cdot f(\mathbf{W},\mathbf{x})<0]\\
&=(1-p^{*})\cdot\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}_{\bm{\mu}_{1}}^{*}}[y\cdot f(\mathbf{W},\mathbf{x})<0]+p^{*}\cdot\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}_{\bm{\mu}_{2}}^{*}}[y\cdot f(\mathbf{W},\mathbf{x})<0],
\end{aligned}\tag{79}$$

where 𝒟𝝁1superscriptsubscript𝒟subscript𝝁1\mathcal{D}_{\bm{\mu}_{1}}^{*}caligraphic_D start_POSTSUBSCRIPT bold_italic_μ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT and 𝒟𝝁2superscriptsubscript𝒟subscript𝝁2\mathcal{D}_{\bm{\mu}_{2}}^{*}caligraphic_D start_POSTSUBSCRIPT bold_italic_μ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT denotes the collection of data points in 𝒟𝒟\mathcal{D}caligraphic_D containing feature 𝝁1subscript𝝁1\bm{\mu}_{1}bold_italic_μ start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and 𝝁2subscript𝝁2\bm{\mu}_{2}bold_italic_μ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT, respectively. Notably, (𝐱,y)𝒟𝝁l[yf(𝐖,𝐱)<0]subscriptsimilar-to𝐱𝑦superscriptsubscript𝒟subscript𝝁𝑙delimited-[]𝑦𝑓𝐖𝐱0\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}_{\bm{\mu}_{l}}^{*}}[y\cdot f(\mathbf% {W},\mathbf{x})<0]blackboard_P start_POSTSUBSCRIPT ( bold_x , italic_y ) ∼ caligraphic_D start_POSTSUBSCRIPT bold_italic_μ start_POSTSUBSCRIPT italic_l end_POSTSUBSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT [ italic_y ⋅ italic_f ( bold_W , bold_x ) < 0 ] is equal to

$$\sum_{\bm{\mu}\in\{\pm\mathbf{u}_{l},\pm\mathbf{v}_{l}\}}P\left(yf\left(\mathbf{W}^{(t)},\mathbf{x}\right)<0\mid\mathbf{x}_{\text{signal part}}=\bm{\mu}\right)\cdot\frac{1}{4},$$

so without loss of generality it suffices to investigate

$$P\left(1\cdot f\left(\mathbf{W}^{(t)},\mathbf{x}\right)<0\mid\mathbf{x}_{\text{signal part}}=\mathbf{u}_{l}\right),\quad\forall l\in\{1,2\},$$

and the proofs for the other cases (i.e., $\bm{\mu}\in\{-\mathbf{u}_{1},-\mathbf{u}_{2},\pm\mathbf{v}_{1},\pm\mathbf{v}_{2}\}$) are the same. Denote the feature patch in $\mathbf{x}$ as $\mathbf{u}_{l_{x}}$ ($l_{x}\in\{1,2\}$); when $\mathbf{x}=\left(\mathbf{u}_{l_{x}}^{\top},\bm{\xi}^{\top}\right)^{\top}$, the true label is $y=+1$. In this case, we have

$$\begin{aligned}
1\cdot f\left(\mathbf{W}^{(t)},\mathbf{x}\right)&=\frac{1}{m}\sum_{r=1}^{m}\left(F_{+1,r}\left(\mathbf{W}^{(t)},\mathbf{u}_{l_{x}}\right)+F_{+1,r}\left(\mathbf{W}^{(t)},\bm{\xi}\right)\right)-\frac{1}{m}\sum_{r=1}^{m}\left(F_{-1,r}\left(\mathbf{W}^{(t)},\mathbf{u}_{l_{x}}\right)+F_{-1,r}\left(\mathbf{W}^{(t)},\bm{\xi}\right)\right)\\
&\geq\frac{1}{m}\left[\sum_{r}\sigma\left(\left\langle\mathbf{w}_{+1,r}^{(t)},\mathbf{u}_{l_{x}}\right\rangle\right)-\sum_{r}\sigma\left(\left\langle\mathbf{w}_{-1,r}^{(t)},\bm{\xi}\right\rangle\right)\right].
\end{aligned}$$

We can then apply exactly the same techniques as in Lemma G.21. Recall that $g(\bm{\xi})$ denotes $\sum_{r}\sigma\left(\left\langle\mathbf{w}_{-y,r}^{(t)},\bm{\xi}\right\rangle\right)$ and that, as in (48),

$$\mathbb{E}g(\bm{\xi})=\sum_{r=1}^{m}\mathbb{E}\sigma\left(\left\langle\mathbf{w}_{-y,r}^{(t)},\bm{\xi}\right\rangle\right)=\sum_{r=1}^{m}\frac{\left\|\mathbf{w}_{-y,r}^{(t)}\right\|_{2}\sigma_{p}}{\sqrt{2\pi}}=\frac{\sigma_{p}}{\sqrt{2\pi}}\sum_{r=1}^{m}\left\|\mathbf{w}_{-y,r}^{(t)}\right\|_{2}. \tag{80}$$
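The closed form in (80) is the mean of a rectified Gaussian: for $\bm{\xi}\sim\mathcal{N}(\mathbf{0},\sigma_p^2\mathbf{I})$, the inner product $\langle\mathbf{w},\bm{\xi}\rangle$ is Gaussian with standard deviation $\sigma_p\|\mathbf{w}\|_2$, and its positive part has mean $\sigma_p\|\mathbf{w}\|_2/\sqrt{2\pi}$. The following is a minimal numerical sanity check, not part of the proof. The weight vectors, the value of `sigma_p`, and the sample count are hypothetical placeholders chosen only for illustration, and the noise is assumed to be isotropic Gaussian.

```python
import math
import random


def expected_g_closed_form(weights, sigma_p):
    """Closed form from (80): sigma_p / sqrt(2*pi) * sum_r ||w_r||_2."""
    norms = [math.sqrt(sum(v * v for v in w)) for w in weights]
    return sigma_p / math.sqrt(2.0 * math.pi) * sum(norms)


def expected_g_monte_carlo(weights, sigma_p, num_samples, seed=0):
    """Monte Carlo estimate of E[sum_r ReLU(<w_r, xi>)], xi ~ N(0, sigma_p^2 I)."""
    rng = random.Random(seed)
    d = len(weights[0])
    total = 0.0
    for _ in range(num_samples):
        xi = [rng.gauss(0.0, sigma_p) for _ in range(d)]
        for w in weights:
            inner = sum(a * b for a, b in zip(w, xi))
            total += max(inner, 0.0)  # ReLU activation
    return total / num_samples


# Hypothetical weights (m = 3 neurons, d = 3) and noise level, for illustration only.
weights = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5], [0.0, 0.3, 0.4]]
sigma_p = 0.7

closed = expected_g_closed_form(weights, sigma_p)
estimate = expected_g_monte_carlo(weights, sigma_p, num_samples=100_000)
print(f"closed form: {closed:.4f}, Monte Carlo: {estimate:.4f}")
```

The two numbers should agree up to Monte Carlo error, which shrinks as `num_samples` grows.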

We then obtain the following upper bound on the test error over $\mathcal{D}_{\mathbf{u}_{l_{x}}}^{*}$ by subtracting $\mathbb{E}g(\bm{\xi})=\dfrac{\sigma_{p}}{\sqrt{2\pi}}\sum_{r=1}^{m}\left\|\mathbf{w}_{-y,r}^{(t)}\right\|_{2}$ from both sides of the inequality inside the probability:

$$\begin{aligned}
\mathbb{P}_{(\mathbf{x},+1)\sim\mathcal{D}_{\mathbf{u}_{l_{x}}}^{*}}\left(1\cdot f\left(\mathbf{W}^{(t)},\mathbf{x}\right)\leq 0\right)&\leq P\left(\sum_{r}\sigma\left(\left\langle\mathbf{w}_{-1,r}^{(t)},\bm{\xi}\right\rangle\right)\geq\sum_{r}\sigma\left(\left\langle\mathbf{w}_{1,r}^{(t)},\mathbf{u}_{l_{x}}\right\rangle\right)\right)\\
&=P\left(g(\bm{\xi})-\mathbb{E}g(\bm{\xi})\geq\sum_{r}\sigma\left(\left\langle\mathbf{w}_{1,r}^{(t)},\mathbf{u}_{l_{x}}\right\rangle\right)-\frac{\sigma_{p}}{\sqrt{2\pi}}\sum_{r=1}^{m}\left\|\mathbf{w}_{-1,r}^{(t)}\right\|_{2}\right).
\end{aligned} \tag{81}$$

Using the results in Proposition H.3, we compare the two terms on the right-hand side of the inequality:

$$\frac{\sum_{r}\sigma\left(\left\langle\mathbf{w}_{y,r}^{(t)},y\mathbf{u}_{l_{x}}\right\rangle\right)}{\sigma_{p}\sum_{r=1}^{m}\left\|\mathbf{w}_{-1,r}^{(t)}\right\|_{2}}\geq\frac{\Theta\left(\sum_{r}\gamma_{1,r,\mathbf{u}_{l_{x}}}^{(t)}\right)}{\Theta\left(d^{-1/2}n_{s}^{-1/2}\right)\cdot\sum_{r,i}\bar{\rho}_{-1,r,i}^{(t)}}=\Theta\left(\tau_{l_{x}}d^{1/2}n_{s}^{1/2}\operatorname{SNR}_{l_{x}}^{2}\right)=\Theta\left(\tau_{l_{x}}n_{s}^{1/2}\|\mathbf{u}_{l_{x}}\|_{2}^{2}/(\sigma_{p}^{2}d^{1/2})\right), \tag{82}$$

where $\tau_{l_{x}}$ denotes the proportion of feature $\mathbf{u}_{l_{x}}$ in the current training data set (before or after querying). It is worth noting that the assumption in the first bullet gives $\forall l\in\{1,2\},\ n_{s,l}\geq\dfrac{\hat{C}_{1}\sigma_{p}^{4}d}{\|\mathbf{u}_{l}\|_{2}^{4}}$, which means $n_{1,l_{x}}\|\mathbf{u}_{1}\|_{2}^{4}\geq 2\hat{C}_{1}\sigma_{p}^{4}d,\ \forall l_{x}\in\{1,2\}$. Since $\hat{C}_{1}$ is a sufficiently large constant, it directly follows that

$$\sum_{r}\sigma\left(\left\langle\mathbf{w}_{1,r}^{(t)},\mathbf{u}_{l_{x}}\right\rangle\right)-\frac{\sigma_{p}}{\sqrt{2\pi}}\sum_{r=1}^{m}\left\|\mathbf{w}_{-1,r}^{(t)}\right\|_{2}>0.$$

As in (83), we apply the Gaussian concentration result of Theorem 5.2.2 in Vershynin [2018]:

$$P(g(\bm{\xi})-\mathbb{E}g(\bm{\xi})\geq x)\leq\exp\left(-\frac{cx^{2}}{\sigma_{p}^{2}\|g\|_{\text{Lip}}^{2}}\right), \tag{83}$$

where $c$ is a constant. To calculate the Lipschitz norm, we have

$$\begin{aligned}
\left|g(\bm{\xi})-g\left(\bm{\xi}^{\prime}\right)\right|&=\left|\sum_{r=1}^{m}\sigma\left(\left\langle\mathbf{w}_{-1,r}^{(t)},\bm{\xi}\right\rangle\right)-\sum_{r=1}^{m}\sigma\left(\left\langle\mathbf{w}_{-1,r}^{(t)},\bm{\xi}^{\prime}\right\rangle\right)\right|\\
&\leq\sum_{r=1}^{m}\left|\sigma\left(\left\langle\mathbf{w}_{-1,r}^{(t)},\bm{\xi}\right\rangle\right)-\sigma\left(\left\langle\mathbf{w}_{-1,r}^{(t)},\bm{\xi}^{\prime}\right\rangle\right)\right|\\
&\leq\sum_{r=1}^{m}\left|\left\langle\mathbf{w}_{-1,r}^{(t)},\bm{\xi}-\bm{\xi}^{\prime}\right\rangle\right|\\
&\leq\sum_{r=1}^{m}\left\|\mathbf{w}_{-1,r}^{(t)}\right\|_{2}\cdot\left\|\bm{\xi}-\bm{\xi}^{\prime}\right\|_{2},
\end{aligned}$$

where the first inequality is by the triangle inequality, the second is by the 1-Lipschitz property of ReLU, and the last is by the Cauchy-Schwarz inequality. Therefore, we have

$$\|g\|_{\text{Lip}}\leq\sum_{r=1}^{m}\left\|\mathbf{w}_{-1,r}^{(t)}\right\|_{2}. \tag{84}$$
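The Lipschitz bound (84) can be probed numerically: for any pair of noise vectors, the ratio $|g(\bm{\xi})-g(\bm{\xi}^{\prime})|/\|\bm{\xi}-\bm{\xi}^{\prime}\|_2$ should never exceed $\sum_{r}\|\mathbf{w}_{-1,r}^{(t)}\|_2$. The sketch below checks this on random pairs. The weight vectors, `sigma_p`, and the number of pairs are hypothetical values used only for illustration, and a random search of this kind can only fail to find a counterexample; it does not replace the argument above.

```python
import math
import random


def g(weights, xi):
    """g(xi) = sum_r ReLU(<w_r, xi>), as used in the concentration argument."""
    return sum(max(sum(a * b for a, b in zip(w, xi)), 0.0) for w in weights)


def lipschitz_constant_bound(weights):
    """Right-hand side of (84): sum_r ||w_r||_2."""
    return sum(math.sqrt(sum(a * a for a in w)) for w in weights)


def largest_observed_slope(weights, sigma_p, num_pairs, seed=0):
    """Largest |g(xi) - g(xi')| / ||xi - xi'||_2 over random Gaussian pairs."""
    rng = random.Random(seed)
    d = len(weights[0])
    worst = 0.0
    for _ in range(num_pairs):
        xi = [rng.gauss(0.0, sigma_p) for _ in range(d)]
        xi_prime = [rng.gauss(0.0, sigma_p) for _ in range(d)]
        dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xi_prime)))
        if dist == 0.0:
            continue  # skip degenerate pairs
        slope = abs(g(weights, xi) - g(weights, xi_prime)) / dist
        worst = max(worst, slope)
    return worst


# Hypothetical weights (m = 3 neurons, d = 3), for illustration only.
weights = [[0.5, -1.0, 2.0], [1.5, 0.0, -0.5], [0.0, 0.3, 0.4]]

bound = lipschitz_constant_bound(weights)
slope = largest_observed_slope(weights, sigma_p=0.7, num_pairs=20_000)
print(f"largest observed slope: {slope:.4f}, bound from (84): {bound:.4f}")
```

The observed slope is typically well below the bound, since (84) sums worst cases over neurons that cannot all be attained by a single direction.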

Applying (83) and (84) to (81), we have

$$\begin{aligned}
\mathbb{P}_{(\mathbf{x},+1)\sim\mathcal{D}_{\mathbf{u}_{l_{x}}}^{*}}\left(1\cdot f\left(\mathbf{W}^{(t)},\mathbf{x}\right)\leq 0\right)&\leq\exp\left[-\frac{c\left(\sum_{r}\sigma\left(\left\langle\mathbf{w}_{1,r}^{(t)},\mathbf{u}_{l_{x}}\right\rangle\right)-\dfrac{\sigma_{p}}{\sqrt{2\pi}}\sum_{r=1}^{m}\left\|\mathbf{w}_{-1,r}^{(t)}\right\|_{2}\right)^{2}}{\sigma_{p}^{2}\left(\sum_{r=1}^{m}\left\|\mathbf{w}_{-1,r}^{(t)}\right\|_{2}\right)^{2}}\right]\\
&=\exp\left[-c\left(\frac{\sum_{r}\sigma\left(\left\langle\mathbf{w}_{1,r}^{(t)},\mathbf{u}_{l_{x}}\right\rangle\right)}{\sigma_{p}\sum_{r=1}^{m}\left\|\mathbf{w}_{-1,r}^{(t)}\right\|_{2}}-\frac{1}{\sqrt{2\pi}}\right)^{2}\right]\\
&\leq\exp\left(\frac{c}{2\pi}\right)\exp\left(-0.5c\left(\frac{\sum_{r}\sigma\left(\left\langle\mathbf{w}_{1,r}^{(t)},\mathbf{u}_{l_{x}}\right\rangle\right)}{\sigma_{p}\sum_{r=1}^{m}\left\|\mathbf{w}_{-1,r}^{(t)}\right\|_{2}}\right)^{2}\right),
\end{aligned} \tag{85}$$

where the last inequality follows from $(s-t)^{2}\geq s^{2}/2-t^{2},\ \forall s,t\geq 0$, which is equivalent to $(s/\sqrt{2}-\sqrt{2}\,t)^{2}\geq 0$. Then, by (82) and (85), we have

$$\begin{aligned}
\mathbb{P}_{(\mathbf{x},+1)\sim\mathcal{D}_{\mathbf{u}_{l_{x}}}^{*}}\left(1\cdot f\left(\mathbf{W}^{(t)},\mathbf{x}\right)\leq 0\right)&\leq\exp\left(\frac{c}{2\pi}\right)\exp\left(-0.5c\left(\frac{\sum_{r}\sigma\left(\left\langle\mathbf{w}_{1,r}^{(t)},\mathbf{u}_{l_{x}}\right\rangle\right)}{\sigma_{p}\sum_{r=1}^{m}\left\|\mathbf{w}_{-1,r}^{(t)}\right\|_{2}}\right)^{2}\right)\\
&=\exp\left(\frac{c}{2\pi}-\frac{\tau_{l_{x}}n_{s,l_{x}}\|\mathbf{u}_{l_{x}}\|_{2}^{4}}{\hat{C}\sigma_{p}^{4}d}\right)\\
&=\exp\left(\frac{c}{2\pi}-\frac{n_{s,l_{x}}\|\mathbf{u}_{l_{x}}\|_{2}^{4}}{\hat{C}_{l_{x}}\sigma_{p}^{4}d}\right)\\
&\leq\exp\left(-\frac{n_{s,l_{x}}\|\mathbf{u}_{l_{x}}\|_{2}^{4}}{2\hat{C}_{l_{x}}\sigma_{p}^{4}d}\right),
\end{aligned} \tag{86}$$

where $\hat{C}_{l_{x}}=\hat{C}/\tau_{l_{x}}=O(1)$; the last inequality holds if we choose $\hat{C}_{1}\geq c\hat{C}_{l_{x}}/\pi$, for any $l_{x}\in\{1,2\}$. If we choose $\hat{C}_{3}$ as $2\hat{C}_{l_{1}}$ and $\hat{C}_{4}$ as $2\hat{C}_{l_{2}}$, by (79) and (86) we have

$$L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)\leq(1-p^{*})\cdot\exp\left(\dfrac{-n_{s,1}\|\mathbf{u}_{1}\|_{2}^{4}}{\hat{C}_{3}\sigma_{p}^{4}d}\right)+p^{*}\cdot\exp\left(\dfrac{-n_{s,2}\|\mathbf{u}_{2}\|_{2}^{4}}{\hat{C}_{4}\sigma_{p}^{4}d}\right).$$
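As a purely illustrative aid, and not part of the proof, the right-hand side of this bound can be evaluated numerically. The sketch below is a minimal example under stated assumptions: the function name `zero_one_error_upper_bound`, its parameter names, and the default values `c3 = c4 = 1.0` are hypothetical, since the analysis only asserts that the constants $\hat{C}_{3}$ and $\hat{C}_{4}$ exist.

```python
import math


def zero_one_error_upper_bound(p_star, n_s1, n_s2, u1_norm, u2_norm,
                               sigma_p, d, c3=1.0, c4=1.0):
    """Evaluate (1 - p*) exp(-n_{s,1}||u_1||^4 / (C3 sigma_p^4 d))
    + p* exp(-n_{s,2}||u_2||^4 / (C4 sigma_p^4 d)).

    c3 and c4 stand in for the unspecified constants C_3 and C_4;
    the defaults are placeholders, not values from the paper.
    """
    scale = sigma_p ** 4 * d
    strong_term = (1.0 - p_star) * math.exp(-n_s1 * u1_norm ** 4 / (c3 * scale))
    weak_term = p_star * math.exp(-n_s2 * u2_norm ** 4 / (c4 * scale))
    return strong_term + weak_term
```

Holding the other arguments fixed, increasing `n_s2` shrinks only the second term, which mirrors the role of the weak-feature sample count in the bound.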

Next, we prove the test error lower bound. As in the proof of Lemma G.21, we use the pigeonhole principle technique of Kou et al. [2023b] and Meng et al. [2023], which is based on the following two lemmas.

Lemma H.10.

For $t\in\left[T_{1},T^{*}\right]$, denote $g(\bm{\xi})=\sum_{j,r}\sigma\left(\left\langle\mathbf{w}_{j,r}^{(t)},\bm{\xi}\right\rangle\right)$. There exists a fixed vector $\mathbf{v}_{l}$ with $\|\mathbf{v}_{l}\|_{2}\leq 0.01\sigma_{p}$ and a constant $\hat{C}_{6}$ such that

$$\sum_{j^{\prime}\in\{\pm 1\}}\left[g\left(j^{\prime}\bm{\xi}+\mathbf{v}_{l}\right)-g\left(j^{\prime}\bm{\xi}\right)\right]\geq 4\hat{C}_{6}\max_{j,l}\left\{\sum_{r}\gamma_{j,r,\bm{\mu}_{l}}^{(t)}\right\},$$

for all $\bm{\xi}\in\mathbb{R}^{d}$.

Proof of Lemma H.10. See Lemma 5.8 in Kou et al. [2023b] or Theorem 3.2 in Meng et al. [2023] for a proof, where we use a large enough $\hat{C}_{2}$ in the condition given in the second bullet point ($n_{s,l^{\prime}}\leq\dfrac{\hat{C}_{2}\sigma_{p}^{4}d}{\|\bm{\mu}_{l^{\prime}}\|_{2}^{4}}$) to control the norm of $\mathbf{v}_{l}$.

Lemma H.11.

(Proposition 2.1 in Devroye et al. [2023]). The TV distance between $\mathcal{N}\left(0,\sigma_{p}^{2}\mathbf{I}_{d}\right)$ and $\mathcal{N}\left(\mathbf{v}_{l},\sigma_{p}^{2}\mathbf{I}_{d}\right)$ is smaller than $\|\mathbf{v}_{l}\|_{2}/(2\sigma_{p})$.

Proof of Lemma H.11. See Proposition 2.1 in Devroye et al. [2023] for a proof.
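For intuition, the bound in Lemma H.11 can be checked numerically. The sketch below relies on a standard closed form that is not stated in this paper and is used here as an assumption of the example: for two Gaussians with the same isotropic covariance, the TV distance equals $2\Phi\left(\|\mathbf{v}_{l}\|_{2}/(2\sigma_{p})\right)-1=\operatorname{erf}\left(\|\mathbf{v}_{l}\|_{2}/(2\sqrt{2}\sigma_{p})\right)$. The function names are hypothetical.

```python
import math


def gaussian_shift_tv(v_norm, sigma_p):
    """Exact TV distance between N(0, sigma_p^2 I) and N(v, sigma_p^2 I).

    Uses the standard closed form 2*Phi(||v|| / (2*sigma_p)) - 1,
    rewritten with the error function (assumption of this sketch).
    """
    return math.erf(v_norm / (2.0 * math.sqrt(2.0) * sigma_p))


def lemma_h11_bound(v_norm, sigma_p):
    """Upper bound ||v||_2 / (2*sigma_p) quoted in Lemma H.11."""
    return v_norm / (2.0 * sigma_p)
```

With $\|\mathbf{v}_{l}\|_{2}=0.01\sigma_{p}$, as in Lemma H.10, both quantities fall well below the value 0.02 used later in the argument.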

We now examine $L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)$. By (79), we have:

\begin{align*}
L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)
&=\tau^{*}_{1}\cdot\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}_{\bm{\mu}_{1}}^{*}}\left[y\cdot f(\mathbf{W},\mathbf{x})<0\right]+\tau^{*}_{2}\cdot\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}_{\bm{\mu}_{2}}^{*}}\left[y\cdot f(\mathbf{W},\mathbf{x})<0\right] \tag{87}\\
&\geq\tau^{*}_{l^{\prime}}\cdot\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}_{\bm{\mu}_{l^{\prime}}}^{*}}\left[y\cdot f(\mathbf{W},\mathbf{x})<0\right]\\
&\geq 0.5\tau^{*}_{l^{\prime}}\cdot\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}_{\bm{\mu}_{l^{\prime}}}^{*}}\Bigg(\left|\sum_{r}\sigma\left(\left\langle\mathbf{w}_{1,r}^{(t)},\bm{\xi}\right\rangle\right)-\sum_{r}\sigma\left(\left\langle\mathbf{w}_{-1,r}^{(t)},\bm{\xi}\right\rangle\right)\right|\\
&\phantom{\geq 0.5\tau^{*}_{l^{\prime}}\cdot\mathbb{P}\Bigg(}\geq\hat{C}_{6}\max\left\{\sum_{r}\gamma_{1,r,\bm{\mu}_{l^{\prime}}}^{(t)},\sum_{r}\gamma_{-1,r,\bm{\mu}_{l^{\prime}}}^{(t)}\right\}\Bigg)\\
&=0.5\tau^{*}_{l^{\prime}}\cdot P(\Omega_{\bm{\xi}}),
\end{align*}

where $\Omega_{\bm{\xi}}:=\left\{\bm{\xi}\;\middle|\;|g(\bm{\xi})|\geq\hat{C}_{6}\max\left\{\sum_{r}\gamma_{1,r,\bm{\mu}_{l^{\prime}}}^{(t)},\sum_{r}\gamma_{-1,r,\bm{\mu}_{l^{\prime}}}^{(t)}\right\}\right\}$.
The last inequality holds because, for any given $\bm{\xi}$ at which $\left|\sum_{r}\sigma\left(\left\langle\mathbf{w}_{1,r}^{(t)},\bm{\xi}\right\rangle\right)-\sum_{r}\sigma\left(\left\langle\mathbf{w}_{-1,r}^{(t)},\bm{\xi}\right\rangle\right)\right|$ is large enough, there is always a corresponding label $y$ for which the prediction is wrong.

Next, we seek a lower bound on $P(\Omega_{\bm{\xi}})$. By Lemma H.10, we have $\sum_{j}\left[g(j\bm{\xi}+\mathbf{v}_{l})-g(j\bm{\xi})\right]\geq 4\hat{C}_{6}\max_{j,l}\left\{\sum_{r}\gamma_{j,r,\bm{\mu}_{l}}^{(t)}\right\}$. Then, by the pigeonhole principle, at least one of $\bm{\xi}$, $\bm{\xi}+\mathbf{v}_{l}$, $-\bm{\xi}$, $-\bm{\xi}+\mathbf{v}_{l}$ belongs to $\Omega_{\bm{\xi}}$.
So we have proved that $\Omega_{\bm{\xi}}\cup\left(-\Omega_{\bm{\xi}}\right)\cup\left(\Omega_{\bm{\xi}}-\{\mathbf{v}_{l}\}\right)\cup\left(-\Omega_{\bm{\xi}}-\{\mathbf{v}_{l}\}\right)=\mathbb{R}^{d}$. Therefore at least one of $P(\Omega_{\bm{\xi}})$, $P(-\Omega_{\bm{\xi}})$, $P(\Omega_{\bm{\xi}}-\{\mathbf{v}_{l}\})$, $P(-\Omega_{\bm{\xi}}-\{\mathbf{v}_{l}\})$ is at least 0.25. By the definition of TV distance, we have:

\begin{align*}
\left|P(\Omega_{\bm{\xi}})-P(\Omega_{\bm{\xi}}-\mathbf{v}_{l})\right|
&=\left|\mathbb{P}_{\bm{\xi}\sim\mathcal{N}\left(0,\sigma_{p}^{2}\mathbf{I}_{d}\right)}(\bm{\xi}\in\Omega_{\bm{\xi}})-\mathbb{P}_{\bm{\xi}\sim\mathcal{N}\left(\mathbf{v}_{l},\sigma_{p}^{2}\mathbf{I}_{d}\right)}(\bm{\xi}\in\Omega_{\bm{\xi}})\right|\\
&\leq\operatorname{TV}\left(\mathcal{N}\left(0,\sigma_{p}^{2}\mathbf{I}_{d}\right),\mathcal{N}\left(\mathbf{v}_{l},\sigma_{p}^{2}\mathbf{I}_{d}\right)\right)\\
&\leq\frac{\|\mathbf{v}_{l}\|_{2}}{2\sigma_{p}}\\
&\leq 0.02.
\end{align*}

Also, since $P(-\Omega_{\bm{\xi}})=P(\Omega_{\bm{\xi}})$, applying the union bound to the covering of $\mathbb{R}^{d}$ above together with this TV estimate gives $4P(\Omega_{\bm{\xi}})\geq 1-2\cdot 0.02$, i.e., $P(\Omega_{\bm{\xi}})\geq 0.24$. Thus $L_{\mathcal{D}^{*}}^{0-1}\left(\mathbf{W}^{(t)}\right)\geq 0.5\tau^{*}_{l^{\prime}}\cdot 0.24=0.12\cdot\tau^{*}_{l^{\prime}}$. This completes the proof of Lemma H.9.
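The arithmetic of this final step can be traced with a few lines of Python. This is only a sketch for the reader: the function name `error_lower_bound` and its parameters are hypothetical, and the default `tv_gap=0.02` is the TV estimate used in the text.

```python
def error_lower_bound(tau_star, tv_gap=0.02):
    """Reproduce the last step: four sets cover R^d, two of them have
    probability P(Omega) and the shifted two exceed it by at most tv_gap,
    so 4 * P(Omega) >= 1 - 2 * tv_gap.
    """
    p_omega_min = (1.0 - 2.0 * tv_gap) / 4.0
    # The test error is at least 0.5 * tau* * P(Omega).
    return 0.5 * tau_star * p_omega_min
```

Calling `error_lower_bound(1.0)` gives the coefficient 0.12 (up to floating-point rounding) that multiplies $\tau^{*}_{l^{\prime}}$ in the bound.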

Similar to the proof process in Appendix G.5, our main focus is to verify whether the NAL algorithms satisfy the condition stated in the first bullet point of Lemma H.9. Conversely, it is highly probable that Random Sampling satisfies the condition stated in the second bullet point. The following proposition validates this intuition.

Proposition H.12.

When Lemma H.7 holds, and the sampling size of the algorithm satisfies $\dfrac{\hat{C}_{1}\sigma_{p}^{4}d}{\|\bm{\mu}_{2}\|_{2}^{4}}-\dfrac{pn_{0}}{2}\leq n^{*}=\Theta(\widetilde{n}-n_{0})\leq\widetilde{n}-n_{0}$, we have the following:

  • The number of data with strong feature patch $n_{s,1}$ satisfies $n_{s,1}\geq\dfrac{\hat{C}_{1}\sigma_{p}^{4}d}{\|\bm{\mu}_{1}\|_{2}^{4}},\ \forall s\in\{0,1\}$.

  • The number of data with weak feature patch $n_{s,2}$ before querying and after Random Sampling satisfies $n_{s,2}\leq\dfrac{\hat{C}_{2}\sigma_{p}^{4}d}{\|\bm{\mu}_{2}\|_{2}^{4}},\ \forall s\in\{0,1\}$.

  • The total number of data with weak feature patch $n_{1,2}$ after Uncertainty Sampling and Diversity Sampling satisfies $n_{1,2}\geq\dfrac{\hat{C}_{1}\sigma_{p}^{4}d}{\|\bm{\mu}_{2}\|_{2}^{4}}$.

For the sake of coherence, here $\hat{C}_{1}$ and $\hat{C}_{2}$ are some constants shared with Theorem C.6.

Proof of Proposition H.12. According to the conditions stated in Definition C.1, we have $\left(1-\dfrac{3}{2}p\right)n_{0}\geq\dfrac{\hat{C}_{1}\sigma_{p}^{4}d}{\|\bm{\mu}_{1}\|_{2}^{4}}$ for a large constant $\hat{C}_{1}$. By substituting the results of $n_{p}$ for $n_{0}$ from Lemma 17, as well as the definition of $n_{s,l}$, we obtain the following:

$$n_{1,1}\geq n_{0,1}\geq\left(1-\dfrac{3}{2}p\right)n_{0}\geq\dfrac{\hat{C}_{1}\sigma_{p}^{4}d}{\|\bm{\mu}_{1}\|_{2}^{4}}.$$

For the third bullet, by Lemma 17, Lemma H.7 and the condition $n^{*}\geq\dfrac{\hat{C}_{1}\sigma_{p}^{4}d}{\|\bm{\mu}_{2}\|_{2}^{4}}-\dfrac{pn_{0}}{2}$, we have:

$$n_{1,2}\geq\dfrac{pn_{0}}{2}+n^{*}\geq\dfrac{\hat{C}_{1}\sigma_{p}^{4}d}{\|\bm{\mu}_{2}\|_{2}^{4}}.$$

Furthermore, by Lemma 17 and the condition $\widetilde{n}\leq\dfrac{2\hat{C}_{2}\sigma_{p}^{4}d}{3p\|\bm{\mu}_{2}\|_{2}^{4}}$, the second bullet point follows directly.

Based on the results of Lemma H.9 and Proposition H.12, the conclusions of Proposition C.4 and Theorem C.6 follow directly.

Appendix I Attribution of Lion Images

In Figure 1, a collection of various lion images found on Google is presented. Due to the challenge of accurately determining the copyright attribution of these images, specific acknowledgments to individual websites or sources cannot be provided here. However, we express our gratitude to all creators, and sincerely hope that they do not find any offense in the use of their work for illustrative purposes in our paper.