BWS: Best Window Selection Based on Sample Scores
for Data Pruning across Broad Ranges

Hoyong Choi    Nohyun Ki    Hye Won Chung
Abstract

Data subset selection aims to find a smaller yet informative subset of a large dataset that can approximate the full-dataset training, addressing challenges associated with training neural networks on large-scale datasets. However, existing methods tend to specialize in either high or low selection ratio regimes, lacking a universal approach that consistently achieves competitive performance across a broad range of selection ratios. We introduce a universal and efficient data subset selection method, Best Window Selection (BWS), by proposing a method to choose the best window subset from samples ordered based on their difficulty scores. This approach offers flexibility by allowing the choice of window intervals that span from easy to difficult samples. Furthermore, we provide an efficient mechanism for selecting the best window subset by evaluating its quality using kernel ridge regression. Our experimental results demonstrate the superior performance of BWS compared to other baselines across a broad range of selection ratios over datasets, including CIFAR-10/100 and ImageNet, and the scenarios involving training from random initialization or fine-tuning of pre-trained models.

Machine Learning, ICML

1 Introduction

Figure 1: Overview of the proposed method, Best Window Selection (BWS). BWS is composed of two parts, 1) generating window subsets and 2) evaluating window subsets. We first sort samples by a difficulty score (e.g., Forgetting (Toneva et al., 2019)) and generate window subsets of a fixed size while varying their starting points. We then evaluate the window subsets, by solving kernel ridge regression on the input features of each window subset and obtaining simple (linear) classifiers associated with each window subset. Finally, we evaluate the performance of these classifiers on the full training dataset to identify the best window subset achieving the highest accuracy.

In many machine learning tasks, the effectiveness of deep neural networks often relies on large-scale datasets that include a vast number of samples, enabling them to achieve state-of-the-art performances. However, working with such large datasets presents several challenges, including the high computational costs, storage requirements, and potential concerns related to privacy (Schwartz et al., 2020; Strubell et al., 2019). Data subset selection emerges as a promising approach to address these issues. This involves the careful selection of a smaller, yet highly informative, subset from the original large dataset. The goal is to find a subset with a specified selection ratio that approximates the performance of the entire dataset or incurs minimal performance loss.

Data subset selection has two primary approaches: score-based selection and optimization-based selection. Score-based selection involves defining a specific score to measure each sample’s influence (Koh & Liang, 2017), difficulty (Toneva et al., 2019; Paul et al., 2021), or consistency (Jiang et al., 2021) in training neural networks. The primary goal is to identify the most valuable or influential samples within the dataset while pruning the remaining samples that have minimal impact on the model’s generalization ability. On the other hand, optimization-based selection approaches find the optimal subset of a fixed size that can best approximate the full dataset training in terms of loss gradient or curvature by solving the associated optimization problem (Mirzasoleiman et al., 2020; Pooladzandi et al., 2022; Shin et al., 2023; Yang et al., 2023). The original optimization, which is NP-hard, is commonly approximated by submodular functions and a greedy algorithm is adopted to sequentially select the samples up to the size limit of the subset.

While the prior approaches successfully reduce dataset size in specific scenarios, there is not a single selection method that universally outperforms other baselines across broad selection ratios. To illustrate this, we conduct a benchmark comparison between two methods: Forgetting score (Toneva et al., 2019) representing the score-based selection approach, and LCMat (Shin et al., 2023) representing the optimization-based selection approach. We evaluate the test accuracy of models trained with different subset sizes of datasets, including CIFAR-10/100 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009), ranging from 1% to 90%, as selected by these two methods (Table 1). Score-based methods, which prioritize samples of high influence or difficulty, tend to initially select rare yet influential samples while excluding typical or easy samples. These methods demonstrate competitive performance, nearly matching the full-dataset training, when the selection ratio is sufficiently high (e.g., over 40% for CIFAR-10). However, they suffer significant performance degradation as the selection ratio decreases. In contrast, optimization-based methods tend to select representative samples that best approximate the full dataset training. Consequently, they achieve competitive performance even with very low selection ratios. However, their performance gains are limited as the selection ratio increases due to lack of diversity in sample selection. These findings show the variability in the criteria for an effective data subset, depending on the selection ratio, and highlight that previous methods may not be general enough to cover the entire spectrum of selection ratios.

Our key contribution is the development of a universal and efficient data selection method capable of maintaining competitive performance across a wide range of selection ratios. We introduce the Best Window Selection (BWS) method, illustrated in Fig. 1. The key idea involves ordering samples based on their difficulty-based sample scores and offering flexibility in choosing a ‘window subset’ from the ordered samples. Here, the window subset is defined as a subset consisting of samples with a contiguous ranking of difficulty. By allowing the starting point (the ranking of the hardest data in the subset) of each window subset to vary, we enable the selection of easy, moderate, or hard data subsets. We first demonstrate the existence of the best window that achieves the highest test accuracy for each subset size, and reveal that the optimal starting point for the best window varies depending on both the subset size and dataset. We then present a computationally-efficient method for selecting the best window subset without the need to evaluate models trained with each subset. We achieve this by solving a kernel ridge regression problem using samples from each window, evaluating the corresponding solution’s performance on the full training dataset, and selecting the best performing window subset.
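As a rough illustration of the selection step just described, the sketch below fits a ridge classifier on each candidate window and scores it on the full training set. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `best_window_start`, the linear kernel, the one-hot regression targets, and the regularization strength `reg` are all choices made here for illustration.

```python
import numpy as np

def best_window_start(features, labels, order, window_size, starts, reg=1.0):
    """Return (start, accuracy) of the window whose ridge classifier scores
    best on the full training set.

    features: (n, d) array; labels: (n,) integer array in [0, K);
    order: sample indices sorted from hardest to easiest (assumed given).
    """
    num_classes = int(labels.max()) + 1
    onehot = np.eye(num_classes)[labels]
    best_start, best_acc = None, -1.0
    for s in starts:
        idx = order[s:s + window_size]
        Xs, Ys = features[idx], onehot[idx]
        # Kernel ridge regression with a linear kernel (an assumption here),
        # solved in dual form: alpha = (Xs Xs^T + reg * I)^{-1} Ys.
        gram = Xs @ Xs.T
        alpha = np.linalg.solve(gram + reg * np.eye(len(idx)), Ys)
        weights = Xs.T @ alpha  # (d, K) linear classifier
        # Score the window's classifier on the full training set.
        acc = np.mean((features @ weights).argmax(axis=1) == labels)
        if acc > best_acc:
            best_start, best_acc = s, acc
    return best_start, best_acc
```

Each window costs one linear solve of size `window_size`, so no neural network has to be trained per candidate window.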

We evaluate our selection method, BWS, on CIFAR-10/100 and ImageNet, demonstrating that BWS consistently outperforms other baselines, including both score-based and optimization-based approaches, across a wide range of selection ratios from 1% to 90%. For CIFAR-10, BWS achieves a 15-30% improvement in test accuracy compared to Forgetting (Toneva et al., 2019) in the low selection ratios of 1-10%. It also demonstrates competitive performance in the high selection ratio regime, reaching up to 93% test accuracy with only a 40% data subset. BWS also consistently outperforms optimization-based techniques such as LCMat (Shin et al., 2023) and AdaCore (Pooladzandi et al., 2022), while requiring significantly lower computational costs. Furthermore, we empirically verify that BWS is effective across different model architectures, including pre-trained ViT (Dosovitskiy et al., 2021). Another significant advantage of our method is its resilience to label noise, enhancing its robustness in sample selection. Our code is publicly available at https://github.com/NohyunKi/BWS.

2 Related Works

Score-based selection

Some initial works in score-based selection use validation/test sets to quantify the effect of each training instance. Data Shapley (Ghorbani & Zou, 2019; Kwon et al., 2021; Kwon & Zou, 2022) evaluates the value of each instance by measuring the average change in validation accuracy when that instance is excluded from the dataset. Influence Function (Koh & Liang, 2017; Pruthi et al., 2020) approximates how a model’s prediction changes as individual training examples are visited. In the absence of a validation set, score-based selection quantifies the learning difficulty or consistency of samples during neural network training. Forgetting (Toneva et al., 2019) and EL2N (Paul et al., 2021) introduce a difficulty score to measure a data point’s learning difficulty. Memorization (Feldman & Zhang, 2020) and C-score (Jiang et al., 2021) aim to predict the accuracy on a sample when the full dataset is utilized, except for that sample. CG-score (Ki et al., 2023) evaluates data instances without model training by calculating the analytical gap in generalization errors when an instance is held out. These score-based methods prioritize difficult or influential samples for data subset selection. While they effectively select a subset approximating the full-dataset performance, their performance degrades significantly as the selection ratio decreases, as achieving high performance solely with difficult samples becomes challenging.

Optimization-based selection

Optimization-based selection involves formulating an optimization problem to select a coreset of a given size that can effectively approximate the diverse characteristics of the full dataset. These methods include coreset selection to approximate the training distribution by herding (Chen et al., 2010) or k-center algorithms (Sener & Savarese, 2018). Recent approaches have sought subsets of samples approximating loss gradients or curvature by CRAIG (Mirzasoleiman et al., 2020), CREST (Yang et al., 2023), and AdaCore (Pooladzandi et al., 2022). While these methods have proven effective, they are computationally demanding and necessitate full-dataset sampling at each epoch. LCMat (Shin et al., 2023) addresses this computational challenge by aligning both gradients and Hessians without requiring periodic full-dataset sampling. However, these methods often struggle to choose diverse samples, and their performance does not match that of score-based approaches, in the intermediate to high selection ratio regimes.

In contrast to these approaches, we develop a universal selection method capable of consistently identifying a high-performance subset across a wide range of selection ratios. While recent methods like Moderate-DS (Xia et al., 2023) and CCS (Zheng et al., 2023) have also aimed for universality across various selection ratios, our method outperforms these approaches, over a broad range of selection ratios, as demonstrated in Section 6. Moderate-DS selects samples closest to the median of the features of each class, while CCS prunes a $\beta$% of hard examples, with $\beta$ being a hyperparameter, and then selects samples with a uniform difficulty score distribution. Importantly, our method does not require hyperparameter tuning, such as setting $\beta$ in CCS, since we assess the quality of window subsets and efficiently find the best one using kernel ridge regression.

3 Motivation

3.1 No single method prevails over the entire range

We evaluate existing data selection methods across a wide range of selection ratios. Specifically, we benchmark two representative methods: Forgetting score (Toneva et al., 2019), representing difficulty score-based selection, and LCMat (Shin et al., 2023), representing optimization-based selection. We assess the test accuracy of models trained on subsets of CIFAR-10/100 and ImageNet, with selection ratios ranging from 1% to 90%, as summarized in Table 1. For the Forgetting score approach, we sort the samples in descending order based on their scores, defined as the number of times during training that the prediction for a sample switches from correct to incorrect, and select the top-ranking (most difficult) samples. In contrast, for LCMat, we employ an optimization to identify a subset that best approximates the loss curvature of the full dataset. We employ ResNet18 (He et al., 2016) for CIFAR-10 and ResNet50 for CIFAR-100 and ImageNet.
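The forgetting-score ranking described above can be sketched as follows. This is an illustrative sketch, assuming a per-epoch correctness record is already available; the function names are hypothetical, and the sketch does not treat samples that are never classified correctly in any special way, which some implementations do.

```python
import numpy as np

def forgetting_scores(correct):
    """correct: (epochs, n) boolean array, where correct[t, i] is True if
    sample i was classified correctly at epoch t. Returns, per sample, the
    number of correct -> incorrect transitions between consecutive epochs."""
    correct = np.asarray(correct, dtype=bool)
    forgot = correct[:-1] & ~correct[1:]
    return forgot.sum(axis=0)

def select_hardest(scores, ratio):
    """Indices of the top `ratio` fraction of samples by descending score."""
    order = np.argsort(-np.asarray(scores), kind="stable")
    return order[: int(round(ratio * len(order)))]
```

For example, a sample whose correctness over four epochs is correct, incorrect, correct, incorrect has a forgetting score of 2.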

We observe that the most effective strategy varies with the selection ratio, and no single method consistently outperforms the others across the entire range of selection ratios. Specifically, for CIFAR-10 with low subset ratios (1-30%), the optimization-based selection (LCMat) performs better than the difficulty score-based selection (Forgetting). In this regime, Forgetting even underperforms random selection. However, once the subset ratio reaches 40% and beyond, Forgetting outperforms both LCMat and random selection. Similar trends are observed for CIFAR-100 and ImageNet. Interestingly, for CIFAR-100, there is an intermediate regime where neither Forgetting nor LCMat outperforms random sampling.

These findings emphasize that the desired properties of data subsets change depending on the selection ratios. In cases of low selection ratios (sample-deficient regime), it is more beneficial to identify a representative subset that closely resembles the full dataset in terms of average loss gradients or curvature during training. However, as the selection ratio increases (sample-sufficient regime), preserving the high-scoring, rare or difficult-to-learn samples becomes more critical, as these samples are known to enhance the generalization capability of neural networks and cannot be fully captured by a representative subset that reflects only the average behavior of the full dataset.

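Table 1: Test accuracy (%) across various selection ratios for the CIFAR-10/100 and ImageNet datasets, with subsets selected using random sampling, Forgetting score (Toneva et al., 2019), and LCMat (Shin et al., 2023). The best performance among the three is highlighted in bold. The 100% column is the full-dataset accuracy, shared by all three methods.

| Dataset | Method | 1% | 5% | 10% | 20% | 30% | 40% | 50% | 75% | 90% | 100% |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | Random | 39.10 | **67.14** | **78.43** | 86.87 | 89.91 | 91.66 | 92.83 | 94.40 | 95.08 | 95.40 |
| CIFAR-10 | Forgetting | 30.08 | 42.39 | 54.31 | 79.19 | 89.13 | **93.41** | **94.49** | **95.31** | 95.14 | 95.40 |
| CIFAR-10 | LCMat | **41.53** | 66.86 | 77.48 | **87.34** | **90.72** | 92.45 | 93.38 | 94.90 | **95.19** | 95.40 |
| CIFAR-100 | Random | 5.89 | 23.76 | 42.03 | 55.03 | **65.98** | **69.23** | **73.84** | 76.53 | 78.29 | 78.81 |
| CIFAR-100 | Forgetting | 7.01 | 20.69 | 34.22 | 50.95 | 61.54 | 68.92 | 72.65 | **78.55** | **79.69** | 78.81 |
| CIFAR-100 | LCMat | **8.43** | **28.51** | **42.81** | **55.77** | 64.39 | 67.22 | 73.11 | 77.51 | 78.47 | 78.81 |
| ImageNet | Random | **6.14** | **33.17** | 45.87 | 59.19 | 65.94 | 68.23 | 70.14 | 73.74 | 74.83 | 75.85 |
| ImageNet | Forgetting | 4.78 | 28.18 | 45.84 | **60.75** | **67.48** | **70.26** | **72.73** | **74.63** | **75.53** | 75.85 |
| ImageNet | LCMat | 6.01 | 32.26 | **46.08** | 59.02 | 65.28 | 68.50 | 70.30 | 74.13 | 74.81 | 75.85 |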

3.2 Theoretical analysis

To validate this experimental finding, we provide a theoretical analysis of optimal subset selection, revealing similar change of trends in the desirable subsets depending on the selection ratios. We consider a binary classification problem by solving a linear regression problem, as detailed below: Data samples $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n \in \mathbb{R}^d$ are generated from a multivariate normal distribution, $\mathcal{D} = \frac{1}{\sqrt{d}}\mathcal{N}(0, \mathbf{I}_d)$. The label $y_i$ of sample $\mathbf{x}_i$ is determined by the sign of its first element. Specifically, if $(\mathbf{x}_i)_1 > 0$ then $y_i = 1$; and if $(\mathbf{x}_i)_1 < 0$, then $y_i = -1$. We define the score of each sample as $1/|(\mathbf{x}_i)_1|$. Samples closer to the decision boundary $(\mathbf{x})_1 = 0$ have higher scores, while those farther from the boundary have lower scores. We select a label-balanced subset of size $m$, denoted by $(\mathbf{X}_{\mathbf{S}}, \mathbf{y}_{\mathbf{S}}) \in \mathbb{R}^{d \times m} \times \{-1, 1\}^m$, and use it to solve a linear regression problem to find $\mathbf{w}_{\mathbf{S}} = \operatorname*{arg\,min}_{\mathbf{w} \in \mathbb{R}^d} \|\mathbf{y}_{\mathbf{S}} - \mathbf{X}_{\mathbf{S}}^\top \mathbf{w}\|_2^2$. For a new sample $\mathbf{x}'$, our decision will be $+1$ if $\mathbf{w}_{\mathbf{S}}^\top \mathbf{x}' > 0$ and $-1$ otherwise. Thus, we consider $\mathbf{w}_{\mathbf{S}}$ to be a better solution when the value of its first element, $(\mathbf{w}_{\mathbf{S}})_1$, is larger. For the above setup, we analyze the solution $\mathbf{w}_{\mathbf{S}}$ depending on the subset size $|\mathbf{S}|$.

A similar problem setup was analyzed in (Sorscher et al., 2022), demonstrating that the optimal selection strategy varies depending on the subset ratio. Specifically, Sorscher et al. (2022) considers a max margin classifier trained on a data subset selected by the teacher-perceptron model, providing a comprehensive set of equations enabling numerical computation of the generalization error for various subset data distributions. In contrast, our contribution lies in providing a closed-form solution for the optimal linear classifier, as summarized in the theorem below. This theorem shows the transition of the optimal sample selection strategy between sample-deficient and sample-sufficient regimes.

Theorem 1 (Informal).

If the subset size is as small as $|\mathbf{S}| = m \ll \sqrt{d/\ln d}$, then the first coordinate of $\mathbf{w}_{\mathbf{S}}$ is approximated as $(\mathbf{w}_{\mathbf{S}})_1 \approx \sum_{i=1}^{m} |(\mathbf{x}_i)_1|$. On the other hand, if $|\mathbf{S}| = m \gg d^2 \ln d$, it can be approximated as $(\mathbf{w}_{\mathbf{S}})_1 \approx \left(\sum_{i=1}^{m} |(\mathbf{x}_i)_1|\right) / \left(\sum_{i=1}^{m} |(\mathbf{x}_i)_1|^2\right)$.

A more formal statement and the proof of Thm. 1 are available in Appendix A.2. From Thm. 1, it is evident that the characteristics of the desirable data subset $\mathbf{X}_{\mathbf{S}}$ vary depending on the subset size regime. In the sample-deficient regime ($m \ll \sqrt{d/\ln d}$), it is more advantageous to include samples that are farther from the decision boundary (easy samples) in $\mathbf{X}_{\mathbf{S}}$ to train a better classifier, resulting in a higher value of $(\mathbf{w}_{\mathbf{S}})_1$. Conversely, in the sample-sufficient regime ($m \gg d^2 \ln d$), it is more beneficial to include samples closer to the decision boundary (difficult samples) to increase $(\mathbf{w}_{\mathbf{S}})_1$. We conjecture that the relatively wide gap between the two regimes ($[\sqrt{d/\ln d}, d^2 \ln d]$) may be attributed to the loose analysis. We anticipate that a more precise boundary will occur at $m = \Theta(d)$, where $m \ll d$ ($m \gg d$) corresponds to the sample-deficient (sufficient) regime. We provide empirical results that support this theoretical analysis and our conjecture in Appendix A.3.
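The two approximations in Thm. 1 can be checked numerically with a short simulation. The sketch below is an illustration under assumptions that are not from the paper: it uses a random subset rather than a label-balanced or score-selected one, takes the minimum-norm least-squares solution when $m < d$, and picks the sizes `d` and `m` arbitrarily so that each regime's condition holds.

```python
import numpy as np

def first_coordinate(d, m, rng):
    """Draw m samples from N(0, I_d / d), label them by the sign of the
    first coordinate, and return (w_S)_1 with the two approximations."""
    X = rng.standard_normal((m, d)) / np.sqrt(d)  # rows are samples
    y = np.sign(X[:, 0])
    # lstsq returns a least-squares solution; when the system is
    # underdetermined (m < d) it returns the minimum-norm one.
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    abs_first = np.abs(X[:, 0])
    deficient = abs_first.sum()                       # sum_i |(x_i)_1|
    sufficient = deficient / (abs_first ** 2).sum()   # ratio form
    return w[0], deficient, sufficient

rng = np.random.default_rng(0)
# Sample-deficient regime: m much smaller than sqrt(d / ln d).
w1_small, approx_small, _ = first_coordinate(d=20000, m=5, rng=rng)
# Sample-sufficient regime: m much larger than d^2 ln d.
w1_large, _, approx_large = first_coordinate(d=5, m=20000, rng=rng)
print(w1_small, approx_small)
print(w1_large, approx_large)
```

In each regime the printed pair should be close, which matches the intuition that a small subset benefits from large $|(\mathbf{x}_i)_1|$ while a large subset benefits from a large ratio of the first to the second absolute moment.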

Having identified the distinct properties of desirable data subsets depending on the subset size, the remaining question is how to design a universal data selection method capable of performing well across a wide range of selection ratios.

4 Window Subset: Flexible Subset Selection

4.1 Desirable difficulty level for data subsets in moderate selection ratios? Hard, but not the hardest

The underlying rationale for difficulty score-based selection methods like Forgetting (Toneva et al., 2019) and EL2N (Paul et al., 2021) is that training models on a subset consisting of challenging data will enable the models to learn (or memorize) the atypical features of hard samples, while still retaining the capacity to learn typical features of easier samples. However, as shown in our empirical findings in Sec. 3.1 and further supported by our theoretical analysis in Sec. 3.2, this assumption may not hold when the subset ratio is extremely small. This leads us to our next question: Is it still feasible for models trained on hard instances to effectively learn easier instances, without having been exposed to these samples during training, at moderate selection ratios?

Figure 2: Results of the “training set split” experiment on the CIFAR-10 dataset, in which five different models are trained on five different data subsets, divided by their difficulty rankings, from $[0,20]\%$ (hardest) to $[80,100]\%$ (easiest). Model accuracies ($y$-axis) are evaluated on all five subsets ($x$-axis) separately. The right figures visualize the t-SNE of test samples’ features extracted from models trained on the hardest $[0,20]\%$ subset (top) and the $[20,40]\%$ subset (bottom).

To investigate this, we design a “training set split” experiment on CIFAR-10 dataset. We divide the training dataset into five subsets and observe the impact of training on each subset on the accuracy across the other subsets. In detail, we sort the CIFAR-10 training instances by forgetting score (Toneva et al., 2019) and divide them into five subsets based on consecutive ranking intervals: the hardest 20% (rankings within $[0,20]\%$), $[20,40]\%$, and so on, up to the easiest 20% ($[80,100]\%$). We train five different ResNet18 models, each on one of these subsets, and then evaluate their classification accuracies on all five subsets separately.
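The split-and-cross-evaluate protocol can be sketched as below. This is a schematic only: `train_fn` and `accuracy_fn` are hypothetical placeholders for whatever training and evaluation routines are used (for instance, training a ResNet18 on the given indices), and the function names are introduced here for illustration.

```python
import numpy as np

def split_by_difficulty(scores, num_splits=5):
    """Sort samples from hardest (highest score) to easiest and cut the
    ranking into `num_splits` consecutive, near-equal intervals."""
    order = np.argsort(-np.asarray(scores), kind="stable")
    return np.array_split(order, num_splits)

def cross_subset_accuracy(splits, train_fn, accuracy_fn):
    """Return acc where acc[i, j] is the accuracy on split j of the model
    trained on split i. `train_fn(indices)` returns a model and
    `accuracy_fn(model, indices)` returns a float; both are placeholders."""
    acc = np.zeros((len(splits), len(splits)))
    for i, train_idx in enumerate(splits):
        model = train_fn(train_idx)
        for j, eval_idx in enumerate(splits):
            acc[i, j] = accuracy_fn(model, eval_idx)
    return acc
```

Row `i` of the resulting matrix corresponds to one curve in a plot like Fig. 2: the model trained on the `i`-th difficulty interval, evaluated on each interval in turn.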

The results, presented in Fig. 2, reveal that models trained on harder data subsets generally perform better across all subsets, with the exception of the model trained solely on the hardest 20%. For instance, the model trained on the $[20,40]\%$-ranked subset effectively classifies instances not only within its training range but also those in the easier $[40,100]\%$ range. This suggests that training with harder instances helps the model learn both the unique features of these challenging instances and the common, representative features of the entire dataset. This finding supports the rationale behind existing score-based selection methods, which prioritize selecting challenging data for training.

Yet, this pattern does not hold for the model trained exclusively on the hardest 20% subset. This model exhibits a significant drop in accuracy across all the easier subsets, except for the hardest subset it was trained on. This indicates that a model trained with only the most challenging instances lacks generalizability to easier samples.

We support this claim by analyzing the feature spaces of models trained with the hardest $[0,20]\%$ subset and the subsequent $[20,40]\%$ subset. Our focus is on demonstrating that the model trained with the hardest 20% subset struggles to effectively create a feature space for classification. We extract features of CIFAR-10 test samples from each model and visualize their t-SNE (van der Maaten & Hinton, 2008) in Fig. 2 (right). The figure reveals that the feature space generated by the model trained on the hardest subset does not efficiently cluster test samples by class. We further quantify this using the neural collapse property (Kothapalli, 2023), which compares within-class feature variability to inter-class feature variability. Let $f_{k,i}$ be the feature of the $i$-th data in the $k$-th class, $\mu_k = \frac{1}{n}\sum_{i=1}^{n} f_{k,i}$ be the feature mean of class $k$, and $\mu_G = \frac{1}{K}\sum_{k=1}^{K} \mu_k$ be the global mean feature, where $n$ is the number of samples per class and $K$ is the number of classes. The within-class covariance $\Sigma_W$ is defined by $\frac{1}{Kn}\sum_{k=1}^{K}\sum_{i=1}^{n} (f_{k,i} - \mu_k)(f_{k,i} - \mu_k)^\top$, and the inter-class covariance $\Sigma_B$ by $\frac{1}{K}\sum_{k=1}^{K} (\mu_k - \mu_G)(\mu_k - \mu_G)^\top$. The trace, $\text{tr}(\Sigma_W \Sigma_B^{\dagger})$, where $\dagger$ denotes the pseudo-inverse, then measures the clusterability of features with respect to their classes, with a lower value indicating better clustering. The $\text{tr}(\Sigma_W \Sigma_B^{\dagger})$ values for models trained on $[0,20]\%$ (the hardest 20%), $[20,40]\%$, and so on, up to $[80,100]\%$, and the full dataset, are 9.33, 1.68, 1.99, 2.60, 3.35, and 1.04, respectively. Notably, there is a significant increase in $\text{tr}(\Sigma_W \Sigma_B^{\dagger})$ for the hardest subset, suggesting poor feature learning for classification.
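The clusterability measure $\text{tr}(\Sigma_W \Sigma_B^{\dagger})$ can be computed from extracted features roughly as follows. This is a minimal sketch, assuming the features are given as a NumPy array with integer class labels; the function name and the `rcond` cutoff used for the pseudo-inverse are choices made here, not taken from the paper.

```python
import numpy as np

def nc1_variability(features, labels):
    """Compute tr(Sigma_W @ pinv(Sigma_B)) for features of shape (N, p)
    and integer labels of shape (N,). Lower means tighter class clusters."""
    classes = np.unique(labels)
    means = np.stack([features[labels == k].mean(axis=0) for k in classes])
    global_mean = means.mean(axis=0)
    p = features.shape[1]
    sigma_w = np.zeros((p, p))
    for k, mu in zip(classes, means):
        centered = features[labels == k] - mu
        sigma_w += centered.T @ centered
    # Dividing by N equals the 1/(K n) normalization when classes are balanced.
    sigma_w /= len(features)
    diff = means - global_mean
    sigma_b = diff.T @ diff / len(classes)
    # Sigma_B has rank at most K - 1, so a pseudo-inverse is required;
    # rcond is set explicitly to drop numerically zero directions.
    return np.trace(sigma_w @ np.linalg.pinv(sigma_b, rcond=1e-10))
```

Applied to features from two models, the model whose classes form tighter, better separated clusters yields the smaller value.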

In summary, training with harder data generally helps the model learn both representative and atypical features, which aids generalization. However, when the subset ratio is moderate and the subset consists of only the hardest samples, the model may suffer a significant performance drop and fail to learn features that are effective for classification.

4.2 The existence of a high-performing window subset

Figure 3: Sliding window experiments measuring the test accuracy of models trained on window subsets while varying the starting point of the window, on (a) CIFAR-10 and (b) CIFAR-100. Samples are sorted in descending order by their difficulty scores. The horizontal lines are results from random selection. For each subset ratio, there exists a best window, and its starting point shifts toward the left as the subset ratio increases. Results for the ImageNet dataset are also reported in Appendix F.1.

Section 4.1 implies that for each subset ratio, there is a proper difficulty level of the subset for better model generalization. Expecting that a subset composed of samples of the proper difficulty level will perform well, we consider the window selection method, similar to Lee & Chung (2024), that selects a window subset from samples ordered by their difficulty scores. In detail, we sort the samples in descending order of their difficulty scores and, for a given window size of $w\%$, select a starting point $s\%$ to choose the contiguous interval of samples within $[s, s+w]\%$. This approach has two merits: 1) flexibility and 2) computational efficiency. The flexibility in choosing the starting point $s\%$ of the window allows us to opt for easy, moderate, or hard data subsets. The search space of the window selection method is confined to the number of possible starting points for the windows, making it computationally much more efficient than general subset selection, where the search space scales as $\binom{n}{m} \approx \exp(cn)$ for some constant $c>0$ when the subset size $m$ is a constant fraction of $n$.
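To give a sense of the gap between the two search spaces, the sketch below counts the candidate windows for a 10% subset with a 5% step and compares this with $\log\binom{n}{m}$ for the CIFAR-10 training set size $n = 50{,}000$. The helper names `window_starts` and `log_num_subsets` are hypothetical and used only for illustration.

```python
import math

def window_starts(window_pct, step_pct=5):
    """Starting points s (in percent) such that [s, s + w] stays within [0, 100]."""
    return list(range(0, 100 - window_pct + 1, step_pct))

def log_num_subsets(n, m):
    """Natural log of C(n, m), computed with lgamma to avoid huge integers."""
    return math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)

n = 50_000        # CIFAR-10 training set size
m = n // 10       # a 10% subset
starts = window_starts(window_pct=10)

print(len(starts))                  # number of candidate windows: 19
print(log_num_subsets(n, m) / n)    # empirical value of the constant c in exp(c n)
```

For this setting the window search evaluates 19 candidates, while the exponent of the general search space is close to the binary entropy of 0.1 in nats (about 0.325) times $n$.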

We first explore the performance of the window selection approach while varying the starting point and illustrate the existence of the best window subset. We sort the samples from CIFAR-10/100 in descending order based on their Forgetting scores (Toneva et al., 2019), and select windows of different sizes, ranging from $10\%$ to $40\%$, by adjusting the starting point from $0\%$ to $(100-w)\%$ with a step size of $5\%$. We then train ResNet18 for CIFAR-10 and ResNet50 for CIFAR-100 using the window subsets and plot the resulting test accuracies in Fig. 3.

We can observe that, for each subset ratio, there exists an optimal starting point, and this optimal point shifts towards lower values (indicating more difficult samples) as the window subset size increases. Specifically, for CIFAR-10, the optimal window subset of size $10\%$ falls within the interval $[50,60]\%$, while for a window size of $40\%$, it falls within $[5,45]\%$. Similar trends are observed for CIFAR-100, albeit with different optimal starting points. For CIFAR-100, with a window size of $10\%$, the best window subset comprises samples from $[80,90]\%$, primarily consisting of easy samples. It is important to note that the $10\%$ subset for CIFAR-100 includes only 50 samples per class, whereas for CIFAR-10, it includes 500 samples per class. Consequently, the optimal $10\%$ window for CIFAR-100 ($[80,90]\%$) tends to include more easy samples, which capture the representative features of each class.

The observation that the optimal starting point of the window subset varies based on both the subset size and the dataset introduces a new challenge in window selection: How can we efficiently identify the best window subset without having to evaluate models trained on each subset? We address this crucial question by introducing a proxy task to estimate the quality of window subsets.

5 Best Window Selection (BWS)

Algorithm 1 BWS: Best Window Selection Method

Input Dataset $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ sorted by difficulty scores from the hardest to easiest, subset size $m$, and step size $t$.

  Train a feature extractor $f(\cdot)$ by $m$ randomly chosen samples from the dataset.
  Extract the features of the samples by using $f(\cdot)$ and denote them by $\mathbf{f}_i = [f(\mathbf{x}_i), 1]$.
  for $k \in \{0, t, 2t, 3t, \dots, \lfloor (n-m)/t \rfloor t\}$ do
    Define a window subset $\mathbf{S} = \{(\mathbf{f}_i, y_i)\}_{i=k}^{k+m-1}$.
    for $c \in \{1, 2, \dots, C\}$ do
       For the samples in $\mathbf{S}$ with label $c$, set the label equal to 1. For others, set the label to 0.
       Solve the linear regression problem Eq. 1 with the window subset $\mathbf{S}$. Let $\mathbf{w}_{\mathbf{S}}(c)$ be the solution.
    end for
    Obtain $\mathbf{w}_{\mathbf{S}} \in \mathbb{R}^{(d+1)\times C}$ by $\mathbf{w}_{\mathbf{S}} := [\mathbf{w}_{\mathbf{S}}(1), \dots, \mathbf{w}_{\mathbf{S}}(C)]$.
    Calculate the accuracy of $\mathbf{w}_{\mathbf{S}}$ by $\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\big(\operatorname*{arg\,max}_{c}(\mathbf{w}_{\mathbf{S}}^{\top}\mathbf{f}_i)_c = y_i\big)$.
  end for

Output Window subset $\mathbf{S}$ for which the accuracy of $\mathbf{w}_{\mathbf{S}}$ is maximized.

Table 2: Comparison between the window subsets chosen by the sliding window experiment in Fig. 3 (left) and BWS (using KRR as a proxy task) (right) in terms of their starting points and test accuracies. The chosen windows align well between the two.

| Ratio | Sliding window experiment: starting point | Sliding window experiment: test accuracy | BWS: starting point | BWS: test accuracy |
|---|---|---|---|---|
| 10% | 50% | 82.67 | 55% | 82.29 |
| 20% | 25% | 89.06 | 30% | 88.74 |
| 30% | 15% | 91.80 | 15% | 91.80 |
| 40% | 5% | 93.59 | 5% | 93.59 |

Our goal is to develop a computationally efficient method capable of assessing and identifying the best window subset without requiring the training of a model on every potential subset. To achieve this goal, we propose to solve a kernel ridge regression (KRR) problem using each window subset and to evaluate the performance of the corresponding solution on the full training dataset. Using KRR as a proxy task is motivated by the observation that kernel regression with model-related kernels can provide a good approximation to the original model (Neal, 1996; Lee et al., 2018; Jacot et al., 2018; Arora et al., 2019), while being computationally cheaper than training the actual models. We provide further justification of this proxy task in Appx. B. Alg. 1 outlines the main steps.

Let $\mathbf{f}_i := [f(\mathbf{x}_i), 1] \in \mathbb{R}^{d+1}$ be the feature vector of $\mathbf{x}_i$ obtained by a feature extractor $f(\cdot)$. The details of the feature extractor are given at the end of this section. For each window subset $\mathbf{S} = \{(\mathbf{f}_i, y_i)\}_{i=1}^{m}$ composed of $m$ samples, define $\mathbf{X}_{\mathbf{S}} := [\mathbf{f}_1, \dots, \mathbf{f}_m] \in \mathbb{R}^{(d+1)\times m}$ and $\mathbf{y}_{\mathbf{S}} := [y_1, \dots, y_m]$. Then, we denote the problem of kernel ridge regression, and the corresponding solution, using the subset $\mathbf{S}$ by

\begin{align}
\mathbf{w}_{\mathbf{S}} &:= \operatorname*{arg\,min}_{\mathbf{w}} \|\mathbf{y}_{\mathbf{S}} - \mathbf{X}_{\mathbf{S}}^{\top}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_2^2, \tag{1}\\
\mathbf{w}_{\mathbf{S}} &= (\lambda\mathbf{I}_{d+1} + \mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top})^{-1}\mathbf{X}_{\mathbf{S}}\mathbf{y}_{\mathbf{S}} \nonumber\\
&= \mathbf{X}_{\mathbf{S}}(\lambda\mathbf{I}_{m} + \mathbf{X}_{\mathbf{S}}^{\top}\mathbf{X}_{\mathbf{S}})^{-1}\mathbf{y}_{\mathbf{S}}. \tag{2}
\end{align}

We set $\lambda = 1$ to prevent singularity in the matrix inversion. Since the two expressions in Eq. 2 are equivalent, the matrix inversion can be performed efficiently in whichever of the two dimensions, $d+1$ or $m$, is lower.
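The equivalence of the two expressions in Eq. 2 can be checked numerically. The following is a small sketch on random toy data (the dimensions and the random matrix are illustrative assumptions, not values from our experiments), with the columns of `X` playing the role of the feature vectors in $\mathbf{X}_{\mathbf{S}}$.

```python
import numpy as np

rng = np.random.default_rng(0)
d_plus_1, m, lam = 6, 40, 1.0
X = rng.normal(size=(d_plus_1, m))             # columns are feature vectors f_i, as in X_S
y = rng.integers(0, 2, size=m).astype(float)   # binary one-vs-rest labels

# First expression of Eq. 2: solve a (d+1) x (d+1) linear system.
w_primal = np.linalg.solve(lam * np.eye(d_plus_1) + X @ X.T, X @ y)
# Second expression of Eq. 2: solve an m x m linear system.
w_dual = X @ np.linalg.solve(lam * np.eye(m) + X.T @ X, y)

print(np.allclose(w_primal, w_dual))
```

In practice one would pick the first form when $d+1 \le m$ and the second otherwise, so that the linear system being solved is as small as possible.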

Figure 4: Data pruning experiments on (a) CIFAR-10, (b) CIFAR-100, and (c) ImageNet. Test accuracy of the models trained with data subsets of varying ratios, selected by different methods. Our method (BWS) outperforms other baselines across a wide range of selection ratios and achieves accuracy as high as the Oracle window. Full results are reported in Tables 13–15.

Our algorithm finds the best window subset by evaluating the performance of $\mathbf{w}_{\mathbf{S}}$, corresponding to each window subset $\mathbf{S}$, on classifying the training samples $\{(\mathbf{x}_i, y_i)\}_{i=1}^{n}$ as described in Alg. 1. To apply $\mathbf{w}_{\mathbf{S}}$ to a $C$-class classification problem, we find $\mathbf{w}_{\mathbf{S}}(c) \in \mathbb{R}^{d+1}$ for each class $c \in \{1, \dots, C\}$, classifying whether a sample belongs to class $c$ or not, and simply place the vectors in the columns of $\mathbf{w}_{\mathbf{S}} \in \mathbb{R}^{(d+1)\times C}$.
Then, we evaluate the performance of $\mathbf{w}_{\mathbf{S}}$ by calculating the classification accuracy $\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\big(\operatorname*{arg\,max}_{c}(\mathbf{w}_{\mathbf{S}}^{\top}\mathbf{f}_i)_c = y_i\big)$ on the full training set.
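The window evaluation loop described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions rather than our released implementation: the function name `best_window` is hypothetical, the features are assumed to be already extracted and sorted from hardest to easiest, labels are 0-based integers, and the toy data at the bottom is synthetic.

```python
import numpy as np

def best_window(features, labels, m, t, num_classes, lam=1.0):
    """Return (start index, proxy accuracy) of the best window.

    features: (n, d) array, rows already sorted from hardest to easiest.
    labels:   (n,) integer labels in {0, ..., num_classes - 1}.
    """
    n = features.shape[0]
    F = np.hstack([features, np.ones((n, 1))])   # rows are f_i = [f(x_i), 1]
    Y = np.eye(num_classes)[labels]              # one-vs-rest 0/1 targets, one column per class
    best_start, best_acc = -1, -1.0
    for k in range(0, n - m + 1, t):             # k = 0, t, 2t, ..., floor((n - m) / t) * t
        Fs, Ys = F[k:k + m], Y[k:k + m]
        # Closed-form ridge solution in the (d+1)-dimensional form, all classes at once.
        W = np.linalg.solve(lam * np.eye(F.shape[1]) + Fs.T @ Fs, Fs.T @ Ys)
        # Accuracy of the linear classifier on all n training samples.
        acc = float(np.mean(np.argmax(F @ W, axis=1) == labels))
        if acc > best_acc:
            best_start, best_acc = k, acc
    return best_start, best_acc

# Toy usage with synthetic "features" (hypothetical data for illustration only).
rng = np.random.default_rng(0)
n, d, C = 400, 5, 4
labels = rng.integers(0, C, size=n)
features = np.eye(C)[labels] @ rng.normal(size=(C, d)) + 0.5 * rng.normal(size=(n, d))
start, acc = best_window(features, labels, m=80, t=20, num_classes=C)
print(start, round(acc, 3))
```

Solving for all $C$ one-vs-rest targets with a single matrix right-hand side gives the same columns as solving the $C$ regression problems separately, since they share the same system matrix.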

In Table 2, we compare the window subsets chosen by the sliding window experiment in Fig. 3 and by BWS (using KRR as a proxy task) on the CIFAR-10 dataset, in terms of their starting points and test accuracies for each subset ratio. We can observe that the window subsets chosen by KRR align well with those chosen by the sliding window experiment. This observation demonstrates the effectiveness of our algorithm, which removes the need to train models on each window subset and evaluate them on the test dataset. BWS also finds near-optimal starting points for the CIFAR-100 and ImageNet datasets across a broad range of subset ratios (from 1% to 90%). The detailed results are available in Appendix H.

Feature extractor

When $|\mathbf{S}| = m$, we randomly select $m$ samples from the full dataset, and use these samples to train a neural network for a few epochs to generate a feature extractor $f(\cdot)$. For CIFAR-10, we train ResNet18 for 20 epochs, and for CIFAR-100/ImageNet, we train ResNet50 for 20 epochs. The rationale behind training a feature extractor with random samples matching the window subset size is to simulate the situation where the model is trained with a limited window subset of the same size, enabling effective quality evaluation for window subsets.

Computational complexity

The computational complexity of Algorithm 1 includes training a feature extractor and solving the regression problem for each of the $\lfloor (n-m)/t \rfloor + 1$ window subsets. Training the feature extractor is relatively efficient since it involves only a few epochs. Solving the regression requires matrix inversion, which takes $O(\min(d,m)^3)$ steps, with $d=512$ for ResNet18 and $d=2048$ for ResNet50. This cost is significantly lower than that of the optimization-based baselines. For example, running BWS for the CIFAR-10 dataset with ResNet18 and a step size of 5% takes less than 11 seconds. Detailed comparisons are provided in Appendix C.3.

6 Experiments

To demonstrate the effectiveness of our method, we conduct data pruning experiments. We select a subset of the dataset using each selection method while pruning the rest of the samples, and evaluate the performance of the model trained with each subset. We perform these experiments using ResNet18 for CIFAR-10 and ResNet50 for CIFAR-100 and ImageNet. Baselines include 1) two difficulty score-based selection methods: Forgetting and EL2N, 2) two optimization-based selection methods: AdaCore and LCMat, and 3) two universal selection methods: Moderate DS score and CCS. We also add SSL Prototype (Sorscher et al., 2022) and memorization score (Feldman & Zhang, 2020) as baselines in the ImageNet experiment, since these scores are known to achieve competitive performance, especially on large-scale datasets (Sorscher et al., 2022). More details about the baselines and experiments are available in Appx. C. The full experimental results are available in Appx. H.

6.1 Experimental Results

Data pruning experiments

In Fig. 4, we present the test accuracies of models trained with data subsets of varying ratios, selected by different methods. The reported values are means, and the shaded regions are standard deviations, across three independent runs for CIFAR-10/100 and two for ImageNet. The Oracle window curve represents the results obtained using the window subset with the highest test accuracy found by the sliding window experiment as in Fig. 3, and BWS represents the results obtained using Alg. 1. We can observe that our method, BWS, consistently outperforms all other baselines across almost all selection ratios, and achieves performance near that of the Oracle window. In the case of CIFAR-10/100, the difficulty score-based methods, Forgetting and EL2N, perform well in high ratio regimes but experience a significant performance drop as the selection ratio decreases. The optimization-based methods, LCMat and AdaCore, achieve better performance than the difficulty score-based methods at low selection ratios but underperform at high selection ratios. Detailed numbers are reported in Tables 13–15.

Figure 5: (a) Cross-architecture experiment (CIFAR-10 with ViT). Test accuracy of the model fine-tuned with subsets of varying ratios in the CIFAR-10 dataset, selected by different methods. We utilize the Vision Transformer (ViT) architecture, pretrained on the ImageNet dataset. (b) Robustness to label noise (Noisy CIFAR-10). Data pruning experiments with CIFAR-10, including 20% label noise. For both experiments, BWS surpasses other baselines for a wide range of selection ratios.

Cross-architecture experiments

To test the robustness of our method to changes in model architecture, we conduct data pruning experiments on CIFAR-10 while using different architectures for sample scoring and for training. Window subsets are constructed using samples ordered by their Forgetting scores, calculated on ResNet18, and then the best window selection (Alg. 1) and the model training are conducted using a simpler CNN or EfficientNet-B0, or a larger Vision Transformer (ViT) (Dosovitskiy et al., 2021) pre-trained on ImageNet. The results on the ViT are presented in Fig. 5(a), while those on CNN and EfficientNet-B0 are shown in Fig. 9 of Appx. F.2. In all cases, our method consistently achieves competitive performance across all selection ratios, demonstrating its robustness to changes in neural network architectures during data subset selection.

Robustness to label noise

Additionally, we demonstrate that BWS is robust against label noise in subset selection. Existing sample selection methods that rely on difficulty-based sample scores (Toneva et al., 2019; Paul et al., 2021) are susceptible to a particular limitation: they often assign high scores to samples corrupted by label noise, as these samples are inherently hard to learn. This poses the risk of unintentionally selecting noisy samples during the selection phase. In contrast, our BWS algorithm solves a proxy task using kernel ridge regression rather than relying solely on high or low difficulty-based scores. We test the robustness of BWS in the presence of label noise by corrupting a randomly chosen 20% of the samples in the CIFAR-10 dataset with random label noise. To further enhance the robustness of our method, we modify Alg. 1 to evaluate the solution of kernel ridge regression using only the lowest-scoring 50% of the training samples, which will rarely include label-noise samples, instead of the full dataset. In Fig. 5(b), we compare the performance of this modified version of BWS with other baselines. While difficulty score-based and optimization-based selection methods suffer from performance degradation due to label noise, our method, along with another label noise-robust method, Moderate DS, achieves performance even higher than what is achievable with the full training dataset, which includes the 20% label noise. Further experimental results with higher noise ratios are provided in Appendix F.3.
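The modified evaluation step can be sketched as follows. This is a minimal illustration, assuming the rows are sorted from hardest to easiest so that the lowest-scoring half corresponds to the last rows; the function name `accuracy_on_easy_half` and the toy classifier are hypothetical, and the toy example places all corrupted labels at the hard end purely for illustration.

```python
import numpy as np

def accuracy_on_easy_half(F, labels, W, keep_fraction=0.5):
    """Accuracy of the linear classifier W on the lowest-scoring (easiest) samples.

    F: (n, d+1) feature rows sorted from hardest to easiest; W: (d+1, C) weights.
    """
    n = len(labels)
    start = n - int(keep_fraction * n)           # first row of the easiest portion
    preds = np.argmax(F[start:] @ W, axis=1)
    return float(np.mean(preds == labels[start:]))

# Toy example: a classifier that is perfect on clean labels, with 20 corrupted labels
# placed among the "hardest" rows.
true_labels = np.arange(100) % 4
F = np.eye(4)[true_labels]
W = np.eye(4)
noisy_labels = true_labels.copy()
noisy_labels[:20] = (noisy_labels[:20] + 1) % 4
full_acc = float(np.mean(np.argmax(F @ W, axis=1) == noisy_labels))
easy_acc = accuracy_on_easy_half(F, noisy_labels, W)
print(full_acc, easy_acc)
```

Evaluating on the easy half keeps the corrupted labels from penalizing a classifier that predicts the clean label, which is the motivation for this modification.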

Figure 6: (a) Two half-width windows. Test accuracies of the models trained with two half-width windows of varying starting points. Each axis indicates the starting point of one window, and a brighter color indicates higher accuracy. The best result is observed near the diagonal, which corresponds to contiguous windows. (b) Wider window. Test accuracy of the models trained with wider windows. Horizontal lines are the results of the oracle window subset. At high ratios, the oracle window outperforms wider windows.

6.2 Ablation study

BWS operates by sorting training instances based on their difficulty scores, creating window subsets, and selecting the best window by a proxy task. To assess the importance of each component, we conduct several ablation studies.

Different types of window subsets

Our method employs a window type that includes samples from a contiguous range of difficulty scores while changing the starting point. We explore two more general window types: a union of two half-width windows, and a wider window from which the samples are randomly selected. For two half-width windows, given a subset of size $w$, we search over all combinations of two half-width windows, denoted by $[x_1, x_1+w/2] \cup [x_2, x_2+w/2]$, while varying their starting points $x_1 \in [0, 100-w]$ and $x_2 \in [x_1+w/2, 100-w/2]$ with a step size of $5\%$. For wider windows, we consider a window that is $c$ times wider than the subset size $w$, denoted as $[x_1, x_1+c\cdot w]$, while varying the starting point $x_1$ within the range $[0, 100-c\cdot w]$ with a step size of $5\%$. These ablation studies are conducted with CIFAR-10 on ResNet18, to see whether more general subset selection can bring a meaningful gain, possibly at the cost of additional computation.
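The two generalized search spaces can be enumerated as in the sketch below. The helper names are hypothetical, window sizes are given as even integer percentages, and the grid for $x_2$ is assumed to start at $x_1 + w/2$ and advance in 5% steps; the random sampling inside a wider window is not shown.

```python
def two_half_windows(w, step=5):
    """All (x1, x2) with x1 in [0, 100 - w] and x2 in [x1 + w/2, 100 - w/2], in percent."""
    half = w // 2                                 # assumes an even window size w
    return [(x1, x2)
            for x1 in range(0, 100 - w + 1, step)
            for x2 in range(x1 + half, 100 - half + 1, step)]

def wider_window_starts(w, c=2, step=5):
    """Starting points x1 of windows [x1, x1 + c * w] that stay within [0, 100]."""
    return list(range(0, 100 - c * w + 1, step))

pairs = two_half_windows(10)
contiguous = [(x1, x2) for (x1, x2) in pairs if x2 == x1 + 5]   # the two halves touch
print(len(pairs), len(contiguous), len(wider_window_starts(10)))
```

For a 10% subset this gives 190 two-half-width candidates, of which 19 are ordinary contiguous windows, so the two-half-width search is roughly ten times more expensive than the contiguous one at this ratio.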

In Table 3, we present the maximum test accuracies achieved by the two non-contiguous window types (two half-width and wider windows) and by the contiguous window type. Recall that the two half-width window type includes all the contiguous windows. We observe that for every subset ratio, the performance of the best contiguous window subset almost matches that of two half-width windows, and outperforms wider windows. Moreover, Fig. 6(a) shows that the best composition of two half-width windows occurs when the two windows are close to each other (the diagonal positions in the figure). The sliding window experiment for wider windows in Fig. 6(b) shows that the best contiguous window (horizontal lines) achieves better performance than wider windows, especially at high ratios. These results support our use of contiguous window subsets in choosing the near-optimal subset in a computationally efficient manner across a broad range of selection ratios. Further results are reported in Appx. G.1–G.2.

Table 3: The maximum test accuracy achieved by each window type in CIFAR-10. The best contiguous window nearly matches two half-width windows and outperforms wider windows.

| Selection ratio | Two half-width windows | Twice wider window | Best contiguous window |
|---|---|---|---|
| 10% | 83.04 | 82.37 | 82.67 |
| 20% | 89.16 | 89.01 | 89.06 |
| 30% | 92.02 | 91.72 | 91.80 |
| 40% | 93.67 | 92.62 | 93.59 |

Different types of proxy task

Table 4: Test accuracy of the models trained on window subsets of CIFAR-10 selected by different proxy tasks, for each subset ratio. Our method achieves better performance, and the best window subsets it selects align better with the oracle windows.

| Proxy task | Metric | 1% | 5% | 10% | 20% | 30% | 40% | 50% | 75% | 90% |
|---|---|---|---|---|---|---|---|---|---|---|
| SVP | Test accuracy | 46.25 | 71.35 | 80.95 | 88.06 | 90.68 | 91.63 | 93.36 | 94.75 | 95.37 |
| SVP | Window index | 80% | 65% | 60% | 40% | 30% | 25% | 15% | 5% | 0% |
| Gradient difference | Test accuracy | 39.45 | 70.40 | 82.24 | 88.42 | 90.68 | 91.63 | 92.72 | 94.30 | 94.82 |
| Gradient difference | Window index | 50% | 45% | 40% | 35% | 30% | 25% | 20% | 10% | 5% |
| Gradient similarity | Test accuracy | 36.33 | 60.46 | 74.77 | 87.79 | 91.77 | 93.59 | 94.54 | 95.23 | 95.37 |
| Gradient similarity | Window index | 30% | 25% | 20% | 15% | 10% | 5% | 0% | 0% | 0% |
| BWS | Test accuracy | 46.10 | 70.70 | 82.29 | 88.74 | 91.80 | 93.59 | 94.54 | 95.23 | 95.37 |
| BWS | Window index | 90% | 70% | 55% | 30% | 15% | 5% | 0% | 0% | 0% |
| Oracle | Test accuracy | 47.17 | 72.89 | 82.67 | 89.06 | 91.80 | 93.59 | 94.54 | 95.23 | 95.37 |
| Oracle | Window index | 85% | 55% | 50% | 25% | 15% | 5% | 0% | 0% | 0% |

We also evaluate the effectiveness of our proxy task, kernel ridge regression in Alg. 1, by comparing it with three different variants: 1) Selection via proxy (SVP) (Coleman et al., 2020), utilizing a smaller model (ConvNet) for choosing the best window, 2) Gradient $\ell_2$-norm difference, which finds a window subset minimizing the $\ell_2$-norm difference between the average gradients of the full dataset and the window subset, and 3) Gradient cosine similarity, which finds a window subset maximizing the cosine similarity between the average gradients of the full dataset and the window subset. The last two methods are inspired by gradient-matching strategies used in optimization-based coreset selection (Mirzasoleiman et al., 2020; Yang et al., 2023). Table 4 presents the test accuracies achieved by models trained on window subsets selected by each method, along with the corresponding starting points of the chosen windows. The last row shows the result with the oracle window. Our method achieves better test accuracy compared to the three variants, and the window selected by our method aligns better with the oracle selection. In particular, SVP tends to select easier subsets, possibly due to the limited capacity of the simple network used in the proxy task. This result demonstrates that the best subset cannot be effectively chosen by using a simpler network or by matching the average gradients; it requires a proxy task such as kernel ridge regression, with model-related kernels, to evaluate the quality of window subsets for classification tasks. We also perform an ablation study to show the robustness of our method across various difficulty scores in Appx. G.3.
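The two gradient-based criteria can be sketched as follows, assuming that per-sample gradients are already available as the rows of an array sorted in the same order as the dataset. How these gradients are obtained (which model and which parameters) is not specified here, so the array `grads`, the function name `gradient_proxies`, and the toy data are illustrative assumptions.

```python
import numpy as np

def gradient_proxies(all_grads, start, m):
    """L2 difference and cosine similarity between full-data and window average gradients.

    all_grads: (n, p) array of per-sample gradients, sorted like the dataset.
    """
    g_full = all_grads.mean(axis=0)                  # average gradient of the full dataset
    g_win = all_grads[start:start + m].mean(axis=0)  # average gradient of the window subset
    l2_diff = float(np.linalg.norm(g_full - g_win))
    cos_sim = float(g_full @ g_win / (np.linalg.norm(g_full) * np.linalg.norm(g_win)))
    return l2_diff, cos_sim

# Toy usage: a window covering the whole dataset matches the full gradient exactly.
rng = np.random.default_rng(0)
grads = rng.normal(loc=0.5, size=(200, 16))
l2_all, cos_all = gradient_proxies(grads, start=0, m=200)
l2_win, cos_win = gradient_proxies(grads, start=50, m=40)
print(l2_all, cos_all, l2_win, cos_win)
```

Under the first criterion one would pick the window with the smallest `l2_diff`, and under the second the window with the largest `cos_sim`.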

7 Conclusion and Discussion

We introduced the Best Window Selection (BWS), a universal and efficient data subset selection method capable of achieving competitive performance across a wide range of selection ratios. Our experimental results demonstrate that BWS effectively identifies the best window subset from samples ordered by difficulty-based scores, utilizing a simple proxy task based on kernel ridge regression. This method outperforms previous data subset selection approaches, which often excel within a limited range of selection ratios.

Subset selection has become a crucial technique in the big data era, allowing for the reduction of large datasets with minimal information loss. However, current efforts, including BWS, mainly focus on sample selection for supervised learning on curated datasets designed for classification tasks with well-defined labels. The next stage for subset selection may involve addressing challenges associated with much larger and more complex datasets. For instance, DataComp (Gadre et al., 2023) proposes a new benchmark for subset selection by providing a web-scale multimodal dataset as the full training set. This setup challenges researchers to develop strategies for selecting subsets that benefit diverse downstream test sets capable of zero-shot generalization.

We believe that the insights gained through BWS, specifically the shifts in the desired dataset characteristics based on selection ratio and the methodology for efficiently identifying the optimal subset using a simple proxy task, may provide valuable perspectives for designing data filtering or selection strategies for these large-scale datasets.

Impact Statement

This paper addresses the performance degradation seen in existing data subset selection methods when the selection ratio varies widely. We introduce a methodology specifically designed to effectively counter this challenge. Our proposed universal data subset selection method delivers consistent, competitive performance across various selection ratios. This is particularly valuable in practical situations where computational and storage resources for training can vary, necessitating flexible sample selection based on the required subset ratios.

Acknowledgements

This research was supported by the National Research Foundation of Korea under grant 2021R1C1C11008539.

References

  • Arora et al. (2019) Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. On exact computation with an infinitely wide neural net. In Advances in Neural Information Processing Systems, 2019.
  • Chen et al. (2010) Chen, Y., Welling, M., and Smola, A. Super-samples from kernel herding. In Conference on Uncertainty in Artificial Intelligence, 2010.
  • Citovsky et al. (2023) Citovsky, G., DeSalvo, G., Kumar, S., Ramalingam, S., Rostamizadeh, A., and Wang, Y. Leveraging importance weights in subset selection. In International Conference on Learning Representations, 2023.
  • Coleman et al. (2020) Coleman, C., Yeh, C., Mussmann, S., Mirzasoleiman, B., Bailis, P., Liang, P., Leskovec, J., and Zaharia, M. Selection via proxy: Efficient data selection for deep learning. In International Conference on Learning Representations, 2020.
  • Culotta & McCallum (2005) Culotta, A. and McCallum, A. Reducing labeling effort for structured prediction tasks. In Association for the Advancement of Artificial Intelligence, 2005.
  • Das et al. (2021) Das, S., Singh, A., Chatterjee, S., Bhattacharya, S., and Bhattacharya, S. Finding high-value training data subset through differentiable convex programming. In Machine Learning and Knowledge Discovery in Databases, 2021.
  • Deng et al. (2009) Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In IEEE conference on computer vision and pattern recognition, 2009.
  • Dosovitskiy et al. (2021) Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., and Houlsby, N. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
  • Fan & Wang (2020) Fan, Z. and Wang, Z. Spectra of the conjugate kernel and neural tangent kernel for linear-width neural networks. In Advances in Neural Information Processing Systems, 2020.
  • Feldman & Zhang (2020) Feldman, V. and Zhang, C. What neural networks memorize and why: Discovering the long tail via influence estimation. In Advances in Neural Information Processing Systems, 2020.
  • Gadre et al. (2023) Gadre, S. Y., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., Marten, R., Wortsman, M., Ghosh, D., Zhang, J., Orgad, E., Entezari, R., Daras, G., Pratt, S., Ramanujan, V., Bitton, Y., Marathe, K., Mussmann, S., Vencu, R., Cherti, M., Krishna, R., Koh, P. W., Saukh, O., Ratner, A., Song, S., Hajishirzi, H., Farhadi, A., Beaumont, R., Oh, S., Dimakis, A., Jitsev, J., Carmon, Y., Shankar, V., and Schmidt, L. Datacomp: In search of the next generation of multimodal datasets, 2023.
  • Ghorbani & Zou (2019) Ghorbani, A. and Zou, J. Data shapley: Equitable valuation of data for machine learning. In International Conference on Machine Learning, 2019.
  • Har-Peled et al. (2007) Har-Peled, S., Roth, D., and Zimak, D. Maximum margin coresets for active and noise tolerant learning. In International Joint Conference on Artificial Intelligence, 2007.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  • Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, 2018.
  • Jiang et al. (2021) Jiang, Z., Zhang, C., Talwar, K., and Mozer, M. C. Characterizing structural regularities of labeled data in overparameterized models. In International Conference on Machine Learning, 2021.
  • Just et al. (2023) Just, H. A., Kang, F., Wang, T., Zeng, Y., Ko, M., Jin, M., and Jia, R. LAVA: Data valuation without pre-specified learning algorithms. In International Conference on Learning Representations, 2023.
  • Ki et al. (2023) Ki, N., Choi, H., and Chung, H. W. Data valuation without training of a model. In International Conference on Learning Representations, 2023.
  • Killamsetty et al. (2021a) Killamsetty, K., Sivasubramanian, D., Ramakrishnan, G., De, A., and Iyer, R. Grad-match: Gradient matching based data subset selection for efficient deep model training. In International Conference on Machine Learning, 2021a.
  • Killamsetty et al. (2021b) Killamsetty, K., Sivasubramanian, D., Ramakrishnan, G., and Iyer, R. Glister: Generalization based data subset selection for efficient and robust learning. In Association for the Advancement of Artificial Intelligence, 2021b.
  • Koh & Liang (2017) Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In Advances in Neural Information Processing Systems, 2017.
  • Kong et al. (2022) Kong, S. T., Jeon, S., Na, D., Lee, J., Lee, H.-S., and Jung, K.-H. A neural pre-conditioning active learning algorithm to reduce label complexity. In Advances in Neural Information Processing Systems, 2022.
  • Kothapalli (2023) Kothapalli, V. Neural collapse: A review on modelling principles and generalization. In Transactions on Machine Learning Research, 2023.
  • Krizhevsky (2009) Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, 2009.
  • Kwon & Zou (2022) Kwon, Y. and Zou, J. Beta shapley: a unified and noise-reduced data valuation framework for machine learning. In International Conference on Artificial Intelligence and Statistics, 2022.
  • Kwon et al. (2021) Kwon, Y., Rivas, M. A., and Zou, J. Efficient computation and analysis of distributional shapley values. In International Conference on Artificial Intelligence and Statistics, 2021.
  • Lee et al. (2018) Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., and Sohl-Dickstein, J. Deep neural networks as gaussian processes. In International Conference on Learning Representations, 2018.
  • Lee & Chung (2024) Lee, Y. and Chung, H. W. SelMatch: Effectively scaling up dataset distillation via selection-based initialization and partial updates by trajectory matching. In International Conference on Machine Learning, 2024.
  • Maini et al. (2022) Maini, P., Garg, S., Lipton, Z. C., and Kolter, J. Z. Characterizing datapoints via second-split forgetting. In Advances in Neural Information Processing Systems, 2022.
  • Mirzasoleiman et al. (2020) Mirzasoleiman, B., Bilmes, J., and Leskovec, J. Coresets for data-efficient training of machine learning models. In Proceedings of the 37th International Conference on Machine Learning, 2020.
  • Neal (1996) Neal, R. M. Bayesian learning for neural networks. Springer Science & Business Media, 1996.
  • Paszke et al. (2019) Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. In Advances in neural information processing systems, 2019.
  • Paul et al. (2021) Paul, M., Ganguli, S., and Dziugaite, G. K. Deep learning on a data diet: Finding important examples early in training. In Advances in Neural Information Processing Systems, 2021.
  • Pleiss et al. (2020) Pleiss, G., Zhang, T., Elenberg, E. R., and Weinberger, K. Q. Identifying mislabeled data using the area under the margin ranking. In Advances in Neural Information Processing Systems, 2020.
  • Pooladzandi et al. (2022) Pooladzandi, O., Davini, D., and Mirzasoleiman, B. Adaptive second order coresets for data-efficient machine learning. In International Conference on Machine Learning, 2022.
  • Pruthi et al. (2020) Pruthi, G., Liu, F., Kale, S., and Sundararajan, M. Estimating training data influence by tracing gradient descent. In Advances in Neural Information Processing Systems, 2020.
  • Schwartz et al. (2020) Schwartz, R., Dodge, J., Smith, N. A., and Etzioni, O. Green ai. In Communications of the ACM, 2020.
  • Sener & Savarese (2018) Sener, O. and Savarese, S. Active learning for convolutional neural networks: A core-set approach. In International Conference on Learning Representations, 2018.
  • Shin et al. (2023) Shin, S., Bae, H., Shin, D., Joo, W., and Moon, I.-C. Loss-curvature matching for dataset selection and condensation. In International Conference on Artificial Intelligence and Statistics, 2023.
  • Sorscher et al. (2022) Sorscher, B., Geirhos, R., Shekhar, S., Ganguli, S., and Morcos, A. S. Beyond neural scaling laws: beating power law scaling via data pruning. In Advances in Neural Information Processing Systems, 2022.
  • Strubell et al. (2019) Strubell, E., Ganesh, A., and McCallum, A. Energy and policy considerations for deep learning in nlp. In Association for Computational Linguistics, 2019.
  • Swayamdipta et al. (2020) Swayamdipta, S., Schwartz, R., Lourie, N., Wang, Y., Hajishirzi, H., Smith, N. A., and Choi, Y. Dataset cartography: Mapping and diagnosing datasets with training dynamics. In Conference on Empirical Methods in Natural Language Processing, 2020.
  • Tan & Le (2019) Tan, M. and Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, 2019.
  • Toneva et al. (2019) Toneva, M., Sordoni, A., des Combes, R. T., Trischler, A., Bengio, Y., and Gordon, G. J. An empirical study of example forgetting during deep neural network learning. In International Conference on Learning Representations, 2019.
  • van der Maaten & Hinton (2008) van der Maaten, L. and Hinton, G. Visualizing data using t-SNE. In Journal of Machine Learning Research, 2008.
  • Wu et al. (2022) Wu, Z., Shu, Y., and Low, B. K. H. DAVINZ: Data valuation using deep neural networks at initialization. In International Conference on Machine Learning, 2022.
  • Xia et al. (2023) Xia, X., Liu, J., Yu, J., Shen, X., Han, B., and Liu, T. Moderate coreset: A universal method of data selection for real-world data-efficient deep learning. In International Conference on Learning Representations, 2023.
  • Yang et al. (2023) Yang, Y., Kang, H., and Mirzasoleiman, B. Towards sustainable learning: Coresets for data-efficient deep learning. In International Conference on Machine Learning, 2023.
  • Zheng et al. (2023) Zheng, H., Liu, R., Lai, F., and Prakash, A. Coverage-centric coreset selection for high pruning rates. In International Conference on Learning Representations, 2023.
  • Zhou et al. (2022) Zhou, Y., Nezhadarya, E., and Ba, J. Dataset distillation using neural feature regression. In Advances in Neural Information Processing Systems, 2022.

Appendix A Proof of Theoretical Analysis

A.1 Linear ridge regression

The solution of the linear ridge regression problem is derived as follows.

\begin{align*}
L(\mathbf{w}) &= \|\mathbf{y} - \mathbf{X}^{\top}\mathbf{w}\|_2^2 + \lambda\|\mathbf{w}\|_2^2 \\
\frac{\partial L}{\partial \mathbf{w}} &= 2\mathbf{X}\mathbf{X}^{\top}\mathbf{w} - 2\mathbf{X}\mathbf{y} + 2\lambda\mathbf{w} = 0 \quad\Rightarrow\quad \mathbf{w} = (\lambda\mathbf{I} + \mathbf{X}\mathbf{X}^{\top})^{-1}\mathbf{X}\mathbf{y} \\
\therefore\ \mathbf{w}_{\mathbf{S}} &= (\lambda\mathbf{I} + \mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top})^{-1}\mathbf{X}_{\mathbf{S}}\mathbf{y}_{\mathbf{S}} = \mathbf{X}_{\mathbf{S}}(\lambda\mathbf{I} + \mathbf{X}_{\mathbf{S}}^{\top}\mathbf{X}_{\mathbf{S}})^{-1}\mathbf{y}_{\mathbf{S}}
\end{align*}
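The last equality above, which moves the inverse from a $d\times d$ matrix to an $m\times m$ matrix, can be checked numerically. The following is a minimal sketch using NumPy; the function names `ridge_primal` and `ridge_dual` and the specific sizes are illustrative choices, not part of the paper.

```python
import numpy as np

def ridge_primal(X, y, lam):
    """w = (lam * I_d + X X^T)^{-1} X y, with X of shape (d, m)."""
    d = X.shape[0]
    return np.linalg.solve(lam * np.eye(d) + X @ X.T, X @ y)

def ridge_dual(X, y, lam):
    """w = X (lam * I_m + X^T X)^{-1} y, with X of shape (d, m)."""
    m = X.shape[1]
    return X @ np.linalg.solve(lam * np.eye(m) + X.T @ X, y)

rng = np.random.default_rng(0)
d, m, lam = 20, 8, 0.1                        # illustrative sizes
X_S = rng.standard_normal((d, m)) / np.sqrt(d)
y_S = np.where(X_S[0] > 0, 1.0, -1.0)         # label = sign of the first coordinate

w_primal = ridge_primal(X_S, y_S, lam)
w_dual = ridge_dual(X_S, y_S, lam)
max_gap = float(np.max(np.abs(w_primal - w_dual)))
```

The two forms agree up to floating-point error. The second form is the cheaper one when $m < d$, because it only inverts an $m\times m$ matrix.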

A.2 Proof of Theorem 1

In this section, we provide the detailed proof of Theorem 1 in Sec. 3.2. We assume that $n = \mathrm{poly}(d)$ data inputs $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_n$ are sampled from a normalized multivariate normal distribution, $\mathcal{D} = \frac{1}{\sqrt{d}}\mathcal{N}(0, \mathbf{I}_d) = \frac{1}{\sqrt{d}}(\mathcal{N}_1, \mathcal{N}_2, \dots, \mathcal{N}_d)$, where $\{\mathcal{N}_k\}$ are i.i.d. normal distributions. Recall that the label $y_i$ of sample $\mathbf{x}_i$ is determined by the sign of its first element, i.e., if $(\mathbf{x}_i)_1 > 0$ then $y_i = 1$, and if $(\mathbf{x}_i)_1 < 0$ then $y_i = -1$. We select a subset of size $m$, denoted by $(\mathbf{X}_{\mathbf{S}}, \mathbf{y}_{\mathbf{S}}) \in \mathbb{R}^{d\times m} \times \{-1, 1\}^{m}$.

We first provide a high-level proof idea of Theorem 1. Note that the optimal $\mathbf{w}_{\mathbf{S}} = \operatorname*{arg\,min}_{\mathbf{w}\in\mathbb{R}^{d}} \|\mathbf{y}_{\mathbf{S}} - \mathbf{X}_{\mathbf{S}}^{\top}\mathbf{w}\|_2^2$ can be written as $\mathbf{w}_{\mathbf{S}} = \mathbf{X}_{\mathbf{S}}(\mathbf{X}_{\mathbf{S}}^{\top}\mathbf{X}_{\mathbf{S}})^{-1}\mathbf{y}_{\mathbf{S}}$ when $m \leq d$, and $\mathbf{w}_{\mathbf{S}} = (\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top})^{-1}\mathbf{X}_{\mathbf{S}}\mathbf{y}_{\mathbf{S}}$ when $m \geq d$. Let us first consider the case of $m \leq d$. Due to the properties of high dimensional multivariate normals, we have $\|\mathbf{x}_i\| \in [1 \pm \sqrt{7\ln n/(2d)}]$ for all $i \in [n]$ and $|\mathbf{x}_i^{\top}\mathbf{x}_j| \leq \sqrt{7\ln n/(2d)}$ for all $i \neq j \in [n]$ with high probability. Thus, $\|\mathbf{X}_{\mathbf{S}}^{\top}\mathbf{X}_{\mathbf{S}} - \mathbf{I}_m\|_F \leq \sqrt{\frac{m^2(7\ln n)}{2d}}$, where $\mathbf{I}_m$ is the identity matrix of size $m$. When $m \ll \sqrt{d/\ln d}$, we have $(\mathbf{X}_{\mathbf{S}}^{\top}\mathbf{X}_{\mathbf{S}}) \approx (\mathbf{X}_{\mathbf{S}}^{\top}\mathbf{X}_{\mathbf{S}})^{-1} \approx \mathbf{I}_m$, and thus $\mathbf{w}_{\mathbf{S}} = \mathbf{X}_{\mathbf{S}}(\mathbf{X}_{\mathbf{S}}^{\top}\mathbf{X}_{\mathbf{S}})^{-1}\mathbf{y}_{\mathbf{S}} \approx \mathbf{X}_{\mathbf{S}}\mathbf{y}_{\mathbf{S}}$, which implies that $(\mathbf{w}_{\mathbf{S}})_1 \approx \sum_{i=1}^{m} |(\mathbf{x}_i)_1|$.

Let us next consider the case of $m \geq d$. Note that the diagonal terms of $\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top}$ are $\sum_{i=1}^{m} |(\mathbf{x}_i)_k|^2 = \Theta(m/d)$ for $k \in [d]$ and the off-diagonal terms are $\sum_{i=1}^{m} (\mathbf{x}_i)_k(\mathbf{x}_i)_l = O(\sqrt{m\ln d}/d)$ for $k \neq l \in [d]$ with high probability. The eigenvalues of $\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top}$ can be shifted from its diagonal entries $\left(\sum_{i=1}^{m} |(\mathbf{x}_i)_1|^2, \dots, \sum_{i=1}^{m} |(\mathbf{x}_i)_d|^2\right)$ by at most $\frac{\sqrt{m\ln d}}{d}\cdot d = \sqrt{m\ln d}$ by the effect of its off-diagonal entries. Thus, when $m/d \gg \sqrt{m\ln d}$, i.e., $m \gg d^2\ln d$, we can have $\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top} \approx \mathrm{diag}\left(\sum_{i=1}^{m} |(\mathbf{x}_i)_1|^2, \dots, \sum_{i=1}^{m} |(\mathbf{x}_i)_d|^2\right)$ and $(\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top})^{-1} \approx \mathrm{diag}\left(\left(\sum_{i=1}^{m} |(\mathbf{x}_i)_1|^2\right)^{-1}, \dots, \left(\sum_{i=1}^{m} |(\mathbf{x}_i)_d|^2\right)^{-1}\right)$. Since $\mathbf{w}_{\mathbf{S}} = (\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top})^{-1}\mathbf{X}_{\mathbf{S}}\mathbf{y}_{\mathbf{S}}$, the first coordinate value of $\mathbf{w}_{\mathbf{S}}$ is $(\mathbf{w}_{\mathbf{S}})_1 \approx \left(\sum_{i=1}^{m} |(\mathbf{x}_i)_1|\right) / \left(\sum_{i=1}^{m} |(\mathbf{x}_i)_1|^2\right)$.
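The two approximations for $(\mathbf{w}_{\mathbf{S}})_1$ in the proof idea above can be illustrated numerically. The sketch below is an assumption-laden toy check, not part of the proof: the helper names and the sizes $d$ and $m$ are chosen here for illustration, and the subset is drawn at random rather than selected by a difficulty score.

```python
import numpy as np

def sample_inputs(d, m, rng):
    """Draw m columns from (1/sqrt(d)) * N(0, I_d); label = sign of first coordinate."""
    X = rng.standard_normal((d, m)) / np.sqrt(d)
    y = np.where(X[0] > 0, 1.0, -1.0)
    return X, y

def first_coord_deficient(X, y):
    # m <= d: w_S = X_S (X_S^T X_S)^{-1} y_S
    w = X @ np.linalg.solve(X.T @ X, y)
    return float(w[0])

def first_coord_sufficient(X, y):
    # m >= d: w_S = (X_S X_S^T)^{-1} X_S y_S
    w = np.linalg.solve(X @ X.T, X @ y)
    return float(w[0])

rng = np.random.default_rng(0)

# Sample-deficient regime: m = 5 is far below sqrt(d / ln d) for d = 200000.
X, y = sample_inputs(d=200_000, m=5, rng=rng)
approx_small = float(np.sum(np.abs(X[0])))                   # sum_i |(x_i)_1|
rel_err_small = abs(first_coord_deficient(X, y) - approx_small) / approx_small

# Sample-sufficient regime: m = 200000 is far above d^2 ln d for d = 5.
X, y = sample_inputs(d=5, m=200_000, rng=rng)
approx_large = float(np.sum(np.abs(X[0])) / np.sum(X[0] ** 2))
rel_err_large = abs(first_coord_sufficient(X, y) - approx_large) / approx_large
```

Both relative errors should be small in these regimes; they would be expected to grow as $m$ moves toward the intermediate range between $\sqrt{d/\ln d}$ and $d^2\ln d$, which the approximations do not cover.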

To more formally state and prove Theorem 1, we provide Theorem 2 to explain the regime of low selection ratio ($m = o(\sqrt{d/\ln d})$) and Theorem 3 for the high selection ratio ($m = \omega(d^2\ln d)$). To prove the two theorems, we use the following three lemmas: the tail bounds on chi-square and Gaussian distributions, and the Gershgorin circle theorem, stated below.

Lemma A.1 (Chi-square tail bound).

If $\mathbf{x} \sim \chi^2(d)$, then $\mathbb{P}(\chi^2(d) \geq d + 2\sqrt{dt} + 2t) \leq e^{-t}$ and $\mathbb{P}(\chi^2(d) \leq d - 2\sqrt{dt}) \leq e^{-t}$.
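As a sanity check on Lemma A.1, the two tail probabilities can be estimated by Monte Carlo and compared with $e^{-t}$. This is a minimal sketch under assumed values ($d = 50$, $t = 1$, a fixed seed); the function name `chi_square_tail_check` is hypothetical.

```python
import numpy as np

def chi_square_tail_check(d, t, num_samples=200_000, seed=0):
    """Estimate both tail probabilities of chi^2(d) from Lemma A.1 by sampling."""
    rng = np.random.default_rng(seed)
    samples = rng.chisquare(df=d, size=num_samples)
    upper_freq = float(np.mean(samples >= d + 2 * np.sqrt(d * t) + 2 * t))
    lower_freq = float(np.mean(samples <= d - 2 * np.sqrt(d * t)))
    return upper_freq, lower_freq, float(np.exp(-t))

upper_freq, lower_freq, bound = chi_square_tail_check(d=50, t=1.0)
```

Since $d\,\|\mathbf{x}_i\|_2^2$ follows a $\chi^2(d)$ distribution under the sampling model of this appendix, a bound of this kind is what controls how far $\|\mathbf{x}_i\|_2$ can deviate from 1.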

Lemma A.2 (Gaussian tail bound).

If $\mathbf{x} \sim \mathcal{N}(0, 1)$, then $\mathbb{P}(|\mathbf{x}| \geq t) \leq e^{-t^2/2}$.
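Lemma A.2 can be checked directly, because the two-sided Gaussian tail has the closed form $\mathbb{P}(|\mathbf{x}| \geq t) = \mathrm{erfc}(t/\sqrt{2})$. The sketch below compares it with $e^{-t^2/2}$ on a grid of $t \geq 0$; the grid and the function names are illustrative choices.

```python
import math

def gaussian_two_sided_tail(t):
    """Exact P(|x| >= t) for x ~ N(0, 1), via the complementary error function."""
    return math.erfc(t / math.sqrt(2.0))

def gaussian_tail_bound(t):
    """Right-hand side of Lemma A.2."""
    return math.exp(-t * t / 2.0)

# Grid t = 0.0, 0.1, ..., 6.0 (illustrative range).
ts = [0.1 * k for k in range(61)]
bound_holds = all(gaussian_two_sided_tail(t) <= gaussian_tail_bound(t) for t in ts)
```

The bound is tight at $t = 0$, where both sides equal 1, and becomes increasingly loose as $t$ grows.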

Lemma A.3 (Gershgorin circle theorem).

Let $A \in \mathbb{C}^{d\times d}$ be a matrix with its $(i,j)$-th entry equal to $a_{ij}$. Let $r_i := \sum_{j\neq i} |a_{ij}|$ and let $D_i := D_{r_i}(a_{ii})$ be the closed ball centered at $a_{ii}$ with radius $r_i$. Then, every eigenvalue of $A$ is contained in $\cup_i D_i$.

The Gershgorin circle theorem confines the eigenvalues of a matrix to a union of disks: the $i$-th disk is centered at the diagonal entry $a_{ii}$, and its radius is the sum of the absolute values of the off-diagonal entries in row $i$.
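The statement of Lemma A.3 can be checked numerically on a small example. The following is a minimal sketch using NumPy; the helper names `gershgorin_disks` and `eigenvalues_in_disks`, the tolerance, and the random test matrix are illustrative assumptions rather than anything from the paper.

```python
import numpy as np


def gershgorin_disks(A):
    """Return (centers, radii) with centers a_ii and radii r_i = sum_{j != i} |a_ij|."""
    A = np.asarray(A, dtype=complex)
    centers = np.diag(A)
    radii = np.sum(np.abs(A), axis=1) - np.abs(centers)
    return centers, radii


def eigenvalues_in_disks(A, tol=1e-9):
    """Check that every eigenvalue of A lies in at least one Gershgorin disk."""
    centers, radii = gershgorin_disks(A)
    eigvals = np.linalg.eigvals(np.asarray(A, dtype=complex))
    # dist[e, i] = |lambda_e - a_ii|
    dist = np.abs(eigvals[:, None] - centers[None, :])
    return bool(np.all(np.any(dist <= radii[None, :] + tol, axis=1)))


rng = np.random.default_rng(0)
A = rng.normal(size=(6, 6)) + 1j * rng.normal(size=(6, 6))
print(eigenvalues_in_disks(A))
```

For a matrix that is close to diagonal, the radii are small, so the eigenvalues are pinned near the diagonal entries.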

Now, we provide Theorem 2, which will be used to explain why selecting low-scoring (easy) data samples results in good performance when the subset size $|\mathbf{S}|$ is small.

Theorem 2 (Sample-deficient regime).

If $m = o\left(\sqrt{d/\ln d}\right)$, then $\|(\mathbf{X}_{\mathbf{S}}^{\top}\mathbf{X}_{\mathbf{S}})^{-1} - \mathbf{I}_{m}\|_{2} \leq m\sqrt{\frac{7\ln n}{2d}}$ with high probability as $d \to \infty$.

Proof.

First, we prove two properties of the high-dimensional multivariate normal distribution: the norm of every $\mathbf{x}_{i}$ is almost equal to $1$, and every two independent vectors are almost orthogonal for large enough $d$. For any $1 \leq i \neq j \leq n$, with probability $1 - O(\frac{1}{n})$, we have

$$1 - \sqrt{\frac{7\ln n}{2d}} \leq \|\mathbf{x}_{i}\|_{2} \leq 1 + \sqrt{\frac{7\ln n}{2d}}, \quad \text{and} \tag{3}$$
$$|\mathbf{x}_{i}^{\top}\mathbf{x}_{j}| < \sqrt{\frac{7\ln n}{2d}}. \tag{4}$$

The first property (Eq. 3) can be proved by Lemma A.1. Let $t = 3\ln n$ for Lemma A.1. Then,

$$\mathbb{P}\left(\chi^{2}(d) \geq d + 2\sqrt{3d\ln n} + 6\ln n\right) \leq \frac{1}{n^{3}} \quad \text{and} \quad \mathbb{P}\left(\chi^{2}(d) \leq d - 2\sqrt{3d\ln n}\right) \leq \frac{1}{n^{3}}.$$

Since $2\sqrt{3d\ln n} + 6\ln n \leq \sqrt{13d\ln n}$ for large enough $d$, with probability $1 - O(\frac{1}{n^{3}})$ we have

$$1 - \sqrt{\frac{13\ln n}{d}} \leq \frac{1}{d}\chi^{2}(d) \leq 1 + \sqrt{\frac{13\ln n}{d}} \;\xRightarrow{d \to \infty}\; 1 - \sqrt{\frac{7\ln n}{2d}} \leq \sqrt{\frac{1}{d}\chi^{2}(d)} \leq 1 + \sqrt{\frac{7\ln n}{2d}}. \tag{5}$$

Since $\mathbf{x}_{i} \sim \frac{1}{\sqrt{d}}\mathcal{N}(0, \mathbf{I}_{d})$ and $\|\mathbf{x}_{i}\|_{2}^{2} = \frac{1}{d}\chi^{2}(d)$, a union bound over $i \in [n]$ shows that Eq. 3 holds for all $i \in [n]$ with probability $1 - O(\frac{1}{n^{2}})$.
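The condition "for large enough $d$" in the inequality $2\sqrt{3d\ln n} + 6\ln n \leq \sqrt{13d\ln n}$ can be made explicit. The short derivation below is our own worked step, not part of the original proof, and assumes $n > 1$ so that $\ln n > 0$:

```latex
\begin{aligned}
2\sqrt{3d\ln n} + 6\ln n \le \sqrt{13 d\ln n}
&\iff 6\ln n \le \left(\sqrt{13} - 2\sqrt{3}\right)\sqrt{d\ln n} \\
&\iff d \ge \frac{36\ln n}{\left(\sqrt{13} - 2\sqrt{3}\right)^{2}}
      = \frac{36\ln n}{25 - 4\sqrt{39}}
      \approx 1.8 \times 10^{3}\,\ln n .
\end{aligned}
```

The constant $13$ is only slightly larger than $12 = (2\sqrt{3})^{2}$, which is why the threshold on $d$ is large; a looser constant would relax it.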

The proof of the second property (Eq. 4) also utilizes Lemma A.1. Let $\mathbf{x}_{i} = \frac{1}{\sqrt{d}}(\mathcal{N}_{i1}, \mathcal{N}_{i2}, \dots, \mathcal{N}_{id})$ and $\mathbf{x}_{j} = \frac{1}{\sqrt{d}}(\mathcal{N}_{j1}, \mathcal{N}_{j2}, \dots, \mathcal{N}_{jd})$, where $\mathcal{N}_{ik}, \mathcal{N}_{jk}$ are i.i.d. normals $\mathcal{N}(0,1)$. Then,

$$\begin{aligned}
\mathbf{x}_{i}^{\top}\mathbf{x}_{j} &= \frac{1}{d}\sum_{k=1}^{d}\mathcal{N}_{ik}\mathcal{N}_{jk} = \frac{1}{d}\sum_{k=1}^{d}\frac{(\mathcal{N}_{ik} + \mathcal{N}_{jk})^{2} - (\mathcal{N}_{ik} - \mathcal{N}_{jk})^{2}}{4} \\
&= \frac{1}{2d}\sum_{k=1}^{d}\left[\left(\frac{\mathcal{N}_{ik} + \mathcal{N}_{jk}}{\sqrt{2}}\right)^{2} - \left(\frac{\mathcal{N}_{ik} - \mathcal{N}_{jk}}{\sqrt{2}}\right)^{2}\right] = \frac{1}{2d}\sum_{k=1}^{d}\left[(\mathcal{N}_{k}^{\prime})^{2} - (\mathcal{N}_{k}^{\prime\prime})^{2}\right] \\
&= \frac{1}{2d}\left(\chi_{1}^{2}(d) - \chi_{2}^{2}(d)\right),
\end{aligned}$$

where $\mathcal{N}_{k}^{\prime}$ and $\mathcal{N}_{k}^{\prime\prime}$ are i.i.d. normals, and $\chi_{1}^{2}(d)$ and $\chi_{2}^{2}(d)$ are i.i.d. chi-squares.
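The decomposition of the inner product into a difference of two chi-square variables is an algebraic identity, so it can be verified on any sample. Below is a small sketch with NumPy; the dimension, the seed, and the variable names are illustrative assumptions. The two rotated variables $(\mathcal{N}_{ik} \pm \mathcal{N}_{jk})/\sqrt{2}$ are independent because they are jointly Gaussian with zero covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 1000
u = rng.standard_normal(d)  # plays the role of (N_i1, ..., N_id)
v = rng.standard_normal(d)  # plays the role of (N_j1, ..., N_jd)

# Left-hand side: inner product of the scaled vectors x_i and x_j.
x_i, x_j = u / np.sqrt(d), v / np.sqrt(d)
inner = x_i @ x_j

# Right-hand side: difference of two chi-square-type sums, divided by 2d.
n_prime = (u + v) / np.sqrt(2)
n_dprime = (u - v) / np.sqrt(2)
chi1 = np.sum(n_prime ** 2)
chi2 = np.sum(n_dprime ** 2)
rhs = (chi1 - chi2) / (2 * d)

print(inner, rhs)
```

The two printed values agree up to floating-point rounding, and both are small compared with 1, in line with the near-orthogonality claimed in Eq. 4.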

As shown in Eq. 5, with probability $1 - O(\frac{1}{n^{3}})$,

$$1 - \sqrt{\frac{7\ln n}{2d}} \leq \sqrt{\frac{1}{d}\chi_{1}^{2}(d)} \quad \text{and} \quad \sqrt{\frac{1}{d}\chi_{2}^{2}(d)} \leq 1 + \sqrt{\frac{7\ln n}{2d}}. \tag{6}$$

Thus, we have

$$\left|\frac{1}{2d}\left(\chi_{1}^{2}(d) - \chi_{2}^{2}(d)\right)\right| \leq \sqrt{\frac{7\ln n}{2d}}. \tag{7}$$

By applying a union bound, for all $i \neq j \in [n]$, with probability $1 - O(\frac{1}{n})$, we have $|\mathbf{x}_{i}^{\top}\mathbf{x}_{j}| \leq \sqrt{\frac{7\ln n}{2d}}$. From Eq. 3 and Eq. 4, we obtain that $\|\mathbf{X}_{\mathbf{S}}^{\top}\mathbf{X}_{\mathbf{S}} - \mathbf{I}_{m}\|_{F}^{2} \leq m^{2}\left(\frac{7\ln n}{2d}\right)$.

Let $\mathbf{A} = \mathbf{X}_{\mathbf{S}}^{\top}\mathbf{X}_{\mathbf{S}}$. We derive the bound on $\|\mathbf{I} - \mathbf{A}^{-1}\|_{2}$ from the bounds on $\|\mathbf{I} - \mathbf{A}\|_{2}$ and $\|\mathbf{A}^{-1}\|_{2}$. First, note that

$$\|\mathbf{I} - \mathbf{A}\|_{2} \leq \|\mathbf{I} - \mathbf{A}\|_{F} \leq m\sqrt{\frac{7\ln n}{2d}} \quad \text{and} \quad m\sqrt{\frac{7\ln n}{2d}} \rightarrow 0 \text{ as } d \rightarrow \infty$$

since $m = o(\sqrt{d/\ln d})$ and $n = \mathrm{poly}(d)$. Moreover, we have

$$\begin{split}
&\|\mathbf{A}^{-1}\|_{2} = \|(\mathbf{I} - (\mathbf{I} - \mathbf{A}))^{-1}\|_{2} = \|\mathbf{I} + (\mathbf{I} - \mathbf{A}) + (\mathbf{I} - \mathbf{A})^{2} + \dots\|_{2} \\
&\leq \|\mathbf{I}\|_{2} + \|(\mathbf{I} - \mathbf{A})\|_{2} + \|(\mathbf{I} - \mathbf{A})^{2}\|_{2} + \dots \leq 1 + \sum_{k=1}^{\infty}\left(m\sqrt{\frac{7\ln n}{2d}}\right)^{k} \leq 2.
\end{split}$$

Finally, we have

$$\|\mathbf{I} - \mathbf{A}^{-1}\|_{2} \leq \|\mathbf{A}^{-1}\|_{2}\|\mathbf{I} - \mathbf{A}\|_{2} \leq m\sqrt{\frac{7\ln n}{2d}}.$$
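The deterministic linear-algebra steps at the end of this proof (the Neumann-series bound on $\|\mathbf{A}^{-1}\|_{2}$ and the submultiplicative bound on $\|\mathbf{I} - \mathbf{A}^{-1}\|_{2}$) can be illustrated on simulated data. The sketch below draws $\mathbf{x}_{i} \sim \frac{1}{\sqrt{d}}\mathcal{N}(0, \mathbf{I}_{d})$ and reports the three norms; the function name, the values of $d$ and $m$, and the seed are our own assumptions, and the code does not attempt to verify the probabilistic constants of Theorem 2.

```python
import numpy as np


def theorem2_quantities(d, m, seed=0):
    """Spectral norms of I - A, A^{-1}, and I - A^{-1} for A = X_S^T X_S."""
    rng = np.random.default_rng(seed)
    # Columns x_i ~ (1/sqrt(d)) N(0, I_d); X has shape (d, m).
    X = rng.standard_normal((d, m)) / np.sqrt(d)
    A = X.T @ X
    I = np.eye(m)
    A_inv = np.linalg.inv(A)
    return {
        "I_minus_A": np.linalg.norm(I - A, 2),
        "A_inv": np.linalg.norm(A_inv, 2),
        "I_minus_A_inv": np.linalg.norm(I - A_inv, 2),
    }


# Sample-deficient regime: m much smaller than sqrt(d / ln d).
q = theorem2_quantities(d=20000, m=10)
for name, value in q.items():
    print(f"{name}: {value:.4f}")
```

With $m \ll \sqrt{d/\ln d}$, the Gram matrix $\mathbf{A}$ is close to the identity, so all of $\|\mathbf{I} - \mathbf{A}\|_{2}$ and $\|\mathbf{I} - \mathbf{A}^{-1}\|_{2}$ are small and $\|\mathbf{A}^{-1}\|_{2}$ stays close to 1.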

We next provide Theorem 3, which explains why selecting high-scoring (difficult) data samples results in good performance when the subset size $|\mathbf{S}|$ is large ($m = \omega(d^{2}\ln d)$). Assume that we select the subset $\mathbf{X}_{\mathbf{S}}$ by observing the first element of each data sample, $(\mathbf{x}_{i})_{1}$. Suppose that we select the data samples whose first elements are $\frac{a_{1}}{\sqrt{d}}, \frac{a_{2}}{\sqrt{d}}, \dots, \frac{a_{m}}{\sqrt{d}}$, where $a_{i} \in \Theta(1)$, and let $a := \frac{\sum_{i=1}^{m} a_{i}^{2}}{m}$.
The elements of the other coordinates are independent normals, i.e., $(\mathbf{x}_{i})_{k} \sim \mathcal{N}(0,1)$ for $k \geq 2$. We will prove that $(\frac{d}{m}\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top})^{-1}$ can be approximated by a diagonal matrix whose first diagonal element is equal to $\frac{1}{a}$ and whose other diagonal elements are equal to 1.
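The construction above can be simulated to see how close $\frac{d}{m}\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top}$ is to $\mathrm{diag}(a, 1, \dots, 1)$ when $m$ is much larger than $d$. The following is a rough sketch under explicit assumptions: the coordinates $k \geq 2$ are drawn as $\mathcal{N}(0,1)/\sqrt{d}$, matching the $1/\sqrt{d}$ scaling of the first coordinate, and the values of `a_vals`, `d`, and `m` as well as the helper name are hypothetical choices for illustration.

```python
import numpy as np


def scaled_gram_and_target(a_vals, d, seed=0):
    """Build (d/m) X_S X_S^T and the diagonal target diag(a, 1, ..., 1).

    Assumption: coordinates k >= 2 are N(0, 1) / sqrt(d), matching the
    1/sqrt(d) scaling of the first coordinate a_i / sqrt(d).
    """
    rng = np.random.default_rng(seed)
    a_vals = np.asarray(a_vals, dtype=float)
    m = a_vals.size
    X = rng.standard_normal((d, m)) / np.sqrt(d)  # columns are samples
    X[0, :] = a_vals / np.sqrt(d)                 # fixed first coordinates
    G = (d / m) * (X @ X.T)
    a = np.mean(a_vals ** 2)
    B = np.diag(np.concatenate(([a], np.ones(d - 1))))
    return G, B, a


d, m = 20, 200_000
a_vals = np.linspace(1.5, 2.5, m)  # hypothetical Theta(1) first coordinates
G, B, a = scaled_gram_and_target(a_vals, d)
print("a =", a)
print("||G - B||_2 =", np.linalg.norm(G - B, 2))
print("||G^-1 - B^-1||_2 =",
      np.linalg.norm(np.linalg.inv(G) - np.linalg.inv(B), 2))
```

The $(1,1)$ entry of the scaled Gram matrix equals $a$ exactly by construction, while the remaining entries fluctuate around those of the diagonal target, with deviations that shrink as $m$ grows.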

Theorem 3 (Sample-sufficient regime).

Let $\mathbf{B} = \mathrm{diag}(a, 1, 1, \dots, 1) \in \mathbb{R}^{d \times d}$. If $m = \omega(d^{2}\ln d)$, then $\|(\frac{d}{m}\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top})^{-1} - \mathbf{B}^{-1}\|_{2} \leq c^{\prime}d^{2}\frac{\ln d}{m}$ for some constant $c^{\prime} > 0$ with high probability as $d \to \infty$.

Proof.

The elements of $\frac{d}{m}\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top}$ are expressed as follows, where $k \neq l \in [d] \setminus \{1\}$:

$$\begin{aligned}
\frac{d}{m}(\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top})_{11} &= \frac{1}{m}\sum_{i=1}^{m} a_{i}^{2} = a \\
\frac{d}{m}(\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top})_{k1} &= \frac{1}{m}\sum_{i=1}^{m} a_{i}\mathcal{N}_{ik} = \mathcal{N}\left(0, \frac{\sum_{i=1}^{m} a_{i}^{2}}{m^{2}}\right) = \mathcal{N}\left(0, \frac{a}{m}\right) \\
\frac{d}{m}(\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top})_{kk} &= \frac{d}{m}\sum_{i=1}^{m}(\mathbf{x}_{i})_{k}^{2} = \frac{1}{m}\sum_{i=1}^{m}\mathcal{N}_{ik}^{2} = \frac{1}{m}\chi^{2}(m) \\
\frac{d}{m}(\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top})_{kl} &= \frac{d}{m}\sum_{i=1}^{m}(\mathbf{x}_{i})_{k}(\mathbf{x}_{i})_{l} = \frac{1}{m}\sum_{i=1}^{m}\mathcal{N}_{ik}\mathcal{N}_{il}.
\end{aligned}$$

By the Gaussian tail bound (Lemma A.2), if $\mathbf{x} \sim \mathcal{N}(0, a/m)$, then we have

$$\mathbb{P}\left(|\mathbf{x}| \geq 2\sqrt{a\ln d/m}\right) \leq 1/d^{2}.$$

Note that $\frac{1}{m}\chi^{2}(m) = \frac{1}{m}\|\mathbf{x}\|_{2}^{2}$ for $\mathbf{x} \sim \mathcal{N}(0, \mathbf{I}_{m})$. By Lemma A.1, we have a result similar to Eq. 4,

$$\mathbb{P}\left(\left|\|\mathbf{x}\|_{2} - 1\right| \geq \sqrt{7\ln d/2m}\right) \leq 1/d^{2}.$$

For $\mathcal{N}_{ik}\mathcal{N}_{il}$, by applying the result of Eq. 4, we can also show that

$$\mathbb{P}\left(\|\mathbf{x}\|_{2} \geq \sqrt{7\ln d/2m}\right) \leq 1/d^{3}.$$

Combining the above three bounds, we obtain that for all $k \neq l \in [d]$, with probability $1 - O(\frac{1}{d})$,

$$\left|\frac{d}{m}(\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top})_{k1}\right| \leq 2\sqrt{\frac{a\ln d}{m}}, \quad \left|\frac{d}{m}(\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top})_{kk} - 1\right| \leq \sqrt{\frac{7\ln d}{2m}}, \quad \text{and} \quad \left|\frac{d}{m}(\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top})_{kl}\right| \leq \sqrt{\frac{7\ln d}{2m}}.$$

Thus, we obtain $\|\frac{d}{m}(\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top})-\mathbf{B}\|_{F}\leq cd^{2}\frac{\ln d}{m}$ for some constant $c>0$, with probability $1-O(\frac{1}{d})$. Let $\mathbf{A}=\frac{d}{m}(\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top})$. Then

$$\begin{split}\|\mathbf{A}^{-1}-\mathbf{B}^{-1}\|_{2}&\leq\|\mathbf{A}^{-1}\|_{2}\|\mathbf{I}-\mathbf{A}\mathbf{B}^{-1}\|_{2}\leq\|\mathbf{A}^{-1}\|_{2}\|\mathbf{B}-\mathbf{A}\|_{2}\|\mathbf{B}^{-1}\|_{2}\\&\leq\|\mathbf{A}^{-1}\|_{2}\|\mathbf{B}-\mathbf{A}\|_{F}\|\mathbf{B}^{-1}\|_{2}\leq\|\mathbf{A}^{-1}\|_{2}\;cd^{2}\frac{\ln d}{m}\cdot 1.\end{split}$$

It remains to prove that $\|\mathbf{A}^{-1}\|_{2}$ is bounded. By the Gershgorin circle theorem, every eigenvalue of $\mathbf{A}=\frac{d}{m}(\mathbf{X}_{\mathbf{S}}\mathbf{X}_{\mathbf{S}}^{\top})$ lies within distance $r_{i}$ of some diagonal entry $a_{ii}$, where $r_{i}$ is the sum of the absolute off-diagonal entries in the $i$-th row. Since $m\in\omega(d^{2}\ln d)$,

$$r_{1}=\sum_{j\neq 1}|a_{1j}|<2d\sqrt{\frac{a\ln d}{m}}\ll a\quad\text{and}\quad r_{k}=\sum_{j\neq k}|a_{kj}|<d\sqrt{\frac{7\ln d}{2m}}\ll 1.$$

Therefore, every eigenvalue of $\mathbf{A}$ is close to either $a$ or $1$ with probability $1-O(\frac{1}{d})$. Thus, $\|\mathbf{A}^{-1}\|_{2}$ is bounded above by $\max(\frac{1}{a},1)$ plus some constant, which shows that $\|\mathbf{A}^{-1}\|_{2}$ is bounded. Therefore, we have

$$\|\mathbf{A}^{-1}-\mathbf{B}^{-1}\|_{2}\leq\|\mathbf{A}^{-1}\|_{2}\;cd^{2}\frac{\ln d}{m}\leq c^{\prime}d^{2}\frac{\ln d}{m}\quad\text{for some constant }c^{\prime}>0.$$
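The two ingredients of this argument, the inverse-perturbation inequality and the Gershgorin localization of the eigenvalues, can be checked numerically. The following is a minimal sketch, not part of the original analysis. It assumes that $\mathbf{B}$ is the diagonal matrix $\mathrm{diag}(a,1,\dots,1)$ (its definition lies outside this passage) and models $\mathbf{A}$ as $\mathbf{B}$ plus a small symmetric perturbation. The function name `inverse_perturbation_check` and its parameters are hypothetical.

```python
import numpy as np

def inverse_perturbation_check(d=8, a=4.0, noise=1e-2, seed=0):
    """Check ||A^-1 - B^-1||_2 <= ||A^-1||_2 ||B - A||_F ||B^-1||_2 and Gershgorin."""
    rng = np.random.default_rng(seed)
    # Assumption: B = diag(a, 1, ..., 1); A is a small symmetric perturbation of B.
    B = np.eye(d)
    B[0, 0] = a
    E = rng.normal(scale=noise, size=(d, d))
    A = B + (E + E.T) / 2

    A_inv, B_inv = np.linalg.inv(A), np.linalg.inv(B)
    lhs = np.linalg.norm(A_inv - B_inv, 2)  # spectral norm of the difference
    rhs = (np.linalg.norm(A_inv, 2)
           * np.linalg.norm(B - A, "fro")
           * np.linalg.norm(B_inv, 2))

    # Gershgorin radii: sum of absolute off-diagonal entries in each row.
    diag = np.diag(A)
    radii = np.sum(np.abs(A), axis=1) - np.abs(diag)
    eigvals = np.linalg.eigvalsh(A)
    # Every eigenvalue must lie in at least one disc centred at a diagonal entry.
    in_disc = all(np.any(np.abs(ev - diag) <= radii + 1e-12) for ev in eigvals)
    return lhs, rhs, in_disc

lhs, rhs, in_disc = inverse_perturbation_check()
print(f"lhs={lhs:.3e}, rhs={rhs:.3e}, eigenvalues inside Gershgorin discs: {in_disc}")
```

The inequality follows from the identity $\mathbf{A}^{-1}-\mathbf{B}^{-1}=\mathbf{A}^{-1}(\mathbf{B}-\mathbf{A})\mathbf{B}^{-1}$ together with $\|\cdot\|_{2}\leq\|\cdot\|_{F}$, so it holds for any invertible pair, not only for the perturbation sampled here.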

A.3 Toy experiment

(a) Plots for entire regime
(b) Plots focused on sample-sufficient regime
Figure 7: Results of the window sliding experiment in the setting of the theoretical analysis. The dimension $d$ is $256$, the size of the full dataset $n$ is $256{,}000$, and the subset size $m$ is chosen from $16, 64, 256, 512, 2048, 3072,$ and $4096$. The left figure covers the entire regime ($m=16, 64, 256, 512,$ and $2048$), while the right figure focuses on the sample-sufficient regime.

To validate our theoretical analysis, we conduct a window sliding experiment similar to the one in Sec. 5, in the setting of the theoretical analysis in Sec. A.2, while varying the subset sizes and the starting points of the window subsets at $d=256$ and $n=256{,}000$. The results are shown in Fig. 7. Fig. 7(a) shows the plot for both the sample-deficient and sample-sufficient regimes, including $m=16, 64, 256, 512, 2048$, while Fig. 7(b) shows focused plots for the sample-sufficient regime, where $m=2048, 3072, 4096$. The x-axis in Fig. 7(a) is the starting point of the window subset, which identifies the ranking of the hardest sample in the window subset, while that in Fig. 7(b) is the end point of the window subset, which identifies the ranking of the easiest sample in the window subset. The y-axis is the cosine similarity between $\mathbf{w}$ and $\hat{e}_{1}=(1,0,\dots,0)$, where $\mathbf{w}$ is the solution of the regression problem and $\hat{e}_{1}$ is the unit vector with its first coordinate equal to 1, which is the normal vector of the true decision boundary. A higher cosine similarity implies a better solution. Red lines show the results for smaller subset sizes $m$, and the blue and black lines show the results for larger subset sizes.

In the sample-deficient regime where $m\leq d$ (red lines), the cosine similarity increases as the starting point of the window increases, meaning that it is better to use easy samples to learn the linear classifier. On the other hand, in the sample-sufficient regime where $m>d$ (blue and black lines), the cosine similarity is larger for windows with a smaller end point, meaning that it is better to include difficult samples to learn a better classifier. This result agrees with the theoretical analysis, which claims that including easier data samples yields a better solution for a smaller subset size, whereas including harder samples yields a better solution for a larger subset size. As we conjectured in Sec. 3.2, the transition of the desirable selection strategy occurs near $m=\Theta(d)$.
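A scaled-down version of such a window sliding experiment can be sketched as follows. This is an illustration under stated assumptions, not the exact setup of Sec. A.2: inputs are drawn as standard Gaussians, labels are the sign of the first coordinate, and a sample is treated as harder the closer it lies to the decision boundary $x_{1}=0$. The function names and the default sizes (`d=32`, `n=4000`, `m=16`) are hypothetical and chosen only to keep the run fast.

```python
import numpy as np

def window_cosine(X, y, order, start, m, lam=1e-6):
    """Cosine similarity between the ridge solution on one window and e_1."""
    idx = order[start:start + m]
    Xs = X[idx].T                       # features as columns, shape (d, m)
    ys = y[idx]
    K = Xs.T @ Xs + lam * np.eye(m)     # regularized Gram matrix, shape (m, m)
    w = Xs @ np.linalg.solve(K, ys)     # w = X_S (X_S^T X_S + lam I)^{-1} y_S
    return w[0] / np.linalg.norm(w)     # cosine with e_1 = (1, 0, ..., 0)

def toy_experiment(d=32, n=4000, m=16, num_windows=5, seed=0):
    rng = np.random.default_rng(seed)
    # Assumed data model: x ~ N(0, I_d), y = sign(x_1).
    X = rng.normal(size=(n, d))
    y = np.sign(X[:, 0])
    # Assumed difficulty: rank 0 is the hardest sample (closest to the boundary).
    order = np.argsort(np.abs(X[:, 0]))
    starts = np.linspace(0, n - m, num_windows).astype(int)
    return [(int(s), float(window_cosine(X, y, order, s, m))) for s in starts]

for start, cos in toy_experiment():
    print(f"window start rank {start:5d}: cosine similarity {cos:.3f}")
```

Calling `toy_experiment` with a larger `m` relative to `d` gives a rough view of the sample-sufficient regime; the small ridge term `lam` is there only to keep the linear system solvable when $m>d$.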

Appendix B Discussions on Using Kernel Ridge Regression as a Proxy

In Algorithm 1, we use the kernel ridge regression as a proxy for training neural networks to evaluate the performance of window subsets. In this section, we provide some theoretical rationale behind the use of the kernel ridge regression.

Our use of the kernel ridge regression as a proxy for training neural networks can be partly explained by the progress in the theoretical understanding of neural network training through kernel methods. In particular, a line of work (Neal, 1996; Lee et al., 2018; Jacot et al., 2018; Arora et al., 2019) has shown that training and generalization of neural networks can be approximated by two associated kernel matrices: the Conjugate Kernel (CK) and the Neural Tangent Kernel (NTK). The Conjugate Kernel is defined as the Gram matrix of the derived features produced by the final hidden layer of the network, while the NTK is the Gram matrix of the Jacobian of in-sample predictions with respect to the network weights. These two kernels also have fundamental relations in terms of their eigenvalue distributions, as analyzed by Fan & Wang (2020). Our proxy task is motivated by the observation that kernel regression with these model-related kernels can provide a good approximation to the original model (under assumptions such as sufficient width, random initialization, and a small enough learning rate). As an example, the work by Arora et al. (2019) provides the following theorem, which connects the training of a neural network with kernel ridge regression using the NTK.

Theorem 4 (Informal version of Arora et al. (2019)).

Consider a fully connected neural network with sufficiently large width $d_{1}=d_{2}=\dots=d_{L}$, where $d_{l}$ is the number of nodes in the $l$-th layer. Given a training dataset $\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{n}\subset\mathbb{R}^{d}\times\mathbb{R}$ with normalized inputs $\|\mathbf{x}_{i}\|_{2}=1$, the network is trained by gradient descent with a sufficiently small learning rate to minimize the square loss $\sum_{i=1}^{n}(f_{nn}(\mathbf{x}_{i})-y_{i})^{2}$, where $f_{nn}$ is the trained network.
With a kernel function of the network $K(\cdot,\cdot)$, the NTK of the training data $\mathbf{H}\in\mathbb{R}^{n\times n}$ is defined by $\mathbf{H}_{ij}=K(\mathbf{x}_{i},\mathbf{x}_{j})$. For a test point $\mathbf{x}_{te}$, the kernel between the test point and the training dataset $\mathbf{X}=[\mathbf{x}_{1},\mathbf{x}_{2},\dots,\mathbf{x}_{n}]\in\mathbb{R}^{d\times n}$ is defined by $\mathbf{K}(\mathbf{x}_{te},\mathbf{X})\in\mathbb{R}^{n}$, where $\mathbf{K}(\mathbf{x}_{te},\mathbf{X})_{i}=K(\mathbf{x}_{te},\mathbf{x}_{i})$.
Let $f_{ntk}(\mathbf{x}_{te})=(\mathbf{K}(\mathbf{x}_{te},\mathbf{X}))^{\top}\mathbf{H}^{-1}\mathbf{y}$. Then,

$$|f_{nn}(\mathbf{x})-f_{ntk}(\mathbf{x})|\leq\epsilon$$

Theorem 4 justifies that the kernel regression with NTK can provide a good approximation to the neural network training under the specified assumptions. However, calculating the NTK for the entire neural network requires high computational cost as it involves computing the Jacobian with respect to the network weights.

To address this problem, the Conjugate Kernel is often considered a promising replacement for the NTK. For example, the work by Zhou et al. (2022) utilizes kernel ridge regression based on the Conjugate Kernel for dataset distillation. We also use the Conjugate Kernel in our kernel ridge regression (Eq. 1), by defining the kernel matrix as $\mathbf{X}_{\mathbf{S}}^{\top}\mathbf{X}_{\mathbf{S}}$, where $\mathbf{X}_{\mathbf{S}}=[\mathbf{f}_{1},\dots,\mathbf{f}_{m}]$ is composed of features produced by the exact target network under consideration (ResNet18 for CIFAR-10 and ResNet50 for CIFAR-100/ImageNet). By considering the features from the target network, we can obtain (approximate) network predictions that are linear in these derived features.
In detail, the output of the CK-regression for the test example $\mathbf{x}_{te}$ can be written as $f_{ntk}(\mathbf{x}_{te})=\mathbf{x}_{te}^{\top}\mathbf{X}_{\mathbf{S}}(\mathbf{X}_{\mathbf{S}}^{\top}\mathbf{X}_{\mathbf{S}})^{-1}\mathbf{y}_{\mathbf{S}}=\mathbf{x}_{te}^{\top}\mathbf{w}$, where $\mathbf{w}=\mathbf{X}_{\mathbf{S}}(\mathbf{X}_{\mathbf{S}}^{\top}\mathbf{X}_{\mathbf{S}})^{-1}\mathbf{y}_{\mathbf{S}}$ is the solution in Eq. 2 for $\lambda=0$.
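The CK-regression above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: it assumes the window features are stacked as columns of a `(d, m)` array and that labels are one-hot encoded as regression targets. The helper names `ck_regression_weights` and `window_accuracy` are hypothetical.

```python
import numpy as np

def ck_regression_weights(F_S, Y_S, lam=0.0):
    """Solve w = X_S (X_S^T X_S + lam I)^{-1} y_S.

    F_S: (d, m) features of the window subset, one column per sample.
    Y_S: (m, c) regression targets (assumed one-hot labels).
    """
    m = F_S.shape[1]
    K = F_S.T @ F_S + lam * np.eye(m)       # Conjugate Kernel (Gram) matrix, (m, m)
    return F_S @ np.linalg.solve(K, Y_S)    # linear classifier, shape (d, c)

def window_accuracy(W, F_all, labels):
    """Accuracy of the linear classifier x -> argmax(x^T W) on a feature set."""
    preds = np.argmax(F_all.T @ W, axis=1)
    return float(np.mean(preds == labels))

# Tiny synthetic usage example (random features, not real network outputs).
rng = np.random.default_rng(0)
F = rng.normal(size=(20, 10))               # d = 20 features, m = 10 samples
labels = np.arange(10) % 3
W = ck_regression_weights(F, np.eye(3)[labels])
print(window_accuracy(W, F, labels))
```

With $\lambda=0$ the Gram matrix must be invertible, which requires $m\leq d$ and linearly independent features; a positive `lam` removes this restriction.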

Of course, this kernel approximation of the neural network models, which assumes a fixed feature extractor, does not exactly match our situation where the selected subset not only affects the linear classifier but also the feature extractor itself during the training. However, this is still a good proxy that can reflect the network architecture of our interest in a computationally-efficient manner. Also, our analysis in Table 2 shows that this proxy finds the best window subset that aligns well with the result from the actual training of the full model.

Appendix C Implementation Details

C.1 Baseline details

We benchmark our BWS algorithm against eight state-of-the-art methods: Forgetting score (Toneva et al., 2019), EL2N score (Paul et al., 2021), AdaCore (Pooladzandi et al., 2022), LCMat (Shin et al., 2023), Moderate DS (Xia et al., 2023), CCS (Zheng et al., 2023), SSL Prototype score (Sorscher et al., 2022), and Memorization score (Feldman & Zhang, 2020).

The Forgetting and EL2N scores are obtained by averaging the results of five independent training runs on the full CIFAR-10/100 dataset. Specifically, Forgetting scores are computed at the 200th epoch (full training), while EL2N scores are computed at the 20th epoch. For our ImageNet experiments, pre-calculated Forgetting, EL2N, SSL Prototype, and Memorization scores are sourced from https://github.com/rgeirhos/dataset-pruning-metrics (Sorscher et al., 2022).

For AdaCore and LCMat, subset selection is conducted only once, at the 10th epoch, to ensure a fair comparison with the other baselines. Both the AdaCore and LCMat implementations are sourced from the LCMat repository. For Moderate DS, models are trained using the full dataset, and the individual data features are extracted from the models. These features are defined as the outputs of the penultimate layer, with dimension 512 for ResNet18 and 2048 for ResNet50. The CCS algorithm employs the aforementioned Forgetting score. Within the CCS approach, we consistently set the hyperparameter $\beta$ to zero across all data selection ratios. We also provide the results of CCS with the optimal $\beta$ obtained by grid search in Appendix §E.

All experiments use consistent network architectures, as detailed in Section 6: ResNet18 for CIFAR-10 and ResNet50 for both CIFAR-100 and ImageNet. Additional experimental specifications related to the learning algorithms are reported in Table 5 of Appendix §C.2.

Details of the baselines are summarized below:

  • EL2N score: The Error L2-Norm (EL2N) score of data $(\mathbf{x}_{i},y_{i})$ is defined as $\mathbb{E}[\|f(\mathbf{W}(t),\mathbf{x}_{i})-y_{i}\|_{2}]$, where $f(\mathbf{W}(t),\mathbf{x})$ is the output of the neural network for the sample $(\mathbf{x},y)$ at the $t$-th epoch.

  • Forgetting score: The Forgetting score is defined as the number of times during training (until epoch $T$) that the decision of the sample switches from a correct one to an incorrect one. $\text{Forgetting}(\mathbf{x}_{i},y_{i})$ is defined as

    $$\sum_{t=2}^{T}\mathbb{1}\{\operatorname*{arg\,max}f(\mathbf{W}(t-1),\mathbf{x}_{i})=y_{i}\}\left(1-\mathbb{1}\{\operatorname*{arg\,max}f(\mathbf{W}(t),\mathbf{x}_{i})=y_{i}\}\right).\tag{8}$$
  • AdaCore: Adaptive Second order Coresets (AdaCore) is an algorithm that finds a subset imitating the full gradient preconditioned with the Hessian matrix, by solving the following optimization problem:

    $$S^{*}\in\operatorname*{arg\,min}_{S\subset V}\sum_{i\in V}\min_{j\in S}\|\mathbf{H}_{i}(w_{t})^{-1}\mathbf{g}_{i}(w_{t})-\mathbf{H}_{j}(w_{t})^{-1}\mathbf{g}_{j}(w_{t})\|,\quad\text{s.t. }|S|\leq r\tag{9}$$

    where $\mathbf{g}_{i}(w_{t})=\nabla l(w_{t},(\mathbf{x}_{i},y_{i}))$ and $\mathbf{H}_{i}(w_{t})=\nabla^{2}l(w_{t},(\mathbf{x}_{i},y_{i}))$ denote the gradient and the Hessian of the loss function for the data point $(\mathbf{x}_{i},y_{i})$ with the model parameter $w_{t}$ at the $t$-th epoch of training, respectively. Here $V$ denotes the full dataset and $S$ the coreset of size $r$.
    In the AdaCore method, when employing the cross entropy loss with a softmax layer as the final layer, the gradient $\mathbf{g}_{i}(w_{t})$ is approximated by $p_{i}-y_{i}$, where $p_{i}$ is the softmax output for the data point $(\mathbf{x}_{i},y_{i})$. Moreover, to reduce computational complexity, the Hessian $\mathbf{H}_{i}(w_{t})$ is approximated using only its diagonal.

  • LCMat: Loss-Curvature Matching (LCMat) is an algorithm that finds a subset matching the loss curvature of the full dataset. Because using the loss curvature directly is intractable, the following alternative optimization problem is solved instead:

    $$S^{*}\in\operatorname*{arg\,min}_{S\subset V}\sum_{i\in V}\min_{j\in S}\|\mathbf{g}_{i}(w_{t})-\mathbf{g}_{j}(w_{t})\|+\frac{1}{2}\rho\sum_{k\in\mathcal{K}}|\lambda_{i,k}-\lambda_{j,k}|,\quad\text{s.t. }|S|\leq r\tag{10}$$

    where $\mathbf{g}_{i}(w_{t})=\nabla l(w_{t},(\mathbf{x}_{i},y_{i}))$ and $\lambda_{i,k}=[\mathbf{H}_{i}(w_{t})]_{kk}=\nabla^{2}_{kk}l(w_{t},(\mathbf{x}_{i},y_{i}))$ denote the gradient and the $k$-th diagonal element of the Hessian of the loss function for the data point $(\mathbf{x}_{i},y_{i})$ with the model parameter $w_{t}$ at the $t$-th epoch of training, respectively. Here $V$ denotes the full dataset, $S$ the coreset of size $r$, and $W$ the model parameter space.
    $\mathcal{K}=\operatorname*{arg\,max}_{|\mathcal{K}|=K}\sum_{k\in\mathcal{K}}\text{Var}_{i}(\lambda_{i,k})$ is the set of indices of the $K$ sub-dimensions of $W$ with the highest variance. In LCMat, when employing the cross entropy loss with a softmax layer as the final layer, the gradient $\mathbf{g}_{i}(w_{t})$ is approximated by $p_{i}-y_{i}$, where $p_{i}$ is the softmax output for the data point $(\mathbf{x}_{i},y_{i})$.

  • Moderate DS: For each class of a given dataset, Moderate Coreset calculates the distance between each feature and the feature mean of the class, defined as $d_{i}=\|\mathbf{f}_{i}-\frac{\sum_{j\in S}\mathbf{f}_{j}}{|S|}\|_{2}$, where $S$ is the set of features whose label is the same as that of $\mathbf{f}_{i}$. Then, the data points whose distances are closest to the median distance, $\mathrm{median}(\{d_{i}\}_{i\in S})$, are selected.

  • CCS: Coverage-Centric Coreset Selection (CCS) is a difficulty-score-based algorithm that considers overall data coverage of the distribution as well as sample importance. CCS first prunes the hardest $\beta\%$ of the data and splits the remaining data into $k$ subsets $\{\mathbf{B}_{i}\}_{i=1}^{k}$ based on evenly divided score ranges $\{\mathbf{R}_{i}\}_{i=1}^{k}$. Then, CCS selects the same number of samples from each score range to make the score distribution of the selected samples uniform.

  • SSL Prototype score: The work by Sorscher et al. (2022) conducts $k$-means clustering of samples in the embedding space of a model pre-trained on the ImageNet dataset. It then defines a self-supervised prototype metric (SSL Prototype score) as the Euclidean distance of a sample to its nearest cluster centroid, or prototype. Points located closer to the center have lower SSL scores.

  • Memorization score: Memorization score (Feldman & Zhang, 2020) calculates the influence of each training example $(\mathbf{x}_{i},y_{i})$ on the classification accuracy of that same example, and is defined as follows:

    $$\text{mem}((\mathbf{x}_{i},y_{i}))=\mathbf{P}(h_{T}(\mathbf{x}_{i})=y_{i})-\mathbf{P}(h_{T\setminus\{(\mathbf{x}_{i},y_{i})\}}(\mathbf{x}_{i})=y_{i}) \tag{11}$$

    where $h_{S}(\cdot)$ is a model trained on the set $S$ and $T$ is the training dataset.
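As an illustration of the Moderate DS rule described in the list above, the following is a minimal NumPy sketch, not the reference implementation. The function name `moderate_selection`, the per-class rounding of the budget, and the tie-breaking by a stable sort are assumptions made only for this example.

```python
import numpy as np

def moderate_selection(features, labels, ratio):
    """Pick, per class, the samples whose distance to the class mean
    is closest to the class's median distance (illustrative sketch)."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    selected = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        # d_i = || f_i - mean of the class features ||_2
        d = np.linalg.norm(features[idx] - features[idx].mean(axis=0), axis=1)
        # Rank by closeness to the median distance and keep the closest ones.
        order = np.argsort(np.abs(d - np.median(d)), kind="stable")
        keep = max(1, int(round(ratio * len(idx))))  # assumed rounding rule
        selected.extend(idx[order[:keep]].tolist())
    return sorted(selected)
```

Here `features` would be the penultimate-layer embeddings $\mathbf{f}_{i}$ and `ratio` the selection ratio expressed as a fraction.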

C.2 Experiment details

Data pruning experiment

We conduct experiments with three public datasets, CIFAR-10/100 and ImageNet, by training ResNet networks (He et al., 2016) of different depths. ResNet18 is used for CIFAR-10, and ResNet50 is used for CIFAR-100 and ImageNet. The implementation is based on the ResNet network in torchvision (Paszke et al., 2019). Since CIFAR-10/100 images are smaller than ImageNet images, we replace the front parts of the ResNet (a convolution layer with a $7\times 7$ kernel and $2\times 2$ stride, and a max pooling layer with a $3\times 3$ kernel and $2\times 2$ stride) with a single convolution layer with a $3\times 3$ kernel and $1\times 1$ stride for the small images. The details on hyperparameters and optimization methods used in training are summarized in Table 5.

Our experiments report the averaged results from three runs on CIFAR-10/100 and two runs on ImageNet, with shaded regions representing standard deviations. Networks are trained on datasets curated according to the given selection ratio and method. Crucially, our data selection is class-balanced: it preserves the proportion of data in each class.
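One plausible way to realize this class-balanced selection is to apply the same window inside every class, which keeps the original class proportions. The sketch below is an assumption about the mechanics, not a description of the released code; `classwise_window` and its arguments are hypothetical names.

```python
from collections import defaultdict

def classwise_window(scores, labels, start, ratio):
    """Apply the same window [start, start + ratio) inside every class.

    scores: per-sample difficulty scores (higher = harder).
    start, ratio: fractions in [0, 1] of each class's sorted sample list.
    """
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    selected = []
    for idx in by_class.values():
        # Sort this class's samples from hardest to easiest.
        idx.sort(key=lambda i: -scores[i])
        lo = int(start * len(idx))
        hi = lo + int(ratio * len(idx))
        selected.extend(idx[lo:hi])
    return sorted(selected)
```

Because each class contributes the same fraction of its samples, the selected subset has the same class ratios as the full dataset, up to integer truncation.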

Table 5: Details for the experiments used in the training of the dataset.
CIFAR-10 CIFAR-100 ImageNet
Architecture ResNet18 ResNet50 ResNet50
Batch size 128 128 256
Epochs 200 400 90
Initial Learning Rate 0.05 0.2 0.1
Weight decay 5e-4 5e-4 1e-4
Learning Rate Scheduler Cosine annealing scheduler Step scheduler
Optimizer SGD with momentum 0.9
Data Augmentation Random Zero Padded Cropping (4 pixels) Random Resized Cropping
Random left-right flipping (probability 0.5)
Normalize by dataset’s mean, variance

Cross-architecture robustness

We conduct cross-architecture experiments on the CIFAR-10 dataset, training three distinct networks: a simple CNN, EfficientNet-B0, and a Vision Transformer (ViT) pretrained on the ImageNet dataset.

For the simple CNN, we design an architecture comprising three convolutional layers with a $3\times 3$ kernel and $1\times 1$ stride (channels: 64, 128, 256), interleaved with two max-pooling layers with a $2\times 2$ kernel. Each convolutional layer is followed by a batch normalization layer and a ReLU activation. The convolutional layers are followed by two fully connected layers (channels: 128, 256). We set the initial learning rate to 0.05 and the weight decay to 1e-4. Other details are the same as in the CIFAR-10 case in Table 5.

For EfficientNet-B0, we closely follow the implementation details of Tan & Le (2019), setting both the learning rate and the weight decay to 1e-4. Other implementation details are the same as in the CIFAR-10 case in Table 5.

For the ViT, we adhere to the implementation specifications detailed in Dosovitskiy et al. (2021). We obtain a ViT model pretrained on the ImageNet dataset using the timm library for PyTorch, and subsequently fine-tune it on the CIFAR-10 dataset for 10 epochs. To adapt the ImageNet-pretrained model to CIFAR-10, we resize the images to $224\times 224$ pixels. We set the initial learning rate to 1e-4 and the weight decay to 1e-4. We do not use a learning rate scheduler. Other details are the same as in the CIFAR-10 case in Table 5.

Within our algorithm, BWS, we utilize the Forgetting score computed with the ResNet18 architecture. Furthermore, we build the feature extractor using the simple CNN, EfficientNet-B0, or ViT architecture, respectively. For the CNN and EfficientNet-B0 architectures, we train for 20 epochs, while for the ViT setup, we fine-tune for 3 epochs. We report the averaged results from three independent runs on the three networks, with the shaded regions indicating the standard deviations. As in the previous experiments, data samples are selected to ensure a balanced portion of each class, preserving the original class ratios within the CIFAR-10 dataset.

Robustness to label noise

We generate noisy versions of the CIFAR-10 dataset with symmetric label noise at levels of 20% and 40%. For each noisy dataset, we compute the EL2N score using a ResNet18 model, averaging the outcomes over five independent runs. We choose the EL2N score because it is cheaper to compute than the Forgetting score, which matters when the scores must be re-calculated for the entire noisy dataset. In our analysis, we apply Algorithm 1 to the noisy CIFAR-10 dataset, where the samples are ordered by their EL2N scores. To calculate the classification accuracy of $\mathbf{w}_{\mathbf{S}}$, we specifically use the lower-scoring 50% of the samples, i.e., $\frac{1}{n}\sum_{i=\frac{n}{2}}^{n}\mathbbm{1}(\operatorname*{arg\,max}_{c}(\mathbf{w}_{\mathbf{S}}^{\top}\mathbf{x}_{i})_{c}=y_{i})$, instead of the typical approach $\frac{1}{n}\sum_{i=1}^{n}\mathbbm{1}(\operatorname*{arg\,max}_{c}(\mathbf{w}_{\mathbf{S}}^{\top}\mathbf{x}_{i})_{c}=y_{i})$. This adjustment is made to exclude noisy samples from the quality evaluation of window subsets and thus prevent overfitting to noise in the data.
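The two ingredients of this setup, injecting symmetric label noise and scoring a linear classifier on the lower-scoring half only, can be sketched as follows. This is an illustrative NumPy sketch under stated assumptions: the helper names are hypothetical, a corrupted label is assumed to always move to a different class, and the $\frac{1}{n}$ normalization of the expression above is kept as written.

```python
import numpy as np

def add_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Replace a random `noise_rate` fraction of labels with a different class."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    noisy = labels.copy()
    n_flip = int(noise_rate * len(labels))
    flip = rng.choice(len(labels), size=n_flip, replace=False)
    # An offset in [1, num_classes - 1] guarantees that the label changes.
    offset = rng.integers(1, num_classes, size=n_flip)
    noisy[flip] = (labels[flip] + offset) % num_classes
    return noisy

def low_score_accuracy(W, X, y, scores):
    """(1/n) * sum over the lower-scoring half of 1[argmax_c (W^T x_i)_c == y_i]."""
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    order = np.argsort(-np.asarray(scores), kind="stable")  # hardest first
    low_half = order[len(order) // 2:]
    pred = np.argmax(X[low_half] @ W, axis=1)
    return float(np.sum(pred == y[low_half])) / len(order)
```

Since the normalization is a constant across windows, it does not affect which window attains the highest value.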

Ablation on different window selection methods

The formal definitions of Gradient difference and Gradient similarity are as follows:

  • Gradient difference: minimizing the difference between the gradients of the full training dataset ($V$) and window subset ($S$).

    $$\text{Gradient Difference}(V,S)=\left\|\frac{\sum_{i\in V}\nabla f_{\mathbf{w}}(\mathbf{x}_{i})}{|V|}-\frac{\sum_{i\in S}\nabla f_{\mathbf{w}}(\mathbf{x}_{i})}{|S|}\right\|_{2} \tag{12}$$

  • Gradient similarity: maximizing the cosine similarity between the gradients of the full training dataset ($V$) and window subset ($S$).

    $$\text{Gradient Similarity}(V,S)=\frac{\sum_{i\in V}\nabla f_{\mathbf{w}}(\mathbf{x}_{i})}{\left\|\sum_{i\in V}\nabla f_{\mathbf{w}}(\mathbf{x}_{i})\right\|_{2}}\cdot\frac{\sum_{i\in S}\nabla f_{\mathbf{w}}(\mathbf{x}_{i})}{\left\|\sum_{i\in S}\nabla f_{\mathbf{w}}(\mathbf{x}_{i})\right\|_{2}} \tag{13}$$
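Given per-sample gradients stacked as rows of a matrix, both criteria reduce to a few lines. The sketch below assumes the gradients $\nabla f_{\mathbf{w}}(\mathbf{x}_{i})$ have already been computed and flattened; the function names and the array layout are illustrative choices, not part of the original experiments.

```python
import numpy as np

def gradient_difference(grads_full, grads_subset):
    """L2 norm of the gap between the mean gradients of V and S (Eq. 12)."""
    gv = np.asarray(grads_full, dtype=float).mean(axis=0)
    gs = np.asarray(grads_subset, dtype=float).mean(axis=0)
    return float(np.linalg.norm(gv - gs))

def gradient_similarity(grads_full, grads_subset):
    """Cosine similarity between the summed gradients of V and S (Eq. 13)."""
    gv = np.asarray(grads_full, dtype=float).sum(axis=0)
    gs = np.asarray(grads_subset, dtype=float).sum(axis=0)
    return float(gv @ gs / (np.linalg.norm(gv) * np.linalg.norm(gs)))
```

A window would then be chosen by taking the minimum of `gradient_difference` or the maximum of `gradient_similarity` over the candidate windows.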

C.3 Computational cost

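Table 6: Time cost (in seconds) for subset selection of each algorithm across selection ratios. All timings are measured on an Nvidia A100 40GB GPU.

| Dataset | Method | 1% | 5% | 10% | 20% | 30% | 40% | 50% | 75% | 90% |
|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | BWS (Ours) | 4.3 | 4.8 | 7.2 | 8.5 | 9.7 | 10.4 | 10.4 | 9.0 | 6.3 |
| CIFAR-10 | LCMat | 520 | 1197 | 2260 | 4213 | 5800 | 7173 | 8450 | 10320 | 10631 |
| CIFAR-10 | AdaCore | 224 | 699 | 1256 | 2273 | 3186 | 3977 | 4649 | 5637 | 5839 |
| CIFAR-100 | BWS (Ours) | 14.6 | 55.6 | 57.3 | 62.2 | 64.7 | 64.0 | 61.7 | 45.7 | 30.2 |
| CIFAR-100 | LCMat | 1465 | 1468 | 1471 | 1478 | 1483 | 1489 | 1493 | 1501 | 1504 |
| CIFAR-100 | AdaCore | 1295 | 1300 | 1304 | 1309 | 1315 | 1320 | 1324 | 1331 | 1334 |
| ImageNet | BWS (Ours) | 1423 | 2910 | 3941 | 6141 | 7590 | 8616 | 9029 | 10118 | 6499 |
| ImageNet | LCMat | 238451 | 239027 | 239694 | 240934 | 242003 | 242864 | 243594 | 244854 | 245245 |
| ImageNet | AdaCore | 213733 | 214309 | 214963 | 216181 | 217067 | 217866 | 218521 | 219593 | 219924 |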

We compare the computational cost of our BWS algorithm, as detailed in Algorithm 1, with other optimization-based coreset selection baselines, namely LCMat and AdaCore. We assume that the sample scores used for sorting are readily available and that the feature extractor is also provided in advance, both for our method and for the optimization-based methods. We report and compare the time taken to select the subset for each algorithm.

In Table 6, we detail the time required to select subsets from various datasets using the different methods. BWS is substantially faster than the optimization-based techniques at every selection ratio. On CIFAR-10 at a 10% selection ratio, for example, BWS takes 7.2 seconds, whereas LCMat and AdaCore take 2260 and 1256 seconds, respectively. As described in Sec. 5, our strategy, which selects the best window subset from a continuous interval of samples sorted by their scores, greatly reduces the search space compared to general optimization techniques, leading to improved efficiency.
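To give a sense of the size of this reduction, the short sketch below counts candidate windows against unconstrained subsets. It is only a back-of-the-envelope illustration: the 5% step between window starting points is borrowed from the sliding window experiments and is an assumption here, and the helper names are hypothetical.

```python
from math import lgamma, log

def num_windows(ratio_pct, step_pct=5):
    """Number of candidate starting points 0, step, ..., (100 - ratio)%."""
    return (100 - ratio_pct) // step_pct + 1

def log10_num_subsets(n, k):
    """log10 of the binomial coefficient C(n, k), via log-gamma to avoid overflow."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)) / log(10)
```

For a 10% subset, `num_windows(10)` gives 19 candidate windows, while `log10_num_subsets(50_000, 5_000)` shows that the number of unconstrained 10% subsets of the 50,000 CIFAR-10 training images has several thousand decimal digits.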

Appendix D Detailed Review of Related Works

In this section, we review additional related works on data subset selection from several perspectives.

When there is no validation set, some score-based selection methods, such as EL2N (Paul et al., 2021) and Forgetting (Toneva et al., 2019), suffer from performance degradation when the dataset includes label-noise samples, because label-noise samples are inherently hard to learn and thus often receive high scores. Some recent methods adopt more cautious measures to distinguish hard-to-learn but clean-label samples, known to be valuable for enhancing the generalization ability of neural networks, from label-noise samples. For instance, Cartography (Swayamdipta et al., 2020) utilizes two measures, the confidence mean and the confidence variance of each data sample, to distinguish hard-to-learn samples from mere label-noise samples. Second-Split Forgetting (Maini et al., 2022) achieves this goal by observing the learning time and forgetting time of each sample while training a model. AUM (Pleiss et al., 2020) observes the logit value of a given label and the next largest logit value, and uses the gap to separate noisy data samples from ambiguous data samples.

Another important issue that has been recently explored in data subset selection is the computational overhead of quantifying the data value. There are several recent attempts to value data without training a neural network. For example, CG-score (Ki et al., 2023) evaluates data instances without model training by calculating the analytical gap in generalization errors when an instance is held out. LAVA (Just et al., 2023) evaluates the value of each data instance without model training by using a proxy for the validation performance, namely the class-wise Wasserstein distance between the training and validation sets. DAVINZ (Wu et al., 2022) utilizes the Neural Tangent Kernel (NTK) of a network at initialization to calculate the contribution of each data instance to a domain-aware generalization error bound.

In optimization-based selection, many works have utilized the effect of data on the model training. MaxMargin (Har-Peled et al., 2007), IWeS (Citovsky et al., 2023), and Selection-Via-Proxy (Coleman et al., 2020) use the confidence of a model to identify uncertain data during optimization. Maximum Margin Coresets (Har-Peled et al., 2007) selects data with the smallest margin in an SVM setting, IWeS (Citovsky et al., 2023) selects examples using importance sampling with a sampling probability based on the confidences of two models, and Selection-Via-Proxy (Coleman et al., 2020) applies confidence-based methods to a small proxy model to perform data selection. Das et al. (2021) solve a convex linear programming problem to find high-value data that contributes much to the loss and optimization, and Glister (Killamsetty et al., 2021b) finds data that contributes significantly to the loss during the training of a neural network. GradMatch (Killamsetty et al., 2021a) finds a subset whose gradient matches better with the gradient of the full dataset.

Additionally, in active learning, where data samples are selectively labeled for semi-supervised learning, Neural-Preconditioning (Kong et al., 2022) obtains the label of data that dominates the eigenspace of the NTK, and Culotta & McCallum (2005) obtain the label of the least confident data to reduce labeling costs. The works by Har-Peled et al. (2007); Citovsky et al. (2023); Coleman et al. (2020) utilize the confidence of a model to identify uncertain data during the optimization, and the works by Das et al. (2021); Killamsetty et al. (2021b) find the data that contributes much to the loss.

Appendix E Comparison with CCS

CCS (Zheng et al., 2023) prunes the hardest $\beta$% samples, divides the remaining data into $k$ non-overlapping ranges based on the difficulty scores, and uniformly assigns budgets to each range. Samples are then chosen from each range within the budget. If a range has fewer data than the assigned budget, the remaining budget is iteratively reassigned to other ranges. While methodologies based on difficulty scores suffer from a drastic performance drop at low subset ratios, CCS achieves high performance across a broad range by selecting diverse data with appropriate $\beta$ and $k$. However, CCS does not propose an efficient method to find the desired $\beta$ and reports the result with the optimal $\beta$ obtained by grid search, which may require high computational cost. Since choosing the best-performing model among the models trained on subsets chosen with different $\beta$ is not a fair comparison to other baselines, including BWS, we report the result of CCS obtained by setting the hyperparameter $\beta$ to 0 in the main experimental results. In this section, we additionally report the results of CCS with the optimal $\beta$ found by grid search and compare the performance with those of BWS and the oracle window in Table 9. In Table 8, we also report the optimal $\beta$ for CCS across different selection ratios. For the subset ratios of 10%, 20%, 30%, and 50%, we utilize the $\beta$ values reported in the original paper (marked with ), and for the other subset ratios, we conduct a grid search to find the best $\beta$ by exploring it with a step size of 10%.
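The CCS procedure summarized above (prune, stratify, allocate, reassign) can be sketched in plain Python as follows. This is an illustrative reading rather than the reference implementation: the function name `ccs_select`, the order in which ranges are visited (smallest first, so that unused budget flows to larger ranges), and uniform random sampling within a range are assumptions.

```python
import random

def ccs_select(scores, budget, beta=0.0, k=10, seed=0):
    """Sketch of coverage-centric selection.

    scores: difficulty scores (higher = harder); beta: fraction of the
    hardest samples pruned first; k: number of equal-width score ranges.
    """
    rng = random.Random(seed)
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    kept = order[int(beta * len(order)):]        # drop the hardest beta fraction
    lo = min(scores[i] for i in kept)
    hi = max(scores[i] for i in kept)
    width = (hi - lo) / k or 1.0                 # guard against identical scores
    bins = [[] for _ in range(k)]
    for i in kept:
        bins[min(int((scores[i] - lo) / width), k - 1)].append(i)
    selected = []
    # Visit the smallest ranges first so their unused budget flows to the rest.
    remaining = sorted((b for b in bins if b), key=len)
    while remaining:
        share = (budget - len(selected)) // len(remaining)
        b = remaining.pop(0)
        selected.extend(rng.sample(b, min(share, len(b))))
    return sorted(selected)
```

Setting `beta=0.0` corresponds to the untuned configuration used for CCS in the main experimental results, while a grid search would call this routine once per candidate `beta` and train a network on each resulting subset.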

Several key observations emerge from the results. First, CCS, after hyperparameter tuning, achieves performance comparable to the Oracle Window at lower selection ratios (1-10%). This is a natural consequence, given both methods' ability to discard the hardest samples in favor of easier ones at low ratios. However, at higher ratios (20%-90%), CCS's efficacy decreases relative to both the Oracle Window and BWS, even after tuning $\beta$. This decline can be attributed to CCS's strategy of selecting samples across a uniform score distribution after pruning the hardest ones. Even though CCS adjusts $\beta$ to 0 or lower values (e.g., 10 or 20) for ratios beyond 30%, the sample selection with a uniform score distribution makes CCS incorporate not only hard samples but also easy samples, which are less effective at high subset ratios. In comparison, the Oracle Window and BWS, which select samples from a contiguous difficulty range, focus on selecting harder samples as the subset ratio increases, resulting in better performance. Moreover, we analyze the computational costs associated with CCS and BWS, as summarized in Table 7. Since CCS requires the repeated training of deep neural networks in the process of tuning the hyperparameter $\beta$, it incurs significant computational overhead compared to BWS. On the other hand, BWS circumvents the need for hyperparameter tuning by solving a simple proxy task to identify the best window subset, which considerably shortens the time requirement.

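Table 7: Time cost (in seconds) to compute CCS and BWS.

| Dataset | Method | 5% | 30% | 75% |
|---|---|---|---|---|
| CIFAR-10 | CCS | 812 | 3897 | 3654 |
| CIFAR-10 | BWS | 13 | 59 | 130 |
| CIFAR-100 | CCS | 3909 | 18767 | 17594 |
| CIFAR-100 | BWS | 75 | 182 | 339 |
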
Table 8: Optimal $\beta$ (%) at different selection ratios on various datasets.

| Dataset | 1% | 5% | 10% | 20% | 30% | 40% | 50% | 75% | 90% |
|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | 80 | 50 | 30 | 10 | 10 | 10 | 0 | 0 | 0 |
| CIFAR-100 | 99 | 80 | 50 | 40 | 20 | 20 | 20 | 20 | 0 |
| ImageNet | 80 | 30 | 30 | 20 | 20 | 10 | 10 | 10 | 0 |

Table 9: Test accuracy of CCS with optimal $\beta$ at different selection ratios.

| Dataset | Method | 1% | 5% | 10% | 20% | 30% | 40% | 50% | 75% | 90% |
|---|---|---|---|---|---|---|---|---|---|---|
| CIFAR-10 | CCS | 46.58 | 72.12 | 81.99 | 88.89 | 91.74 | 93.10 | 94.09 | 94.95 | 95.29 |
| CIFAR-10 | Oracle | 47.17 | 72.89 | 82.67 | 89.06 | 91.80 | 93.59 | 94.54 | 95.23 | 95.37 |
| CIFAR-10 | BWS | 46.10 | 70.70 | 82.29 | 88.74 | 91.80 | 93.59 | 94.54 | 95.23 | 95.37 |
| CIFAR-100 | CCS | 11.01 | 31.11 | 45.52 | 56.09 | 64.26 | 68.51 | 70.80 | 75.83 | 78.13 |
| CIFAR-100 | Oracle | 10.63 | 30.39 | 45.16 | 58.91 | 67.51 | 72.70 | 75.00 | 78.42 | 79.00 |
| CIFAR-100 | BWS | 8.43 | 29.25 | 44.11 | 58.30 | 67.20 | 72.17 | 73.83 | 77.78 | 79.00 |
| ImageNet | CCS | 8.13 | 31.40 | 45.10 | 57.12 | 62.65 | 67.54 | 69.58 | 73.10 | 74.59 |
| ImageNet | Oracle | 7.97 | 33.58 | 48.84 | 62.83 | 68.22 | 71.33 | 72.74 | 74.73 | 75.25 |
| ImageNet | BWS | 7.02 | 33.31 | 46.72 | 62.32 | 67.16 | 70.47 | 72.68 | 74.73 | 75.25 |

Appendix F Additional Experiments

Figure 8: Sliding window experiments on the ImageNet dataset, measuring the test accuracy of models trained on window subsets while changing the starting point of the windows. Samples are sorted in descending order by their difficulty scores. The horizontal lines are results from random selection. For each subset ratio, there exists a best window, and its starting point shifts toward the left as the subset ratio increases.

F.1 Sliding window experiment for ImageNet dataset

We investigate the efficacy of the window selection approach by varying the starting points and demonstrate the existence of an optimal window subset. In this process, we arrange the ImageNet samples in descending order based on their Forgetting scores (Toneva et al., 2019), and then select windows of varying sizes $w$, from $10\%$ to $40\%$. The starting point for these windows is adjusted from $0$ to $(100-w)\%$, incrementing in steps of $5\%$. Subsequently, we train a ResNet50 model using these window subsets and present the resulting test accuracies in Fig. 8. Consistent with the observations in Fig. 3, within the ImageNet dataset, we note that for each subset ratio, there is an optimal starting point. Notably, this optimal point shifts progressively towards lower values, which correspond to more difficult samples, as the size of the window subset increases.
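The window enumeration used in this experiment can be sketched as a small generator. The name `sliding_windows` and the use of integer percentages are illustrative assumptions; `sorted_indices` is taken to be the sample indices already sorted from hardest to easiest.

```python
def sliding_windows(sorted_indices, w_pct, step_pct=5):
    """Yield (start_pct, window) for starts 0, step, ..., (100 - w)%.

    sorted_indices: sample indices in descending order of difficulty score.
    """
    n = len(sorted_indices)
    size = n * w_pct // 100
    for start_pct in range(0, 100 - w_pct + 1, step_pct):
        begin = n * start_pct // 100
        yield start_pct, sorted_indices[begin:begin + size]
```

Each yielded window would then be used to train one model, which gives one point on the corresponding accuracy curve.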

F.2 Cross architecture robustness

To test the robustness of our method across changes in neural network architectures, we conduct data pruning experiments on CIFAR-10 while using different architectures during sample scoring and training. The window subsets are constructed using samples ordered by their Forgetting scores, calculated on ResNet18 architecture. Then, the best window selection (Alg. 1) and the model training are conducted using a simpler CNN architecture or EfficientNet-B0 architecture. The results on the CNN architecture are presented in Fig. 9(a), and those on the EfficientNet-B0 are shown in Fig. 9(b). In all cases, our method (BWS) consistently achieves competitive performances across all selection ratios, demonstrating its robustness to changes in neural network architectures during data subset selection.

Figure 9: Cross-architecture experiments with (a) CNN (left) and (b) EfficientNet-B0 (right), where sample scores are calculated using a ResNet18 model. Full results are reported in Tables 16 and 17.

F.3 Robustness to label noise

Figure 10: Data pruning experiments with CIFAR-10, including (a) 20% label noise and (b) 40% label noise. Our method (BWS) attains the accuracy of the full training dataset despite the presence of label noise. Full results are reported in Tables 19-20.

We test the robustness of BWS in the presence of label noise in the training dataset. We corrupt a randomly chosen 20% or 40% of the CIFAR-10 samples with random label noise. It has been previously reported that difficulty score-based selection methods are susceptible to label noise, since such methods tend to assign high scores to label-noise samples (Toneva et al., 2019; Paul et al., 2021). Thus, these methods often end up prioritizing the label-noise samples in the selection process, leading to suboptimal results. On the other hand, our algorithm offers flexibility in choosing window subsets with varying levels of difficulty by changing the starting point, and selects the best window by solving a proxy task using kernel ridge regression. To further enhance the robustness of our method, we modify Alg. 1 to evaluate the solution of kernel ridge regression using only the low-scoring 50% of the samples from the training dataset, which will rarely include label-noise samples, instead of the full dataset. We use EL2N (Paul et al., 2021) as the difficulty score to order the samples in our algorithm. In Fig. 10, we compare the performance of this modified version of BWS with other baselines. While difficulty score-based selection and optimization-based selection methods suffer from performance degradation due to label noise, our method, along with another label noise-robust method, Moderate DS, achieves performance even higher than what is achievable with the full training dataset, which includes the 20% or 40% label noise, respectively.
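The modified window evaluation can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not Alg. 1 itself: a linear kernel is used so that the ridge solution can be written in primal form, the regularization weight `lam`, the 5% step, and the name `best_window` are hypothetical, and accuracy is averaged over the evaluation set, which selects the same window as any constant normalization.

```python
import numpy as np

def best_window(X, y, scores, w_pct, step_pct=5, lam=1.0, eval_low_frac=1.0):
    """Return (start_pct, accuracy) of the window whose ridge classifier
    scores highest on the evaluation set.

    X: (n, d) features; y: (n,) integer labels; scores: higher = harder.
    eval_low_frac=0.5 evaluates only on the lower-scoring half.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n, d = X.shape
    Y = np.eye(int(y.max()) + 1)[y]                         # one-hot targets
    order = np.argsort(-np.asarray(scores), kind="stable")  # hardest first
    eval_idx = order[n - int(eval_low_frac * n):]
    size = n * w_pct // 100
    best = (None, -1.0)
    for start_pct in range(0, 100 - w_pct + 1, step_pct):
        begin = n * start_pct // 100
        S = order[begin:begin + size]
        # Linear ridge solution: W = (X_S^T X_S + lam I)^{-1} X_S^T Y_S
        W = np.linalg.solve(X[S].T @ X[S] + lam * np.eye(d), X[S].T @ Y[S])
        acc = float(np.mean(np.argmax(X[eval_idx] @ W, axis=1) == y[eval_idx]))
        if acc > best[1]:
            best = (start_pct, acc)
    return best
```

With `eval_low_frac=1.0` the sketch evaluates on the full training set, and with `eval_low_frac=0.5` it uses only the low-scoring half, so that windows dominated by mislabeled samples are not rewarded for fitting the noise.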

Appendix G Ablation studies

G.1 Ablation study on window type: two half-width sliding windows

BWS sorts samples in a dataset by their difficulty scores and then selects the optimal window subset from one continuous single-interval regime. Thus, the window selection chooses samples of a similar difficulty level. To further examine possible benefits from non-contiguous subset selection, we conduct an additional experiment on the CIFAR-10 dataset by finding the optimal two half-width windows while varying their starting points. In detail, we sort the samples from CIFAR-10 in descending order based on the Forgetting score (Toneva et al., 2019), and for a subset of size $w\%$, we search over all combinations of two half-width windows, denoted by $[x_{1},x_{1}+w/2]\cup[x_{2},x_{2}+w/2]$, while varying their starting points $(x_{1},x_{2})$ in $x_{1}\in[0,100-w]$ and $x_{2}\in[x_{1}+w/2,100-w/2]$ with a step size of 5%. We train ResNet18 on each subset and evaluate the corresponding test accuracies. The full results are presented in Fig. 11, and in Table 10 we report the top five results (the compositions of half-width windows and their test accuracies) for subset ratios ranging from 10% to 40%. We highlight the cases where the two half-width windows are contiguous to each other with bold letters.
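The search grid described above can be enumerated directly. The following sketch lists the starting points $(x_{1},x_{2})$ in percent; the helper names are hypothetical, and a pair is called contiguous when the second window starts exactly where the first one ends.

```python
def half_width_pairs(w_pct, step_pct=5):
    """List (x1, x2) starting points, in percent, of two half-width windows
    [x1, x1 + w/2] and [x2, x2 + w/2] with x2 >= x1 + w/2."""
    half = w_pct / 2
    pairs = []
    x1 = 0
    while x1 <= 100 - w_pct:
        x2 = x1 + half
        while x2 <= 100 - half:
            pairs.append((x1, x2))
            x2 += step_pct
        x1 += step_pct
    return pairs

def is_contiguous(x1, x2, w_pct):
    """True when the two half-width windows form one window of width w."""
    return x2 == x1 + w_pct / 2
```

For a 10% subset ratio this grid contains 190 compositions, 19 of which are contiguous and therefore coincide with the ordinary sliding windows.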

Figure 11: Test accuracy of the models trained with two half-width windows of subset ratios 10% (top left), 20% (top right), 30% (bottom left), and 40% (bottom right) with varying starting points. The numbers on the axes indicate the starting points of each interval, and the color indicates the test accuracy for each composition of half-width windows. We note that a contiguous window, a point near the diagonal, attains performance levels comparable to the best performance.
Table 10: Top-five test accuracies and their corresponding half-width window compositions on the CIFAR-10 dataset. We highlight the cases where the two half-width windows are contiguous to each other with bold letters.

| Ratio | | 1st | 2nd | 3rd | 4th | 5th |
|---|---|---|---|---|---|---|
| 10% | Half-width windows | 40-45%, 50-55% | 40-45%, 55-60% | 45-50%, 55-60% | **50-55%, 55-60%** | **45-50%, 50-55%** |
| 10% | Test Acc | 83.04 | 82.87 | 82.71 | 82.67 | 82.46 |
| 20% | Half-width windows | 20-30%, 35-45% | **25-35%, 35-45%** | 20-30%, 40-50% | 20-30%, 50-60% | 25-35%, 45-55% |
| 20% | Test Acc | 89.16 | 89.06 | 88.98 | 88.84 | 88.77 |
| 30% | Half-width windows | 10-25%, 30-45% | 10-25%, 35-50% | 5-20%, 30-45% | 15-30%, 35-50% | **15-30%, 30-45%** |
| 30% | Test Acc | 92.02 | 91.98 | 91.90 | 91.84 | 91.80 |
| 40% | Half-width windows | 5-25%, 30-50% | **5-25%, 25-45%** | 10-30%, 35-55% | 5-25%, 35-55% | 5-25%, 40-60% |
| 40% | Test Acc | 93.67 | 93.59 | 93.54 | 93.46 | 93.40 |

We observe that for every considered subset ratio, the top-five best-performing cases include contiguous windows or windows separated by a gap of only 5%, even though we allowed the two half-width windows to be chosen far away from each other. This result further supports our use of window selection, which only considers subsets from a continuous interval of samples based on difficulty scores, for choosing a near-optimal subset efficiently across a broad range of selection ratios.

Figure 12: Test accuracy using subsets randomly sampled from windows that are (a) one and a half times ($\times 1.5$) and (b) twice ($\times 2$) as wide as the subset ratio, while varying the starting points of these windows. The horizontal lines represent the results from the oracle window, which is the maximum test accuracy obtained in the sliding window experiments, for each subset ratio. Our observations indicate that at lower subset ratios, there are subsets whose performance is comparable to that of the oracle window, but at higher subset ratios, the performance of all subsets consistently falls short of the oracle window.

G.2 Ablation study on window type: wider sliding windows

We also conduct an additional experiment on the CIFAR-10 dataset to explore non-contiguous sample selection by considering random selection from wider windows. With the samples arranged in descending order of difficulty score, and given a starting point $s\%$ and a subset ratio $w\%$, we randomly choose samples within the range $[s, s + c \cdot w]\%$, where $c$ is a constant greater than 1. In particular, we sort the CIFAR-10 samples in descending order based on their Forgetting scores (Toneva et al., 2019), and then select windows for subset ratios ranging from 10% to 40%, adjusting the starting point from 0 to $(100 - cw)\%$ in 5% increments. The window width is $cw\%$ for a given ratio $w\%$ and a constant $c$ equal to either 1.5 or 2. Subsequently, we randomly select $w\%$ of the data from the window, train the ResNet18 network with this subset, and plot the resulting test accuracies in Fig. 12.
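The sampling procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the function names, the rounding of window boundaries, and the fixed seed are all hypothetical choices, and `sorted_indices` is assumed to hold sample indices already sorted by difficulty score in descending order.

```python
import random


def starting_points(ratio_pct, c, step=5):
    """Starting points from 0 to (100 - c * ratio)% in `step`% increments."""
    last = 100 - c * ratio_pct
    return [s for s in range(0, 101, step) if s <= last]


def wider_window_subset(sorted_indices, start_pct, ratio_pct, c, seed=0):
    """Randomly pick ratio_pct% of the data from the window [start, start + c * ratio]%."""
    n = len(sorted_indices)
    width_pct = c * ratio_pct
    if start_pct + width_pct > 100:
        raise ValueError("window extends past the end of the dataset")
    lo = round(n * start_pct / 100)
    hi = round(n * (start_pct + width_pct) / 100)
    k = round(n * ratio_pct / 100)  # subset size is w% of the full dataset
    rng = random.Random(seed)
    return rng.sample(sorted_indices[lo:hi], k)


# Toy usage: 50,000 samples (the CIFAR-10 training set size), already sorted.
indices = list(range(50000))
subset = wider_window_subset(indices, start_pct=20, ratio_pct=10, c=1.5)
print(len(subset), starting_points(40, 2))
```

With $c = 1$ the window width equals the subset ratio, so the random draw would return every sample in the window and the procedure would reduce to the contiguous sliding window.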

We observe that training with a wider window, regardless of the constant $c$, results in test accuracy curves similar to those shown in Fig. 3(a). However, the sliding window experiment for wider windows in Fig. 12 shows that the best contiguous window (horizontal lines) achieves better performance than wider windows, especially at high subset ratios. Thus, this result supports the use of a contiguous window subset in sample selection across a broad range of selection ratios.

G.3 Ablation study on difficulty scores

Table 11: Test accuracy of the BWS algorithm at different data selection ratios, depending on the difficulty score. Due to the high correlation between the difficulty scores, there is a similar sorting order across the scores. Thus, similar window positions are selected by BWS, regardless of the specific difficulty score in use. This similarity in subset selection leads to consistently strong performance regardless of the chosen difficulty score.
Selection methods Selection ratio 1% 5% 10% 20% 30% 40% 50% 75% 90%
BWS with C-score Test accuracy 46.25 71.34 82.02 89.12 91.85 93.62 94.62 95.18 95.28
Window index 85% 55% 45% 25% 15% 5% 0% 0% 0%
BWS with EL2N Test accuracy 45.02 71.87 81.79 88.87 91.59 93.39 94.44 95.06 95.32
Window index 80% 60% 45% 25% 10% 5% 0% 0% 0%
BWS with forgetting Test accuracy 46.10 70.70 82.29 88.74 91.80 93.59 94.54 95.23 95.37
(Ours) Window index 90% 70% 55% 30% 15% 5% 0% 0% 0%
Oracle window Test accuracy 47.17 72.89 82.67 89.06 91.80 93.59 94.54 95.23 95.37
Window index 85% 55% 50% 25% 15% 5% 0% 0% 0%

In the implementation of our BWS algorithm, we employ the Forgetting score (Toneva et al., 2019) as the difficulty score. To test the algorithm’s adaptability to alternative difficulty scores, we examine its performance when configured with the EL2N score (Paul et al., 2021) and the C-score (Jiang et al., 2021). Table 11 compares the results obtained by BWS with the EL2N score and the C-score against those achieved with the Forgetting score. Regardless of the difficulty score used, all results are competitive, closely approaching those of the oracle window across a wide range of selection ratios. We conjecture that this arises from a strong correlation between the difficulty scores: the rank correlation between the EL2N score (C-score) and the Forgetting score is as high as 0.8836 (0.8500). This suggests that samples sorted by either alternative score follow an order similar to the one induced by the Forgetting score. As a result, the best windows selected by BWS under the different scores are similar, as shown in Table 11. This consistency shows that the effectiveness of BWS is not limited by the choice of the difficulty score, highlighting its robustness to the sample scores used in sorting.
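As an illustration of how such a rank correlation between two difficulty scores can be computed, the sketch below implements Spearman's rank correlation in plain Python. The text does not state which rank correlation coefficient was used, so the choice of Spearman is an assumption, and the score values are toy numbers rather than data from the experiments.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy scores for six samples (hypothetical values).
forgetting_scores = [0, 0, 1, 3, 5, 8]
el2n_scores = [0.05, 0.10, 0.20, 0.15, 0.60, 0.90]
print(f"rank correlation: {spearman(forgetting_scores, el2n_scores):.4f}")
```

A value close to 1 means the two scores sort the samples in nearly the same order, which is why windows at similar positions would contain largely the same samples.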

Appendix H Full Results

Table 12 compares the starting points of the window subsets selected by BWS with those of the oracle window subsets. BWS finds nearly optimal subsets across a broad range of subset ratios for the CIFAR-10/100 and ImageNet datasets. In Tables 13-20, the oracle window achieves the highest performance among the considered methods for almost every selection ratio and dataset, since it finds the best window by directly measuring and comparing, on the test dataset, the accuracy of models trained on each window. Since the oracle window cannot be implemented in practice due to its significant computational overhead, it is fair to compare performance among the methods excluding the oracle window. Thus, in the tables we highlight the highest and the second-highest values among the methods other than the oracle window.

Table 12: The starting points (%) of the window subsets obtained by BWS, and those of the oracle window subsets. BWS finds nearly optimal subsets across a broad range of subset ratios for all datasets.
Selection ratio 1% 5% 10% 20% 30% 40% 50% 75% 90%
CIFAR-10 BWS(Ours) 90 70 55 30 15 5 0 0 0
Oracle window 85 55 50 25 15 5 0 0 0
CIFAR-100 BWS(Ours) 85 95 75 60 25 10 0 0 0
Oracle window 95 85 80 55 40 20 5 5 0
ImageNet BWS(Ours) 90 75 70 5 0 0 0 0 0
Oracle window 85 65 40 15 5 10 10 0 0
Table 13: Test accuracy of CIFAR-10 dataset on ResNet18. We highlight the highest values in bold and the second-highest values in underscore.
Selection ratio 1% 5% 10% 20% 30% 40% 50% 75% 90% 100%
Forgetting 30.08±1.79 42.39±1.31 54.31±0.23 79.19±0.26 89.13±0.17 93.41±0.05 94.49±0.02 95.31±0.08 95.14±0.04 95.40±0.08
EL2N 15.27±0.36 27.01±0.76 41.27±0.62 71.67±0.82 87.17±0.48 93.24±0.06 94.43±0.13 95.13±0.05 95.26±0.11
LCMat 41.53±0.61 66.86±1.00 77.48±1.62 87.34±0.22 90.72±0.06 92.45±0.05 93.38±0.07 94.90±0.06 95.19±0.01
AdaCore 39.87±0.75 66.40±1.10 77.84±0.49 86.88±0.05 89.90±0.08 91.48±0.24 92.73±0.17 94.47±0.18 95.04±0.23
CCS 31.86±0.72 58.89±1.43 72.61±3.59 86.64±0.35 90.94±0.55 93.00±0.05 94.09±0.17 94.95±0.21 95.29±0.09
Moderate DS 40.67±0.50 67.53±0.75 76.62±1.29 84.86±0.09 88.46±0.07 90.63±0.01 91.52±0.08 93.69±0.21 94.68±0.07
Random 39.10±0.14 67.14±0.29 78.43±0.72 86.87±0.31 89.91±0.31 91.66±0.06 92.83±0.04 94.40±0.05 95.08±0.19
BWS (Ours) 46.10±2.68 70.70±0.53 82.29±0.35 88.74±0.18 91.80±0.03 93.59±0.17 94.54±0.06 95.23±0.08 95.37±0.07
Oracle window 47.17±0.25 72.89±1.05 82.67±0.43 89.06±0.34 91.80±0.03 93.59±0.17 94.54±0.06 95.23±0.08 95.37±0.07
Table 14: Test accuracy of CIFAR-100 dataset on ResNet50. We highlight the highest values in bold and the second-highest values in underscore.
Selection ratio 1% 5% 10% 20% 30% 40% 50% 75% 90% 100%
Forgetting 7.01±0.50 20.69±1.13 34.22±1.27 50.95±0.78 61.54±1.06 68.92±0.87 73.84±0.95 78.55±0.44 79.69±0.19 78.81±0.13
EL2N 3.40±0.12 8.15±0.17 14.06±0.48 28.14±1.21 48.13±1.77 52.25±5.85 71.72±0.17 77.33±0.70 78.96±0.10
LCMat 8.43±0.44 28.51±0.65 42.81±0.31 55.77±1.45 64.39±1.02 67.22±0.96 73.11±0.81 77.51±0.37 78.47±0.65
AdaCore 5.56±0.14 22.76±1.20 39.56±2.53 56.81±1.60 65.30±0.64 70.51±0.64 71.18±1.00 76.62±0.47 78.37±0.32
CCS 7.49±0.66 24.34±0.35 40.81±2.11 56.81±1.81 63.35±0.40 67.70±0.64 71.04±0.52 74.94±0.73 78.13±0.31
Moderate DS 6.05±0.29 24.53±1.28 42.23±3.03 54.72±1.76 64.71±1.27 68.71±2.45 72.61±0.31 75.80±0.48 78.48±0.13
Random 5.89±0.52 23.76±1.12 42.03±1.56 55.03±1.17 65.98±0.50 69.23±1.04 72.37±0.13 76.53±0.52 78.29±0.22
BWS (Ours) 8.43±0.49 29.25±1.15 44.11±3.13 58.30±0.65 67.20±1.67 72.17±0.42 73.83±0.72 77.78±0.55 79.00±0.29
Oracle window 10.63±0.87 30.39±2.87 45.16±1.28 58.91±0.52 67.51±1.22 72.70±0.50 75.00±0.23 78.42±0.14 79.00±0.29
Table 15: Test accuracy of ImageNet dataset on ResNet50. We highlight the highest values in bold and the second-highest values in underscore.
Selection ratio 1% 5% 10% 20% 30% 40% 50% 75% 90% 100%
Forgetting 4.78±0.10 28.18±0.46 45.84±0.67 60.75±0.60 67.48±0.11 70.26±0.48 72.73±0.09 74.63±0.13 75.53±0.06 75.85±0.07
EL2N 2.10±0.08 9.80±0.03 20.42±0.47 41.14±0.04 54.42±0.39 63.19±0.29 68.19±0.13 73.91±0.36 74.79±0.27
Memorization 0.52±0.04 9.70±0.21 23.80±0.31 44.58±0.09 59.66±0.06 65.92±0.04 70.22±0.02 74.56±0.24 74.94±0.16
SSL Prototype 1.33±0.21 20.07±1.39 37.98±0.08 55.25±1.02 61.97±0.25 66.58±0.28 68.85±0.19 73.43±0.29 74.63±0.28
LCMat 6.01±0.31 32.26±0.84 46.08±0.64 59.02±0.36 65.28±0.21 68.50±0.56 70.30±0.46 74.13±0.12 74.81±0.02
AdaCore 6.01±0.44 31.52±0.58 46.98±0.80 59.26±1.58 65.18±0.05 68.28±0.05 70.72±0.04 73.53±0.13 74.69±0.00
CCS 5.04±0.40 31.83±0.62 46.64±1.08 58.77±0.80 64.85±0.12 67.82±0.24 69.89±0.24 73.57±0.12 74.59±0.03
Moderate DS 5.97±0.60 32.47±0.21 47.83±0.11 58.86±0.14 64.71±0.01 67.47±0.03 69.73±0.08 73.16±0.25 74.67±0.07
Random 6.14±0.01 33.17±0.11 45.87±0.07 59.19±0.04 65.94±0.38 68.23±0.00 70.14±0.31 73.74±0.14 74.83±0.08
BWS (Ours) 7.61±0.84 33.96±1.08 46.64±0.20 62.08±0.51 67.28±0.20 70.53±0.16 72.63±0.14 74.67±0.10 75.28±0.28
Oracle window 7.89±0.27 33.98±0.57 49.21±0.76 62.62±0.30 68.27±0.56 71.35±0.24 72.91±0.38 74.67±0.10 75.28±0.28
Table 16: Test accuracy of CIFAR-10 dataset by training a simple CNN architecture. We highlight the highest values in bold and the second-highest values in underscore.
Selection ratio 1% 5% 10% 20% 30% 40% 50% 75% 90% 100%
Forgetting 34.01±0.45 46.28±0.67 55.04±0.55 67.98±0.29 75.27±0.08 80.19±0.31 83.45±0.23 86.92±0.17 87.66±0.29 87.64±0.14
EL2N 16.60±0.86 30.58±0.27 42.90±0.18 62.90±0.15 73.67±0.50 79.43±0.36 82.83±0.28 86.72±0.35 87.86±0.09
LCMat 46.42±0.23 65.74±0.65 72.54±0.42 78.19±0.07 81.15±0.11 83.36±0.05 84.65±0.01 86.70±0.29 87.73±0.26
AdaCore 46.72±0.21 66.69±0.43 73.52±0.49 78.64±0.27 81.62±0.16 83.38±0.49 84.71±0.05 86.59±0.21 87.17±0.22
CCS 39.50±0.96 59.86±0.21 68.89±0.44 76.07±0.57 80.55±0.14 83.21±0.44 84.85±0.08 86.78±0.37 87.46±0.21
Moderate DS 48.68±0.46 66.61±0.29 72.64±0.25 76.82±0.28 79.72±0.17 81.35±0.28 82.78±0.24 85.64±0.27 86.91±0.10
Random 46.70±0.91 66.27±0.48 73.65±0.53 78.63±0.37 81.70±0.36 83.04±0.16 84.55±0.10 86.43±0.05 87.23±0.09
BWS (Ours) 52.48±0.42 69.71±0.37 76.17±0.30 80.69±0.06 82.58±0.20 84.34±0.49 84.76±0.22 86.86±0.05 87.68±0.06
Oracle window 53.08±0.29 69.71±0.37 76.17±0.30 80.69±0.06 82.59±0.20 84.34±0.49 85.09±0.16 86.86±0.05 87.68±0.06
Table 17: Test accuracy of CIFAR-10 dataset by training EfficientNet-B0 architecture. We highlight the highest values in bold and the second-highest values in underscore.
Selection ratio 1% 5% 10% 20% 30% 40% 50% 75% 90% 100%
Forgetting 26.22±0.74 31.84±2.06 39.56±3.81 62.36±6.14 77.14±6.16 84.53±1.51 89.60±1.15 92.23±0.06 92.60±0.22 92.60±0.12
EL2N 14.67±0.28 24.97±0.84 32.32±2.63 60.11±2.11 76.03±0.88 84.82±0.56 88.95±0.74 91.04±0.15 92.01±0.40
LCMat 29.12±2.16 54.45±4.23 67.35±4.97 76.52±1.82 84.88±0.53 87.20±1.08 89.02±0.16 91.47±0.20 92.46±0.31
AdaCore 31.50±2.29 52.96±5.90 69.13±2.58 79.65±0.45 85.06±0.93 86.29±0.96 87.71±1.37 90.64±0.99 91.95±0.10
CCS 26.38±1.58 45.78±4.05 59.21±6.03 76.05±4.40 83.85±1.55 87.93±0.72 88.99±0.52 91.80±0.20 92.27±0.38
Moderate DS 35.10±0.78 55.24±2.45 68.01±3.00 78.91±1.42 80.93±2.77 85.59±0.66 85.26±1.32 89.96±0.48 91.81±0.21
Random 34.34±3.70 48.49±8.17 62.80±8.18 80.71±0.37 84.56±0.46 85.86±0.47 89.10±0.16 91.40±0.18 92.02±0.24
BWS (Ours) 35.86±3.15 58.06±4.62 75.88±1.07 82.46±0.67 86.16±0.49 88.23±0.50 89.84±0.69 92.06±0.21 92.45±0.29
Oracle window 42.65±0.66 64.18±4.56 75.88±1.07 83.47±1.32 86.16±0.49 88.23±0.50 89.84±0.69 92.06±0.21 92.45±0.29
Table 18: Test accuracy of CIFAR-10 dataset by fine-tuning ViT pretrained on ImageNet. We highlight the highest values in bold and the second-highest values in underscore.
Selection ratio 1% 5% 10% 20% 100%
Forgetting 41.77±8.60 96.95±0.31 98.05±0.12 98.54±0.04 98.60±0.03
EL2N 36.13±8.15 94.90±0.61 97.45±0.26 98.27±0.01
LCMat 66.89±4.74 95.96±0.10 97.47±0.13 98.02±0.10
AdaCore 64.65±4.38 96.17±0.30 97.24±0.17 97.87±0.13
CCS 56.84±7.70 96.77±0.11 97.88±0.14 98.28±0.10
Moderate DS 66.84±3.38 95.87±0.06 97.17±0.14 97.73±0.05
Random 66.34±4.61 96.37±0.21 97.42±0.12 98.01±0.11
BWS (Ours) 71.42±3.54 97.05±0.23 98.03±0.07 98.45±0.04
Oracle window 73.81±2.00 97.16±0.25 98.06±0.09 98.45±0.04
Table 19: Test accuracy of 20% label-noise CIFAR-10 dataset. We highlight the highest values in bold and the second-highest values in underscore.
Selection ratio 10% 20% 30% 40% 100%
Forgetting 49.82±0.87 64.52±1.60 69.89±1.10 72.21±0.10 82.66±0.00
EL2N 7.02±0.22 9.48±0.15 21.11±0.19 38.91±0.87
LCMat 59.20±1.21 71.54±1.28 77.77±0.48 81.23±0.42
AdaCore 10.96±0.10 10.85±0.26 39.14±0.65 58.51±0.20
CCS 56.87±0.52 72.20±0.60 77.16±0.61 80.22±0.73
Moderate DS 78.75±0.32 86.53±0.24 89.61±0.32 91.35±0.21
Random 64.24±1.35 75.45±0.67 79.58±0.70 81.99±0.27
BWS (Ours) 78.74±0.56 84.77±0.28 88.06±0.03 89.79±0.08
Oracle window 81.01±0.21 87.32±0.12 90.06±0.06 91.48±0.09
Table 20: Test accuracy of 40% label-noise CIFAR-10 dataset. We highlight the highest values in bold and the second-highest values in underscore.
Selection ratio 10% 20% 30% 40% 100%
Forgetting 53.90±1.26 66.95±1.09 70.78±0.18 69.65±0.39 72.71±0.40
EL2N 5.75±0.19 5.81±0.17 5.68±0.27 8.63±0.14
LCMat 43.32±0.44 57.73±1.35 65.00±0.39 69.11±0.31
AdaCore 9.76±0.48 10.03±0.09 10.42±0.32 10.73±0.21
CCS 52.08±2.15 64.65±1.34 70.08±0.19 72.09±0.18
Moderate DS 77.69±0.97 86.21±0.19 88.63±0.13 83.42±0.16
Random 50.33±0.40 60.21±1.28 65.01±0.60 67.68±0.65
BWS (Ours) 78.40±0.37 85.42±0.08 87.76±0.17 88.97±0.16
Oracle window 80.58±0.41 86.06±0.17 88.12±0.28 88.97±0.16
Table 21: Comparison of window subsets of CIFAR-10 in terms of their 1) test accuracy, measured on models trained with the window subsets (top rows) and 2) accuracy of kernel ridge regression on the training dataset (bottom rows). The best performing windows align well between the two measures.
Ratio Starting point 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90%
10% Test Acc 56.34 58.34 68.54 70.24 74.77 78.54 81.32 81.71 82.24 82.46 82.67 82.29 80.95 80.53 79.75 77.99 77.41 74.88 71.09
Regression Acc 56.93 59.32 61.41 63.94 65.01 65.93 67.03 67.25 67.55 67.81 67.74 67.81 67.73 67.55 67.27 66.91 66.65 65.98 65.10
20% Test Acc 79.08 83.39 86.03 87.79 88.33 89.06 88.74 88.42 88.06 87.23 86.86 86.12 85.37 84.29 83.10 82.09 80.42 - -
Regression Acc 76.27 77.07 77.85 78.98 79.41 79.69 79.83 79.79 79.61 79.49 79.29 79.06 78.86 78.63 78.35 78.16 77.81 - -
30% Test Acc 89.45 91.14 91.77 91.80 91.67 90.94 90.68 89.97 89.47 88.92 88.13 87.47 86.69 85.78 84.47 - - - -
Regression Acc 83.81 84.23 84.42 84.49 84.47 84.34 84.20 84.00 83.85 83.70 83.50 83.33 83.17 83.05 82.86 - - - -
40% Test Acc 93.08 93.59 93.39 93.00 92.46 91.63 91.11 90.54 89.88 89.02 88.46 87.76 86.96 - - - - - -
Regression Acc 87.42 87.48 87.39 87.30 87.19 87.04 86.88 86.71 86.58 86.42 86.29 86.18 86.02 - - - - - -
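The alignment between the two measures can be illustrated with the 20% rows of Table 21. The sketch below picks the starting point that maximizes the regression accuracy (the criterion BWS uses to choose a window) and the one that maximizes the test accuracy (the oracle window). The variable names are illustrative, and the values are transcribed from the table.

```python
# Starting points (%) and the 20%-ratio rows of Table 21 (CIFAR-10).
STARTS = list(range(0, 85, 5))  # 0%, 5%, ..., 80%
TEST_ACC = [79.08, 83.39, 86.03, 87.79, 88.33, 89.06, 88.74, 88.42, 88.06,
            87.23, 86.86, 86.12, 85.37, 84.29, 83.10, 82.09, 80.42]
REG_ACC = [76.27, 77.07, 77.85, 78.98, 79.41, 79.69, 79.83, 79.79, 79.61,
           79.49, 79.29, 79.06, 78.86, 78.63, 78.35, 78.16, 77.81]


def argmax(values):
    """Index of the first maximum."""
    return max(range(len(values)), key=values.__getitem__)


bws_start = STARTS[argmax(REG_ACC)]      # window chosen by regression accuracy
oracle_start = STARTS[argmax(TEST_ACC)]  # window with the best test accuracy
bws_test_acc = TEST_ACC[STARTS.index(bws_start)]

print(bws_start, oracle_start)                  # 30 25
print(round(max(TEST_ACC) - bws_test_acc, 2))   # 0.32
```

The two starting points, 30% and 25%, match the CIFAR-10 entries for the 20% ratio in Table 12, and the window chosen by regression accuracy is 0.32 points below the oracle window in test accuracy.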
Table 22: Comparison of window subsets of CIFAR-100 in terms of their 1) test accuracy, measured on models trained with the window subsets (top rows) and 2) accuracy of kernel ridge regression on the training dataset (bottom rows). The best performing windows align well between the two measures.
Ratio Starting point 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90%
10% Test Acc 35.64 34.81 28.31 30.53 33.22 33.07 32.12 38.03 34.69 37.98 38.36 43.06 42.24 41.73 43.60 44.11 45.16 42.77 42.70
Regression Acc 11.12 11.23 11.02 10.89 10.96 11.28 12.11 12.34 12.50 12.84 12.88 13.12 13.48 13.60 13.99 14.24 14.16 14.02 14.06
20% Test Acc 48.78 52.02 49.43 53.03 54.39 54.99 56.60 56.51 57.08 56.83 58.70 58.91 58.30 56.26 56.34 56.71 52.96 - -
Regression Acc 25.62 25.58 25.78 25.76 26.48 27.00 27.36 27.59 27.67 27.85 28.17 28.29 28.51 28.44 28.23 27.93 27.30 - -
30% Test Acc 62.25 61.17 62.66 63.78 66.62 67.20 66.56 65.33 67.51 66.61 63.93 64.88 63.09 60.47 59.06 - - - -
Regression Acc 43.27 43.46 43.83 43.85 44.11 44.37 44.22 44.23 44.01 43.78 43.56 43.21 42.56 41.95 40.92 - - - -
40% Test Acc 70.50 69.91 72.17 70.78 72.70 72.11 69.79 71.38 69.06 68.87 67.83 66.25 63.98 - - - - - -
Regression Acc 54.05 54.36 54.44 54.30 53.96 53.67 53.25 52.74 52.23 51.56 50.72 49.73 48.64 - - - - - -
Table 23: Comparison of window subsets of ImageNet in terms of their 1) test accuracy, measured on models trained with the window subsets (top rows) and 2) accuracy of kernel ridge regression on the training dataset (bottom rows).
Ratio Starting point 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90%
10% Test Acc 46.78 46.34 46.49 46.89 45.50 47.07 45.82 48.07 49.21 48.00 48.62 47.76 46.73 46.67 46.64 45.03 43.43 40.08 32.68
Regression Acc 35.05 35.45 35.71 35.84 35.74 35.72 35.89 35.99 36.03 36.06 36.05 36.04 36.19 36.23 36.24 36.17 35.90 35.01 32.43
20% Test Acc 61.02 62.08 60.60 62.62 61.49 62.38 62.29 61.34 61.46 61.02 59.77 59.46 58.53 56.91 54.43 52.37 46.93 - -
Regression Acc 44.72 44.91 44.83 44.72 44.59 44.54 44.35 44.21 44.12 43.94 43.89 43.84 43.75 43.65 43.39 42.81 41.30 - -
30% Test Acc 67.28 68.27 67.89 67.90 68.00 67.81 67.85 66.87 66.91 65.68 64.53 62.77 61.66 58.88 55.25 - - - -
Regression Acc 52.19 52.14 51.93 51.70 51.52 51.26 51.04 50.83 50.65 50.46 50.30 50.14 49.86 49.39 48.36 - - - -
40% Test Acc 70.53 70.65 71.35 71.27 70.87 70.31 69.87 68.79 68.42 67.05 66.11 63.92 61.54 - - - - - -
Regression Acc 53.48 53.41 53.23 52.98 52.74 52.54 52.27 52.05 51.87 51.67 51.42 50.99 50.14 - - - - - -
Table 24: Comparison of window subsets of CIFAR-10 dataset with 20% label noise in terms of their 1) test accuracy, measured on models trained with the window subsets (top rows) and 2) accuracy of kernel ridge regression on the training dataset (middle rows). We also report the noise portion in each window subset (bottom rows). The alignment of the best window between the two measures becomes less accurate than in the case without label noise, since our method (regression) tends to choose easier samples. However, this tendency also makes the chosen window subset consist mostly of clean-label samples.
Ratio Starting point 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90%
10% Test Acc 6.98 8.93 13.80 31.90 58.00 68.77 75.21 78.29 79.18 79.04 79.85 79.99 81.01 80.86 79.72 79.53 78.74 77.81 75.71
Regression Acc 51.54 55.95 58.10 59.52 60.42 61.68 63.86 64.94 66.11 67.58 68.74 69.64 69.73 70.09 70.20 70.43 70.54 70.13 69.73
Noise Portion 92% 85% 69% 39% 15% 8% 5% 4% 4% 4% 3% 3% 3% 3% 3% 3% 3% 3% 3%
20% Test Acc 9.56 17.79 35.29 63.55 78.99 84.46 86.52 87.07 87.32 87.17 86.85 86.47 86.32 85.36 84.77 84.08 82.82 - -
Regression Acc 73.28 75.17 76.06 77.03 78.38 78.66 79.05 79.63 80.01 80.47 80.67 80.67 80.66 80.63 80.70 80.60 80.35 - -
Noise Portion 80% 62% 42% 24% 10% 6% 5% 4% 4% 3% 3% 3% 3% 3% 3% 3% 3% - -
30% Test Acc 20.90 38.99 62.51 79.74 87.82 89.80 90.06 89.83 89.34 88.94 88.46 88.06 87.66 86.93 86.04 - - - -
Regression Acc 80.43 81.39 82.33 83.20 83.45 83.84 84.10 84.30 84.36 84.55 84.47 84.60 84.39 84.41 84.16 - - - -
Noise Portion 59% 44% 30% 17% 8% 5% 4% 4% 3% 3% 3% 3% 3% 3% 3% - - - -
40% Test Acc 38.21 59.90 77.70 87.11 90.70 91.48 91.32 90.75 90.10 89.79 89.04 88.84 88.16 - - - - - -
Regression Acc 86.19 86.50 86.63 86.87 86.96 87.11 87.21 87.16 87.29 87.39 87.25 87.25 87.16 - - - - - -
Noise Portion 45% 34% 23% 14% 7% 5% 4% 4% 3% 3% 3% 3% 3% - - - - - -
Table 25: Comparison of window subsets of CIFAR-10 dataset with 40% label noise in terms of their 1) test accuracy, measured on models trained with the window subsets (top rows) and 2) accuracy of kernel ridge regression on the training dataset (middle rows). We also report the noise portion in each window subset (bottom rows). The alignment of the best window between the two measures becomes less accurate than in the case without label noise, since our method (regression) tends to choose easier samples. However, this tendency also makes the chosen window subset consist mostly of clean-label samples.
Ratio Starting point 0% 5% 10% 15% 20% 25% 30% 35% 40% 45% 50% 55% 60% 65% 70% 75% 80% 85% 90%
10% Test Acc 6.28 6.12 6.82 7.82 7.60 9.81 21.05 45.22 62.11 73.62 77.57 78.94 80.58 79.96 79.59 78.40 77.92 77.07 75.51
Regression Acc 1.87 1.88 1.47 2.16 3.13 12.03 25.36 46.69 56.92 61.44 63.84 65.21 65.17 65.18 65.45 65.63 65.11 64.73 64.03
Noise Portion 92% 93% 92% 90% 88% 82% 61% 35% 21% 15% 13% 11% 10% 9% 9% 8% 7% 8% 9%
20% Test Acc 5.61 6.62 6.84 8.17 13.36 28.17 52.85 72.56 82.01 85.09 86.06 85.59 85.42 84.86 83.92 83.01 82.05 - -
Regression Acc 0.89 0.98 1.35 3.32 14.45 39.13 60.53 69.82 73.12 74.90 75.85 75.86 75.93 75.56 75.27 74.93 74.35 - -
Noise Portion 92% 91% 90% 86% 74% 58% 41% 25% 17% 13% 11% 10% 9% 8% 8% 8% 8% - -
30% Test Acc 5.46 6.71 10.28 17.60 35.26 56.16 72.82 83.20 87.32 88.12 87.76 87.61 86.86 86.17 85.23 - - - -
Regression Acc 0.48 0.98 4.05 18.44 51.29 71.18 79.48 82.93 84.46 85.28 85.45 85.38 85.03 84.82 84.45 - - - -
Noise Portion 90% 88% 80% 69% 56% 44% 32% 21% 14% 12% 11% 9% 9% 8% 8% - - - -
40% Test Acc 8.50 12.79 22.46 39.42 58.61 72.52 81.96 87.17 88.90 88.97 88.63 88.11 87.28 - - - - - -
Regression Acc 2.85 11.02 30.74 60.30 79.54 86.17 88.11 88.82 89.34 89.46 89.36 89.16 88.91 - - - - - -
Noise Portion 83% 75% 65% 55% 46% 36% 26% 18% 13% 11% 10% 9% 9% - - - - - -
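The noise portion reported in Tables 24 and 25 is the fraction of noisy-label samples inside each window. A minimal sketch of that computation is given below, assuming a boolean list `is_noisy_sorted` that marks noisy labels for samples already sorted from the most to the least difficult; the function name and the toy data are hypothetical.

```python
def noise_portion_per_window(is_noisy_sorted, ratio_pct, step=5):
    """Fraction of noisy-label samples in each window of width ratio_pct%."""
    n = len(is_noisy_sorted)
    width = round(n * ratio_pct / 100)
    portions = {}
    for start_pct in range(0, 100 - ratio_pct + 1, step):
        lo = round(n * start_pct / 100)
        window = is_noisy_sorted[lo:lo + width]
        portions[start_pct] = sum(window) / len(window)
    return portions


# Toy example: 100 samples, with the 20 most difficult ones carrying noisy labels.
is_noisy_sorted = [True] * 20 + [False] * 80
portions = noise_portion_per_window(is_noisy_sorted, ratio_pct=10)
print(portions[0], portions[15], portions[20])  # 1.0 0.5 0.0
```

In this toy setting the noisy samples are concentrated at the difficult end of the ordering, which mirrors the trend in the tables: windows starting later in the ordering contain a smaller share of noisy labels.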