Scaling laws for learning with real and surrogate data
Abstract
Collecting large quantities of high-quality data can be prohibitively expensive or impractical, and is a bottleneck in machine learning. One may instead augment a small set of data points from the target distribution with data from more accessible sources, e.g. data collected under different circumstances or synthesized by generative models. We refer to such data as ‘surrogate data.’ We introduce a weighted empirical risk minimization (ERM) approach for integrating surrogate data into training. We analyze this method mathematically under several classical statistical models, and validate our findings empirically on datasets from different domains. Our main findings are: (i) Integrating surrogate data can significantly reduce the test error on the original distribution. Surprisingly, this can happen even when the surrogate data is unrelated to the original data. We trace this behavior back to the classical Stein’s paradox. (ii) In order to reap the benefit of surrogate data, it is crucial to use optimally weighted ERM. (iii) The test error of models trained on mixtures of real and surrogate data is approximately described by a scaling law. This scaling law can be used to predict the optimal weighting scheme, and to choose the amount of surrogate data to add.
1 Introduction and overview
1.1 Motivation and formulation
Consider a standard learning setting where we are given $n$ i.i.d. data points from a target distribution. Given a rich parametric family of models, the goal is to find the parameter that minimizes the expected test loss between the model predictions and data generated from the target distribution. In many application domains, the available data from the target distribution, referred to as either real or original data, may be difficult or expensive to acquire. One may then attempt to supplement these data with a different, cheaper source. Examples of such cheaper sources are publicly available datasets; datasets owned by the same research group or company but acquired in different circumstances, e.g. in a different location; and synthetic data produced by a generative model.
We refer to the data points obtained from this source as surrogate data, and assume we have $m$ of them. To formalize, we assume that the surrogate data are i.i.d. samples from a second distribution. In general, we will not assume the distribution of the surrogate data to be close to the original data distribution; however, we assume that the two distributions are over the same domain. A number of questions arise: How should we use the surrogate data in training? How many surrogate samples should we add to the original data? Can we predict the improvement in test error achieved by adding surrogate samples to the training?
A natural approach would be to add the surrogate data to the original data in the usual training procedure, and indeed many authors have explored this approach (see Section 1.3). Namely, one attempts to minimize the overall empirical risk, i.e. the train loss averaged over all $n+m$ pooled samples.
However, a moment of reflection reveals that this approach has serious shortcomings. Consider a simple Gaussian mean estimation problem with squared loss, in which the original and surrogate distributions have different means. A straightforward calculation yields that the test error of the empirical risk minimizer is
(1) |
As $m$ increases, the variance (the second term) decreases, but the bias due to the difference between the two means increases, and the error approaches the one obtained with infinitely many surrogate samples, i.e. the model will be only as good as if training only on surrogate data.
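The bias-variance trade-off of the unweighted approach can be checked numerically. The sketch below is illustrative only and rests on assumptions that are not legible in this copy of the text: unit-covariance Gaussian noise in $d$ dimensions, squared loss, and a distance `delta_norm` between the original and surrogate means. Under these assumptions the pooled sample mean has excess error $(m/(n+m))^2\,\texttt{delta\_norm}^2 + d/(n+m)$. All function names are hypothetical.

```python
import numpy as np


def pooled_mean_error_formula(n, m, d, delta_norm):
    """Excess test error of the pooled (unweighted) sample mean.

    Assumes unit-variance Gaussian noise in d dimensions and a distance
    delta_norm between the original and surrogate means.
    """
    bias_sq = (m / (n + m)) ** 2 * delta_norm ** 2  # grows with m
    variance = d / (n + m)                          # shrinks with m
    return bias_sq + variance


def pooled_mean_error_simulated(n, m, d, delta_norm, n_trials=500, seed=0):
    """Monte Carlo estimate of E||theta_hat - theta_star||^2."""
    rng = np.random.default_rng(seed)
    theta_star = np.zeros(d)
    theta_surr = np.zeros(d)
    theta_surr[0] = delta_norm  # surrogate mean at distance delta_norm
    errors = np.empty(n_trials)
    for t in range(n_trials):
        z = theta_star + rng.standard_normal((n, d))
        z_s = theta_surr + rng.standard_normal((m, d))
        theta_hat = np.vstack([z, z_s]).mean(axis=0)  # unweighted ERM
        errors[t] = np.sum((theta_hat - theta_star) ** 2)
    return errors.mean()


if __name__ == "__main__":
    # As m grows, the error approaches delta_norm**2 (surrogate-only quality).
    for m in (0, 20, 200, 2000):
        print(m,
              pooled_mean_error_formula(20, m, 5, 1.0),
              pooled_mean_error_simulated(20, m, 5, 1.0))
```

With $n$ fixed, the formula tends to `delta_norm ** 2` as $m$ grows, which is the degradation described above.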
In order to overcome these limitations, we propose a weighted ERM approach, and will show that the weight plays a crucial role. Namely, we consider the following regularized empirical risk:
(2) |
where $\alpha$ is the weight of the surrogate dataset and the last term is a regularizer, e.g. a ridge penalty. We denote by
(3) |
the corresponding empirical risk minimizer, and study the resulting test error on the original distribution.
For supervised learning tasks, a sample is a pair consisting of a covariate vector and a response variable, and the parameter indexes a family of models that predict the response given the covariate vector. We consider train and test losses that are functions of the response and of the model prediction. We allow for the test loss to be different from the train loss, but we will omit the subscript ‘test’ whenever clear from the context.
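In practice, a weighted ERM of this kind can be implemented by pooling the two datasets and attaching a weight to each sample. The sketch below is a minimal illustration, not the authors' implementation. It assumes the normalization in which each original sample receives weight $(1-\alpha)/n$ and each surrogate sample weight $\alpha/m$, a logistic train loss with labels in $\{0, 1\}$, and a ridge regularizer; the function names and the plain gradient-descent solver are hypothetical choices.

```python
import numpy as np


def surrogate_sample_weights(n, m, alpha):
    """Per-sample weights: (1 - alpha)/n for originals, alpha/m for surrogates."""
    return np.concatenate([np.full(n, (1.0 - alpha) / n),
                           np.full(m, alpha / m)])


def weighted_logistic_erm(X, y, X_s, y_s, alpha, lam=1e-3, lr=0.5, n_steps=500):
    """Minimize sum_i w_i * logloss_i + lam * ||theta||^2 by gradient descent.

    X, y are the original covariates/labels, X_s, y_s the surrogate ones.
    Labels are assumed to take values in {0, 1}.
    """
    w = surrogate_sample_weights(len(y), len(y_s), alpha)
    X_all = np.vstack([X, X_s])
    y_all = np.concatenate([y, y_s])
    theta = np.zeros(X_all.shape[1])
    for _ in range(n_steps):
        p = 1.0 / (1.0 + np.exp(-X_all @ theta))          # predicted probabilities
        grad = X_all.T @ (w * (p - y_all)) + 2.0 * lam * theta
        theta -= lr * grad
    return theta
```

Setting `alpha=0.0` recovers training on original data only, and `alpha=1.0` training on surrogate data only. The same weights could be passed to any learner that accepts per-sample weights.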
![Refer to caption](x1.png)
Figure 1 provides a preview of our results for a sentiment analysis task (technical details are provided in Section 4 and Appendix A.2). Each frame corresponds to a different combination of $n$ and $m$, and we report the test error of our approach as a function of the weight parameter $\alpha$ (red circles). Solid lines report the prediction of a scaling law that will be one of the main results presented below.
We observe that the weighted ERM approach systematically achieves better test error than either training only on original data ($\alpha = 0$) or only on surrogate data ($\alpha = 1$). Further, the error for optimal $\alpha$ is always monotone decreasing both in $n$ and $m$, and the approach outperforms the naive unweighted approach. Also, while scaling laws typically do not capture the dependence on hyperparameters, the scaling law presented below predicts the dependence on $\alpha$ reasonably well. This is particularly useful, because it can be used to tune $\alpha$ optimally and to predict the amount of surrogate data needed.
1.2 Summary of results
We study the method outlined above both mathematically and via numerical experiments. Our mathematical results are developed in four different settings: (i) the Gaussian sequence model (Section 3.1); (ii) a non-parametric function estimation setting (Section 3.2); (iii) low-dimensional empirical risk minimization (Section 3.3); (iv) high-dimensional ridge regression (Section 3.4).
We carry out experiments with the following data sources: (i) Simulated data from linear or Gaussian mixture models: this allows us to explicitly control the distribution shift between the original and surrogate datasets, as well as to check our theoretical results in a controlled setting. (ii) Real natural language processing (NLP) data for sentiment analysis, with the role of original dataset played by IMDB reviews and the role of surrogate datasets played respectively by Rotten Tomatoes reviews and Goodreads book reviews. (iii) Progression-free survival analysis using Lasso on the TCGA PanCancer dataset, with data from female patients and male patients as original and surrogate data, respectively. (iv) Real image classification data, with the CIFAR-10 and CIFAR-100 datasets respectively playing the role of original and surrogate data. Our results support the following conclusions:
Surrogate data improve test error. Including surrogate data in training generally improves the test error on the original data, even if the surrogate data distribution is far from the original one. In agreement with the interpretation of surrogate data as a regularizer (see also Section 2), the effect is generally positive, although its size depends on the data distributions.
Tuning of $\alpha$. The above conclusion holds under the condition that $\alpha$ can be tuned (nearly) optimally. For each of the theoretical settings already mentioned, we characterize this optimal value. We verify that a nearly optimal $\alpha$ can be effectively selected by minimizing the error on a validation split of the original data. An attractive alternative is to use the scaling law we discuss next.
Scaling law. We propose a scaling law that captures the behavior of the test error as a function of $n$, $m$, and $\alpha$:
(4) |
Here $R_*$ is the minimal (Bayes) error, $R^{\rm ex}_{\rm su}(m)$ is the excess test error when training on the surrogate data (and testing on original), $R^{\rm ex}_{\rm or}(n)$ is the excess test error when training on original data (and testing on original), and $\beta$ is a scaling exponent as described in Section 4. (Footnote 1: We assume here that $R^{\rm ex}_{\rm or}(\infty) = 0$, i.e. that we achieve the Bayes risk with infinitely many original samples. See Section 5.) The above scaling admits natural generalizations, see Section 5.
Practical uses of the scaling law. Given $n$ original data points and a source of surrogate data, we would like to predict how much the test error can be decreased by adding any number $m$ of surrogate samples to the mix. The scaling law (4) suggests a simple approach: (1) learn models on purely original data to extract the behavior of the test loss as a function of $n$; (2) learn models on purely surrogate data to extract the behavior of the test loss as a function of $m$ (a relatively small sample is sufficient for this step); (3) use the minimum over $\alpha$ of Eq. (4) to predict the test error at any given pair $(n, m)$.
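Steps (1) and (2) above amount to fitting a learning curve to test errors measured at several sample sizes. The sketch below shows one way this could be done, under the assumption (not stated explicitly in this copy of the text) that each curve is well described by a power law of the form `error ≈ r_floor + A * size ** (-beta)`. The function name, the grid scan over the asymptotic error `r_floor`, and the log-log least-squares fit are illustrative choices, not the authors' procedure. Step (3) additionally requires the explicit form of Eq. (4) and is not covered here.

```python
import numpy as np


def fit_power_law(sizes, errors, floor_grid):
    """Fit errors ≈ r_floor + A * sizes**(-beta).

    For each candidate asymptotic error r_floor in floor_grid, fit
    log(errors - r_floor) linearly in log(sizes), and keep the candidate
    with the smallest squared residual on the original scale.
    Returns (r_floor, A, beta).
    """
    sizes = np.asarray(sizes, dtype=float)
    errors = np.asarray(errors, dtype=float)
    best = None
    for r_floor in floor_grid:
        excess = errors - r_floor
        if np.any(excess <= 0):
            continue  # the floor cannot exceed an observed error
        slope, intercept = np.polyfit(np.log(sizes), np.log(excess), 1)
        pred = r_floor + np.exp(intercept) * sizes ** slope
        sse = np.sum((pred - errors) ** 2)
        if best is None or sse < best[0]:
            best = (sse, float(r_floor), float(np.exp(intercept)), float(-slope))
    if best is None:
        raise ValueError("no candidate floor lies below all observed errors")
    return best[1], best[2], best[3]
```

For the original data the fitted floor plays the role of the Bayes error under the assumption of footnote 1, while for surrogate data the floor stays above it whenever the two distributions differ.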
1.3 Related work
The use of surrogate data to enhance training has attracted increasing research effort, in part because of recent progress in generative modeling.
This line of work has largely focused on the techniques to generate synthetic data that are well suited for training. A broad variety of methods have been demonstrated to be useful to generate data for computer vision tasks, ranging from object classification to semantic segmentation [RSM+16, JRBM+17, AAMM+18, TPA+18, CLCG19, HSY+22, MPT+22, YCFB+22]. We refer to [SLW20] for a review. More recently, synthetic data have been used for training in natural language processing [HNK+22, MHZH22].
Scaling laws have been broadly successful in guiding the development of large machine learning models [HNA+17, RRBS19, HKK+20, KMH+20, TDR+21, HKHM21, HBM+22, ANZ22, MRB+23]. We expect them to be similarly useful in integrating heterogeneous data into training. The change of scaling laws when training on synthetic data was the object of a recent empirical study [FCK+23]. On the other hand, no systematic attempt has been made at integrating real and synthetic data.
2 Regularization, Gaussian mean estimation, Stein paradox
The role of the parameter $\alpha$ can be understood by considering the limit $m \to \infty$. In this limit, the surrogate term in the empirical risk (2) is replaced by the population risk for surrogate data. This suggests thinking of the surrogate data as an additional (highly non-trivial) regularizer, whose strength is controlled by $\alpha$. This leads to a simple yet important insight: adding surrogate data to the original data is beneficial if $\alpha$ is chosen optimally, and large $m$ reduces statistical fluctuations in this regularizer. This contrasts with the unweighted approach, whose test error in general deteriorates for large $m$.
As a toy example, reconsider the mean estimation problem mentioned in the introduction. In this case, the weighted ERM shrinks the mean of the original data towards the mean of the surrogate data. For a given $\alpha$, the resulting test error is
(5) |
and for the optimal value of $\alpha$, this yields
(6) |
Note that the right-hand side is the error of training only on original data multiplied by a prefactor that is always strictly smaller than one. Hence, weighted ERM always achieves better error than training only on original data, regardless of the distance between original and surrogate data, although the improvement is larger when this distance is small. This might seem paradoxical at first. As mentioned above, we are shrinking towards an arbitrary point given by the empirical mean of the surrogate data: how can this help?
In fact, this is a disguised version of the celebrated Stein paradox [EM77, Ste81]: in estimating a Gaussian mean, a procedure that shrinks the empirical mean towards an arbitrary point by a carefully chosen amount outperforms the naive empirical mean. In our toy example, the naive empirical mean corresponds to estimation purely based on the original data, and we shrink it towards the mean of the surrogate data. Of course, the improvement over the empirical mean is only possible if $\alpha$ is chosen optimally. Equation (6) assumes that $\alpha$ is chosen by an oracle that knows the distance between the original and surrogate means. Stein’s analysis implies that in the Gaussian mean problem, $\alpha$ can be chosen empirically as long as the dimension is at least 3. In the settings we are interested in, $\alpha$ can be chosen via cross-validation.
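The shrinkage effect in the toy example can be reproduced with a short calculation. The sketch below is illustrative and makes assumptions that are not legible in this copy: unit-covariance Gaussian data in $d$ dimensions, squared loss, and the estimator $(1-\alpha)\bar z + \alpha \bar z^s$, where $\bar z$ and $\bar z^s$ are the original and surrogate sample means. Under these assumptions the excess error is $(1-\alpha)^2 d/n + \alpha^2(\Delta^2 + d/m)$, where $\Delta$ is the distance between the two true means, and it is minimized at $\alpha_* = (d/n)/(d/n + d/m + \Delta^2)$. Function names are hypothetical.

```python
import numpy as np


def excess_error(alpha, n, m, d, delta_sq):
    """Excess error of (1 - alpha) * mean(z) + alpha * mean(z_s)."""
    return (1 - alpha) ** 2 * d / n + alpha ** 2 * (delta_sq + d / m)


def oracle_alpha(n, m, d, delta_sq):
    """Minimizer of excess_error over alpha (requires knowing delta_sq)."""
    a, b = d / n, delta_sq + d / m
    return a / (a + b)


def simulated_excess_error(alpha, n, m, d, delta_sq, n_trials=2000, seed=0):
    """Monte Carlo check, sampling the two sample means directly."""
    rng = np.random.default_rng(seed)
    theta_s = np.zeros(d)
    theta_s[0] = np.sqrt(delta_sq)  # original mean is the origin
    # The mean of n i.i.d. N(theta, I) vectors is N(theta, I / n).
    z_bar = rng.standard_normal((n_trials, d)) / np.sqrt(n)
    zs_bar = theta_s + rng.standard_normal((n_trials, d)) / np.sqrt(m)
    theta_hat = (1 - alpha) * z_bar + alpha * zs_bar
    return np.mean(np.sum(theta_hat ** 2, axis=1))


if __name__ == "__main__":
    n, m, d = 50, 500, 10
    for delta_sq in (0.1, 4.0, 100.0):
        a = oracle_alpha(n, m, d, delta_sq)
        print(delta_sq, a, excess_error(a, n, m, d, delta_sq), d / n)
```

At the oracle weight the error equals $d/n$ times a factor strictly below one for every finite $\Delta$, with the factor approaching one as $\Delta$ grows.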
3 Theoretical results
3.1 Gaussian sequence model
The sequence model captures the behavior of many models in non-parametric statistics while being simpler to analyze [Tsy09, GN21]. It is also known to approximate the behavior of overparametrized linear regression [CM22]. The unknown target is (with potentially ), and we observe
(7) |
where is also unknown, and are i.i.d. We study the penalized estimator
(8) |
where and is a regularization weight matrix. We will be concerned with the expected risk
(9) |
The proof of the next result is presented in Appendix C.
Theorem 1.
Let be the ordered eigenvalues of , and denote by the corresponding eigenvectors. Further denote by , the projections of , onto , and similarly for , . Assume that , , , , and let be such that (for all ): . Then the following hold:
-
There exists an explicit such that, letting ,
(10) -
If , there exists and there exist satisfying the assumptions in point , such that,
(11)
Note that since the theorem also implies and , this result confirms the scaling law (4).
3.2 Non-parametric regression in Sobolev classes
In this section we consider the classic non-parametric regression model. We assume that for some integer , and the original data are defined through
(12) |
where are independent of and of each other, and equally spaced grid points in the -dimensional unit-cube, i.e. . Surrogate data have a similar distribution, with equally spaced points in the unit cube, and , where . We assume that has small Sobolev norm, that is,
Recall that is a special reproducing kernel Hilbert space (RKHS) norm: we expect some of the considerations below to generalize to other RKHS norms.
Following our general methodology, we use the estimator
(13) |
We are interested in , which is the excess squared loss for a test point .
In order to avoid technical burden we will carry out the analysis for a continuous model, the so-called white noise model, where we observe the function at all points , perturbed by -dimensional white noise:
(14) |
and similarly for . We use an estimator that naturally generalizes (13) to the continuous case. Our results for the white noise model are as follows.
Theorem 2.
Let . If and , then for every there exists a constant such that
(15) |
with high probability, where .
Remark 3.1.
The white noise model (14) is known to be equivalent to the original model (12) (with deterministic equispaced designs) in the sense of Le Cam, for [BL96, Rei08]. While suggestive, this equivalence does not allow us to formally deduce results for the data (12), because it does not apply to the specific estimators of interest here.
3.3 Low-dimensional asymptotics
We study the estimator of Eqs. (2), (3) under the classical asymptotics $n, m \to \infty$ with the dimension fixed. Since this type of analysis is more standard, we defer it to Appendix B. The main result of this analysis is that the scaling law (4) holds in this setting, with the classical parametric exponent $\beta = 1$, for $\alpha$ in a suitable interval. Importantly, this interval includes the optimal choice of the weight $\alpha$.
3.4 High-dimensional linear regression
In this section, we study ridge regression in the high-dimensional regime in which the number of samples is proportional to the number of parameters. Denoting the original data by (with the vector of responses and the matrix of covariates), and the surrogate data by (with and ), we minimize the regularized empirical risk
(16) |
We assume a simple distribution, whereby the rows of , (denoted by , ) are standard normal vectors and
(17) |
for , . Note that the two data distributions differ in the true coefficient vectors versus as well as in the noise variance. We will denote by the ridge estimator, .
The excess test error (for square loss) is given by . The next result characterizes this error in the proportional asymptotics.
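For square loss and a ridge penalty, a weighted empirical risk of this type has a closed-form minimizer. The sketch below assumes the objective $\frac{1-\alpha}{n}\|y - X\theta\|^2 + \frac{\alpha}{m}\|y_s - X_s\theta\|^2 + \lambda\|\theta\|^2$; the exact normalization of Eq. (16) is not legible in this copy, so this form is an assumption. It also uses the fact that, for standard normal covariates, the excess test error under square loss is the squared distance between the estimate and the true coefficient vector. Function names are hypothetical.

```python
import numpy as np


def weighted_ridge(X, y, X_s, y_s, alpha, lam):
    """Closed-form minimizer of the weighted ridge objective.

    Setting the gradient to zero gives the linear system A theta = b with
    A = (1-alpha)/n X^T X + alpha/m X_s^T X_s + lam I and
    b = (1-alpha)/n X^T y + alpha/m X_s^T y_s.
    """
    n, m = len(y), len(y_s)
    d = X.shape[1]
    A = ((1 - alpha) / n) * X.T @ X + (alpha / m) * X_s.T @ X_s + lam * np.eye(d)
    b = ((1 - alpha) / n) * X.T @ y + (alpha / m) * X_s.T @ y_s
    return np.linalg.solve(A, b)


def excess_risk(theta_hat, theta_star):
    """Excess test error for square loss with isotropic Gaussian covariates."""
    return float(np.sum((theta_hat - theta_star) ** 2))
```

With `alpha=0.0` this reduces to ordinary ridge regression on the original data, with the empirical loss normalized by $n$.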
Theorem 3.
Consider the ridge regression estimator . Let , and . Assume such that , , with . (Footnote 2: The same proof, with some additional technical work, yields a characterization for as well. We omit it here for brevity.) For defined in Appendix E.1, let
be the unique minimizer. Then for any , there exist such that, for all
where Further, we can take if .
Remark 3.2 (Optimizing over the validation set).
Note that the concentration of the test error around the theoretical prediction in Theorem 3 is uniform over $\alpha$. This means that we can find the optimal $\alpha$ by computing the estimator over a grid of $\alpha$ values, estimating its test error over the validation set, and choosing the best $\alpha$. The uniform guarantee ensures that this procedure achieves nearly optimal risk.
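The grid-search procedure described in this remark can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the same weighted ridge objective as in the text, with weights $(1-\alpha)/n$ and $\alpha/m$ on the original and surrogate square losses (an assumed normalization), and a validation set drawn from the original distribution. Function names are hypothetical.

```python
import numpy as np


def weighted_ridge(X, y, X_s, y_s, alpha, lam):
    """Closed-form weighted ridge estimator (assumed normalization)."""
    n, m = len(y), len(y_s)
    A = ((1 - alpha) / n) * X.T @ X + (alpha / m) * X_s.T @ X_s \
        + lam * np.eye(X.shape[1])
    b = ((1 - alpha) / n) * X.T @ y + (alpha / m) * X_s.T @ y_s
    return np.linalg.solve(A, b)


def select_alpha(X, y, X_s, y_s, X_val, y_val, alphas, lam):
    """Pick the weight with the smallest validation mean squared error.

    X_val, y_val must come from the original (target) distribution.
    Returns the selected alpha and the validation error at each grid point.
    """
    val_errors = np.array([
        np.mean((y_val - X_val @ weighted_ridge(X, y, X_s, y_s, a, lam)) ** 2)
        for a in alphas
    ])
    return float(alphas[int(np.argmin(val_errors))]), val_errors
```

Including `0.0` in `alphas` guarantees that the selected model is never worse on the validation set than training on original data alone.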
Remark 3.3 (Relation to scaling laws).
An analysis of the equations for reveals that, for large , the predicted excess risk behaves as (for some constants ). This matches the low-dimensional asymptotics and our scaling law (4) with . In practice, we find that, for moderate , the behavior of is better approximated by a different value of (see Appendix A.)
4 Empirical results
In this section, we present experiments validating that the scaling law (4) is a good approximation both for simulated and real-world data. For simulated data, we select two different distributions for the original and surrogate datasets. The test and validation sets are generated from the same distribution as the original dataset. In the case of real-world data, we choose two different datasets as the original and surrogate datasets. We split the original dataset into train, test, and validation sets, while all examples in the surrogate datasets are allocated solely to the train split.
For each dataset and model discussed in this section, we carry out the same experiment: (1) We use models trained on purely original data to fit the scaling curve for original data and obtain estimates of its parameters. (2) We use models trained on purely surrogate data to fit the corresponding scaling curve and obtain estimates of its parameters. Since we assume that the Bayes error is achieved with infinitely many original samples, the Bayes error and the excess risk estimates are derived from these fits, and we use the exponent fitted on the original data. (3) For each combination of $n$ and $m$, we use these estimates (with test errors measured empirically on the test set) to plot the predicted test error as a function of $\alpha$ using the scaling law (4). (4) We then train the model using $n$ original and $m$ surrogate examples, with weights $1-\alpha$ and $\alpha$ for the two datasets, respectively. We average the results of 10 independent runs and compare them against those predicted by the scaling law. For ridge regression, we also compare with the exact high-dimensional asymptotics from Theorem 3.
Let us emphasize that these plots probe the dependence on the hyperparameter $\alpha$. These are much more demanding tests than the usual ones in scaling laws. We generally observe that the scaling law captures well the behavior of the test error for data mixtures.
Binary classification with Gaussian mixture data
This is a simple simulated setting. The original dataset consists of independent and identically distributed examples , , where is uniform over , and , where , . Surrogate data have the same distribution, with a different unit vector . This data distribution is parametrized by and the angle between the original and surrogate parameters, . We use in our experiments. For each , we average the results over 10 independent runs.
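A simulated setting of this kind could be generated as in the sketch below. The exact distribution is not legible in this copy of the text, so the code makes explicit assumptions: labels are uniform over $\{-1, +1\}$, covariates are Gaussian with identity covariance and mean equal to the label times a unit parameter vector, and the surrogate parameter is a unit vector at a prescribed angle `gamma` from the original one. Function names are hypothetical.

```python
import numpy as np


def unit_vector_at_angle(theta, gamma, rng):
    """Return a unit vector at angle gamma (radians) from the unit vector theta."""
    v = rng.standard_normal(theta.shape)
    v -= (v @ theta) * theta          # remove the component along theta
    v /= np.linalg.norm(v)            # random unit direction orthogonal to theta
    return np.cos(gamma) * theta + np.sin(gamma) * v


def sample_mixture(n, theta, rng):
    """Draw n labeled points from a two-component Gaussian mixture.

    Assumed model: y uniform over {-1, +1} and x = y * theta + standard noise.
    """
    y = rng.choice([-1.0, 1.0], size=n)
    x = y[:, None] * theta + rng.standard_normal((n, theta.size))
    return x, y


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 50
    theta = np.zeros(d)
    theta[0] = 1.0
    theta_s = unit_vector_at_angle(theta, np.pi / 6, rng)
    x, y = sample_mixture(1000, theta, rng)        # original data
    x_s, y_s = sample_mixture(5000, theta_s, rng)  # surrogate data
    print(x.shape, x_s.shape, float(theta @ theta_s))
```

The angle `gamma` is the single knob controlling the distribution shift: `gamma = 0` makes the two distributions coincide.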
We use two different models for classification: (1) logistic regression; (2) a one-hidden-layer neural network with 32 hidden ReLU neurons. Results for both models are presented in Appendix A.1.
![Refer to caption](x2.png)
Sentiment analysis in movie reviews
As original data, we use the IMDB dataset, which has 25k reviews for training, each labeled as positive or negative. For validation and testing, we split the IMDB test dataset of 25k reviews into a validation set of 10k reviews and a test set of 15k reviews.
We experiment with two different surrogate datasets: (1) The Rotten Tomatoes dataset of movie reviews: these data have a different distribution but come from the same domain. This dataset contains movie reviews and the corresponding sentiments. (2) Goodreads book reviews: these data come from a substantially different domain. This dataset has reviews and their ratings. We choose 10k reviews each with a rating of 5 and of 1, and label them as positive and negative, respectively.
We convert reviews into 884-dimensional feature vectors, as explained in Appendix A.2. We use logistic regression and neural network models with the same set of parameters as in the Gaussian mixture experiments (except for the input dimension).
Image classification with CIFAR10 and CIFAR100
We use 50,000 CIFAR10 training images as original data, its 10 classes for the classification task, and test on the 10,000 CIFAR10 test images. We use 50,000 CIFAR100 training images as surrogate data. We train a 9-layer ResNet model for classification. Appendix A.3 presents details on the data pre-processing and mapping of labels. Results are shown in Figure 2. Note that the CIFAR10 and CIFAR100 datasets are quite different from each other, as they have no overlap either in the images or in their label sets. Yet, the test error when training on their mixture is well predicted by the scaling law (4).
Lasso-based Cox regression on TCGA PanCancer dataset
We use the public domain TCGA pancancer dataset [GCH+20], with gene expressions as covariates and progression-free survival (PFS) as response. After filtering and feature selection, we are left with 3580 female patients, which we use as original data, and 3640 male patients, which we use as surrogate data. We fit a CoxPHFitter model with 500 selected genes and use “1-concordance score” as our loss function. The results are shown in Figure 3. The details of pre-processing and experiment parameters are in Appendix A.4. (Footnote 3: We observe that training at yields a somewhat singular behavior: we use a as a proxy of , see appendices.)
![Refer to caption](x3.png)
High-dimensional ridge regression
We simulate the data distribution in Section 3.4, i.e., , ; , ; with , , , and fit a simple linear model using ridge regression. The results are shown in Figure 4. In our experiments, we use , , and regularization parameter . Under these settings, the model is parametrized by the angle between and , where . We used and in our experiments. (Footnote 4: For ridge regression simulations, we directly plot the excess test risks, as the parameter for original data is known. For any the excess test risk in this model is simply .)
The theoretical predictions of Theorem 3 for these curves in high-dimensional asymptotics , with , are reported as blue lines, and match remarkably well with the empirical data. The simple scaling law (4) nevertheless provides a good approximation of these (more complicated) theoretical formulas.
Note in particular that in the top row of Figure 4, we have , i.e. the surrogate data are as far as possible from the original ones. Nevertheless, the induced regularization effect leads to smaller test error on the original distribution.
![Refer to caption](x4.png)
We observe that the proposed scaling law (4) predicts the behavior of the experiments well, across all of the datasets above and for most combinations of original and surrogate examples we have tested.
5 Discussion
We conclude by discussing two possible generalizations of the scaling law (4), and its applicability. First, throughout this paper we assumed that we can achieve the Bayes error by training on infinitely many original samples. In practice this will not hold because of the limited model complexity. Following standard scaling laws [KMH+20, HBM+22], this effect can be accounted for by an additional term that depends on the model size (number of parameters). Second, the scaling law (4) implies as special cases the scaling of the test error when training on purely original or purely surrogate data. In particular, the exponent $\beta$ is the same when training on real or surrogate data. In practice, we often observe two somewhat different exponents. In these cases, we use the exponent fitted on the original data (see Section 4), and this appears to work reasonably well. However, we can imagine cases in which the difference between the two exponents is significant enough that (4) stops being accurate.
References
- [AAMM+18] Hassan Abu Alhaija, Siva Karthik Mustikovela, Lars Mescheder, Andreas Geiger, and Carsten Rother, Augmented reality meets computer vision: Efficient data generation for urban driving scenes, International Journal of Computer Vision 126 (2018), 961–972.
- [ANZ22] Ibrahim M Alabdulmohsin, Behnam Neyshabur, and Xiaohua Zhai, Revisiting neural scaling laws in language and vision, Advances in Neural Information Processing Systems 35 (2022), 22300–22312.
- [Bir06] Steven Bird, Nltk: the natural language toolkit, Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, 2006, pp. 69–72.
- [BL96] Lawrence D Brown and Mark G Low, Asymptotic equivalence of nonparametric regression and white noise, The Annals of Statistics 24 (1996), no. 6, 2384–2398.
- [CLCG19] Yuhua Chen, Wen Li, Xiaoran Chen, and Luc Van Gool, Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1841–1850.
- [CM22] Chen Cheng and Andrea Montanari, Dimension free ridge regression, arXiv:2210.08571 (2022).
- [EM77] Bradley Efron and Carl Morris, Stein’s paradox in statistics, Scientific American 236 (1977), no. 5, 119–127.
- [FCK+23] Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina Katabi, Phillip Isola, and Yonglong Tian, Scaling laws of synthetic images for model training… for now, arXiv preprint arXiv:2312.04567 (2023).
- [GCH+20] Mary J Goldman, Brian Craft, Mim Hastie, Kristupas Repečka, Fran McDade, Akhil Kamath, Ayan Banerjee, Yunhai Luo, Dave Rogers, Angela N Brooks, et al., Visualizing and interpreting cancer genomics data via the xena platform, Nature biotechnology 38 (2020), no. 6, 675–678.
- [GN21] Evarist Giné and Richard Nickl, Mathematical foundations of infinite-dimensional statistical models, Cambridge University Press, 2021.
- [Gor85] Yehoram Gordon, Some inequalities for gaussian processes and applications, Israel Journal of Mathematics 50 (1985), no. 4, 265–289.
- [HBM+22] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al., Training compute-optimal large language models, arXiv preprint arXiv:2203.15556 (2022).
- [HKHM21] Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish, Scaling laws for transfer, arXiv preprint arXiv:2102.01293 (2021).
- [HKK+20] Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al., Scaling laws for autoregressive generative modeling, arXiv preprint arXiv:2010.14701 (2020).
- [HNA+17] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou, Deep learning scaling is predictable, empirically, arXiv preprint arXiv:1712.00409 (2017).
- [HNK+22] Xuanli He, Islam Nassar, Jamie Kiros, Gholamreza Haffari, and Mohammad Norouzi, Generate, annotate, and learn: Nlp with synthetic text, Transactions of the Association for Computational Linguistics 10 (2022), 826–842.
- [HSY+22] Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing Zhang, Philip Torr, Song Bai, and Xiaojuan Qi, Is synthetic data from generative models ready for image recognition?, arXiv preprint arXiv:2210.07574 (2022).
- [JRBM+17] Matthew Johnson-Roberson, Charles Barto, Rounak Mehta, Sharath Nittur Sridhar, Karl Rosaen, and Ram Vasudevan, Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks?, 2017 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2017, pp. 746–753.
- [KMH+20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei, Scaling laws for neural language models, arXiv preprint arXiv:2001.08361 (2020).
- [MHZH22] Yu Meng, Jiaxin Huang, Yu Zhang, and Jiawei Han, Generating training data with language models: Towards zero-shot language understanding, Advances in Neural Information Processing Systems 35 (2022), 462–477.
- [MM21] Léo Miolane and Andrea Montanari, The distribution of the lasso: Uniform control over sparse balls and adaptive parameter tuning, The Annals of Statistics 49 (2021), no. 4, 2313–2335.
- [MPRP16] Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes, The benefit of multitask representation learning, Journal of Machine Learning Research 17 (2016), no. 81, 1–32.
- [MPT+22] Arthur Moreau, Nathan Piasco, Dzmitry Tsishkou, Bogdan Stanciulescu, and Arnaud de La Fortelle, Lens: Localization enhanced by nerf synthesis, Conference on Robot Learning, PMLR, 2022, pp. 1347–1356.
- [MRB+23] Niklas Muennighoff, Alexander M Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel, Scaling data-constrained language models, arXiv preprint arXiv:2305.16264 (2023).
- [Rei08] Markus Reiß, Asymptotic equivalence for nonparametric regression with multivariate and random design, The Annals of Statistics (2008), 1957–1982.
- [RRBS19] Jonathan S Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit, A constructive prediction of the generalization error across scales, International Conference on Learning Representations, 2019.
- [RSM+16] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez, The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes, Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3234–3243.
- [SLW20] Viktor Seib, Benjamin Lange, and Stefan Wirtz, Mixing real and synthetic data to enhance neural network training–a review of current approaches, arXiv preprint arXiv:2007.08781 (2020).
- [Ste81] Charles M Stein, Estimation of the mean of a multivariate normal distribution, The annals of Statistics (1981), 1135–1151.
- [TAH18] Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi, Precise error analysis of regularized M-estimators in high dimensions, IEEE Transactions on Information Theory 64 (2018), no. 8, 5592–5628.
- [TDR+21] Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, and Donald Metzler, Scale efficiently: Insights from pretraining and finetuning transformers, International Conference on Learning Representations, 2021.
- [TJJ20] Nilesh Tripuraneni, Michael Jordan, and Chi Jin, On the theory of transfer learning: The importance of task diversity, Advances in neural information processing systems 33 (2020), 7852–7862.
- [TOH15] Christos Thrampoulidis, Samet Oymak, and Babak Hassibi, Regularized linear regression: A precise analysis of the estimation error, Proceedings of Machine Learning Research 40 (2015), 1683–1709.
- [TPA+18] Jonathan Tremblay, Aayush Prakash, David Acuna, Mark Brophy, Varun Jampani, Cem Anil, Thang To, Eric Cameracci, Shaad Boochoon, and Stan Birchfield, Training deep networks with synthetic data: Bridging the reality gap by domain randomization, Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 969–977.
- [Tsy09] Alexandre B. Tsybakov, Introduction to nonparametric estimation, Springer, 2009.
- [vdV00] Aad W van der Vaart, Asymptotic statistics, Cambridge University Press, 2000.
- [Ver18] Roman Vershynin, High-dimensional probability: An introduction with applications in data science, vol. 47, Cambridge university press, 2018.
- [YCFB+22] Lin Yen-Chen, Pete Florence, Jonathan T Barron, Tsung-Yi Lin, Alberto Rodriguez, and Phillip Isola, Nerf-supervision: Learning dense object descriptors from neural radiance fields, 2022 International Conference on Robotics and Automation (ICRA), IEEE, 2022, pp. 6496–6503.
Appendix A Details of empirical results
A.1 Binary classification with Gaussian mixture data
![Refer to caption](x5.png)
![Refer to caption](x6.png)
We provide details for the models used in the simulations of Section A.1.
Logistic regression: We use the scikit-learn implementation with the lbfgs solver, fitting the intercept, with maximum iterations set to 10k. For each run of each combination, we set the penalty (parameter C in scikit-learn) to and , and only report the test result for the value that achieves the best validation error. The results of the individual scaling law estimates and the comparison of joint training results with the scaling law predictions are shown in Figures 5 and 6.
![Refer to caption](x7.png)
![Refer to caption](x8.png)
Neural network: The network has one hidden layer with 32 ReLU neurons, and an output neuron using sigmoid. For training, we use the binary cross entropy loss, a constant learning rate of 0.05, and batch size 64. We train the network for 1,000 epochs. Similar to the procedure in logistic regression, we use regularization (weight decay) and use the validation set to choose the best regularization parameter from the set . The results of the individual scaling law estimates and the comparison of joint training results with the scaling law predictions are shown in Figures 7 and 8.
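For concreteness, the architecture and training loop described above can be sketched in plain NumPy. This is an illustrative sketch under assumptions: we use plain minibatch SGD (the optimizer is not named in the text), a hypothetical weight-decay value, optional per-sample weights `w` to accommodate a weighted loss, and far fewer epochs than the 1,000 used in the experiments.

```python
import numpy as np


def sigmoid(z):
    # Clip logits to avoid overflow in exp.
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))


def predict_proba(params, X):
    W1, b1, W2, b2 = params
    h = np.maximum(X @ W1 + b1, 0.0)  # 32 ReLU hidden units
    return sigmoid(h @ W2 + b2)       # single sigmoid output


def train_mlp(X, y, w=None, hidden=32, lr=0.05, batch=64, epochs=1000,
              weight_decay=1e-4, seed=0):
    """One-hidden-layer network trained on (weighted) binary cross entropy."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.ones(n) if w is None else np.asarray(w, dtype=float)
    W1 = rng.normal(scale=1.0 / np.sqrt(d), size=(d, hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=1.0 / np.sqrt(hidden), size=hidden)
    b2 = 0.0
    for _ in range(epochs):
        perm = rng.permutation(n)
        for start in range(0, n, batch):
            idx = perm[start:start + batch]
            xb, yb, wb = X[idx], y[idx], w[idx]
            h = np.maximum(xb @ W1 + b1, 0.0)
            p = sigmoid(h @ W2 + b2)
            # Gradient of the weighted mean BCE with respect to the logits.
            g = wb * (p - yb) / len(idx)
            gW2 = h.T @ g + weight_decay * W2
            gb2 = g.sum()
            gh = np.outer(g, W2) * (h > 0)  # backpropagate through ReLU
            gW1 = xb.T @ gh + weight_decay * W1
            gb1 = gh.sum(axis=0)
            W1 -= lr * gW1
            b1 -= lr * gb1
            W2 -= lr * gW2
            b2 -= lr * gb2
    return W1, b1, W2, b2


# Hypothetical toy data: Gaussian mixture with well-separated class means.
rng = np.random.default_rng(1)
y_toy = rng.integers(0, 2, size=200).astype(float)
mu = np.array([2.0, 0.0, 0.0, 0.0, 0.0])
X_toy = rng.normal(size=(200, 5)) + np.outer(2 * y_toy - 1, mu)

params = train_mlp(X_toy, y_toy, epochs=200)  # shortened run for illustration
train_acc = np.mean((predict_proba(params, X_toy) > 0.5) == (y_toy == 1))
print(f"training accuracy: {train_acc:.3f}")
```

As in the logistic regression case, the weight-decay value would be chosen on the validation set from a small grid.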
A.2 Sentiment analysis in movie reviews
![Refer to caption](x9.png)
![Refer to caption](x10.png)
![Refer to caption](x11.png)
![Refer to caption](x12.png)
![Refer to caption](x13.png)
To convert the movie reviews and book reviews to vectors, we combine two embeddings. First, we take all the reviews in the training data and use the nltk tagger [Bir06] to find the 500 most frequent adjectives appearing in the samples used for training. We then apply a tf-idf vectorizer (scikit-learn’s implementation) with these 500 adjectives as its vocabulary, which gives a 500-dimensional vector for each review. Second, we apply the “Paraphrase-MiniLM-L6-v2” sentence transformer, which is based on BERT with 6 Transformer encoder layers and returns a 384-dimensional vector representation of each review. For each review we concatenate the outputs of the tf-idf vectorizer and the sentence transformer to obtain an 884-dimensional representation, which we use as our input vector.
We use logistic regression and neural networks with the same set of parameters as in the Gaussian mixture experiments (except for the input dimension). We plot the average loss over 10 independent runs.
A.3 Image classification with CIFAR10 and CIFAR100
![Refer to caption](x14.png)
We largely use the model and the training procedure described at https://jovian.ml/aakashns/05b-cifar10-resnet. We normalize the images for mean and standard deviation. We train a 9-layer ResNet model for classification, using Adam for optimization, weight decay, and gradient clipping, trained over 16 epochs with a one-cycle learning rate scheduling policy, minimizing cross entropy loss. For each combination of , , and , we report the average test error over runs. Since there is no overlap between the label sets of CIFAR10 and CIFAR100, the latter dataset needs to be relabeled. We do this by training a separate 9-layer ResNet model on 10,000 randomly chosen CIFAR10 images from the training set of 50,000 examples (without creating a separate split for them), and use its predictions on CIFAR100 images as labels.
A.4 Lasso on TCGA PanCancer dataset
We used the public-domain TCGA PanCancer dataset. After filtering out samples with incomplete values, we are left with 9220 patients, each having 20,531 gene expression values; the outcome is PFS (progression-free survival). Out of these, we used a group of 2000 patients, split into a train set and a test set of 1000 each, to select the 500 genes with the largest absolute Cox PH score. We also used the mean and standard deviation of the gene expression values of these 2000 patients to normalize the gene expression columns for the remaining 7220 patients. Among the remaining 7220 patients, 3580 were female. We treated the female patients’ data as original data, and split them into train (50), test (25) and validation (25) splits. The data of the remaining 3640 patients was used as the surrogate dataset. We fit a CoxPHFitter model with the 500 selected genes and use “1-concordance score” as our loss function. We used the validation split to choose the best value of the penalty parameter from in the model. We observed a discontinuity at . To avoid this discontinuity, we approximated by if and by if , where we choose . We plot the average loss over 10 independent runs. The results are presented in Figures 15 and 3.
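To make the “1-concordance score” loss concrete, here is a small standalone sketch of Harrell's concordance index for right-censored data. It is an illustration only: the function names are hypothetical, the inputs are toy values, and the tie-handling convention may differ from the one used by the library in the experiments.

```python
import numpy as np


def concordance_index(time, event, risk):
    """Harrell's C: fraction of comparable pairs ordered correctly by risk.

    A pair (i, j) is comparable when subject i has an observed event and
    time[i] < time[j]. It is concordant when the subject who fails earlier
    has the higher predicted risk. Ties in risk count as 1/2.
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    risk = np.asarray(risk, dtype=float)
    concordant, comparable = 0.0, 0
    for i in range(len(time)):
        if not event[i]:
            continue  # a censored subject cannot be the earlier member of a pair
        for j in range(len(time)):
            if time[i] < time[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1.0
                elif risk[i] == risk[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


def one_minus_concordance(time, event, risk):
    """Loss used for model comparison: lower is better."""
    return 1.0 - concordance_index(time, event, risk)


# Toy example: four patients, the third one censored.
times = [1.0, 2.0, 3.0, 4.0]
events = [1, 1, 0, 1]
loss_perfect = one_minus_concordance(times, events, [4.0, 3.0, 2.0, 1.0])
loss_reversed = one_minus_concordance(times, events, [1.0, 2.0, 3.0, 4.0])
loss_constant = one_minus_concordance(times, events, [0.0, 0.0, 0.0, 0.0])
print(loss_perfect, loss_reversed, loss_constant)
```

With this convention a loss of 0.5 corresponds to an uninformative risk score, so values below 0.5 indicate a useful model.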
![Refer to caption](x15.png)
A.5 High-dimensional ridge regression
We present additional ridge regression experiments here in Figs. 16–27. We plot the average loss over 10 independent runs. In these experiments, as in the main paper, we set , , , , except for the last four Figs. 24–27, where we use . We used angle and in our experiments.
We consider two methods: (i) fix to a very small value ; (ii) for each random draw of datasets, select that achieves the best validation performance. For the latter method, we try , where . For ridge regression simulations, we directly plot the excess test risk, since the parameter for the original data is known and, for any , the excess test risk in this model is .
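The two regularization strategies can be sketched with a closed-form weighted ridge estimator. The sketch below rests on assumptions that are ours rather than quoted from the text: isotropic Gaussian covariates (so that the excess risk is the squared distance to the true parameter), a surrogate parameter at angle `gamma` from the original one, weights `(1 - alpha) / n` and `alpha / m` on the two empirical risks, and hypothetical values for all sizes and for the grid `LAM_GRID`.

```python
import numpy as np


def weighted_ridge(X, y, Xs, ys, alpha, lam):
    """Minimise (1-alpha)/n |y - X t|^2 + alpha/m |ys - Xs t|^2 + lam |t|^2."""
    n, d = X.shape
    m = Xs.shape[0]
    A = (1 - alpha) / n * (X.T @ X) + alpha / m * (Xs.T @ Xs) + lam * np.eye(d)
    b = (1 - alpha) / n * (X.T @ y) + alpha / m * (Xs.T @ ys)
    return np.linalg.solve(A, b)


def excess_risk(theta_hat, theta_star):
    # Valid for isotropic covariates (assumption): E<x, t - t*>^2 = |t - t*|^2.
    return float(np.sum((theta_hat - theta_star) ** 2))


rng = np.random.default_rng(0)
d, n, m, gamma, sigma = 20, 30, 100, np.pi / 6, 0.5  # hypothetical values
theta_star = np.zeros(d)
theta_star[0] = 1.0
theta_su = np.zeros(d)
theta_su[0], theta_su[1] = np.cos(gamma), np.sin(gamma)


def draw(k, theta):
    X = rng.normal(size=(k, d))
    return X, X @ theta + sigma * rng.normal(size=k)


X, y = draw(n, theta_star)      # original training data
Xs, ys = draw(m, theta_su)      # surrogate training data
Xv, yv = draw(n, theta_star)    # validation data from the original distribution

LAM_GRID = [1e-3, 1e-2, 1e-1, 1.0]  # hypothetical grid


def select_lambda(alpha):
    """Method (ii): pick lam with the smallest validation squared error."""
    errs = [np.mean((yv - Xv @ weighted_ridge(X, y, Xs, ys, alpha, lam)) ** 2)
            for lam in LAM_GRID]
    return LAM_GRID[int(np.argmin(errs))]


alpha = 0.5
lam_best = select_lambda(alpha)
risk_tuned = excess_risk(weighted_ridge(X, y, Xs, ys, alpha, lam_best), theta_star)
risk_small = excess_risk(weighted_ridge(X, y, Xs, ys, alpha, 1e-6), theta_star)
print(f"lam={lam_best}: tuned {risk_tuned:.3f}, near-ridgeless {risk_small:.3f}")
```

Setting `alpha = 0` recovers ordinary ridge regression on the original data alone, which is a convenient sanity check for the implementation.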
![Refer to caption](x16.png)
![Refer to caption](x17.png)
![Refer to caption](x18.png)
![Refer to caption](x19.png)
![Refer to caption](x20.png)
![Refer to caption](x21.png)
![Refer to caption](x22.png)
![Refer to caption](x23.png)
![Refer to caption](x24.png)
![Refer to caption](x25.png)
![Refer to caption](x26.png)
![Refer to caption](x27.png)
Appendix B Low-dimensional asymptotics
B.1 Formal statements
In this appendix, we present our results on the estimator of Eqs. (2), (3) under the classical asymptotics at fixed. For simplicity, we assume no regularizer is used in this regime.
Beyond the classical regularity assumptions of low-dimensional asymptotics, in this section we make the following assumption, which guarantees that the original and surrogate distributions are ‘not arbitrarily far.’ Recall that denotes the population error on surrogate data.
Assumption 1 (Distribution shift for low- asymptotics).
There exists a constant such that for all ,
(18) |
The regularity conditions are similar to the ones in [vdV00]. Here and in the following is the ball of radius centered at .
Assumption 2 (‘Classical’ regularity).
-
The original population risk is uniquely minimized at a point .
-
is non-negative lower semicontinuous. Further, define the following limit in for :
(19) Then we assume for some .
-
is differentiable at almost surely, both under and under . Further, there exists such that, letting , the following holds for a constant :
(20) -
The functions , , are twice differentiable in a neighborhood of , with Lipschitz continuous Hessian. Further (strictly positive definite).
Proposition B.1.
Under Assumption 1 and Assumption 2, define the following matrices
(21) | ||||
(22) | ||||
(23) |
where , denote the covariances, respectively, with respect to the original data (i.e., with respect to ), and with respect to the surrogate data (i.e., with respect to ). Further define the -dimensional vector
(24) |
Then there exists (depending only on the constants in the assumptions) such that, for all , the excess risk of the estimator satisfies (for bounded by a constant)
(25) | ||||
(Here the big hides dependence on the constants in Assumptions 1 and 2.)
Remark B.1.
For economy of notation we stated Proposition B.1 in the case in which the excess risk is measured by using the same loss as for training, i.e. . However the same result Eq. (25) applies with minor modifications to the case (and thus, with replaced by ), provided is also twice differentiable with Lipschitz Hessian, and . In this case, (25) has to be modified replacing by .
Remark B.2.
The error terms in Eq. (25) are negligible under two conditions: (i) and are large, which is the classical condition for low-dimensional asymptotics to hold; (ii) is small. In particular, the latter condition will hold in two cases. First, when is of order one (i.e. the distribution shift is large), but is small (surrogate data are downweighted). Note that, when the distribution shift is large, and the sample size is large enough, we expect small to be optimal and therefore Eq. (25) covers the ‘interesting’ regime.
Second, when is small (i.e. the shift is small) and is of order one. If in addition we have , it can be shown that the range of validity of Eq. (25) covers the whole interval .
Remark B.3.
Note that the distribution shift is measured in Eq. (25) by the first term . The original and surrogate distributions can be very different in other metrics (e.g. in total variation or transportation distance), but as long as is small (as measured in the norm defined by ), surrogate data will reduce test error.
Note that, within the setting of Proposition B.1, the excess error of training only on original data is , while . Hence Eq. (25) can be recast in the form of our general scaling law (4), namely:
which (as expected) corresponds to the parametric scaling exponent .
An immediate consequence of Proposition B.1 is that surrogate data do not hurt, and will help if their distribution is close enough to the original one (under the assumption of optimally chosen ).
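A short calculation illustrates why optimally weighted surrogate data cannot hurt. The sketch below assumes (this is our simplified reading of the parametric case, not a formula quoted from the text) that the excess risk is the quadratic `alpha**2 * r_su + (1 - alpha)**2 * r_or`, where `r_or` stands for the excess risk of training on the original data alone and `r_su` for the excess risk, on the original distribution, of training on the surrogate data alone. Both names and the numerical values are hypothetical.

```python
def predicted_excess_risk(alpha, r_or, r_su):
    """Assumed quadratic form of the excess risk in the parametric case."""
    return alpha ** 2 * r_su + (1 - alpha) ** 2 * r_or


def optimal_alpha(r_or, r_su):
    """Closed-form minimiser of the quadratic over alpha in [0, 1].

    Setting the derivative 2*alpha*r_su - 2*(1-alpha)*r_or to zero gives
    alpha* = r_or / (r_or + r_su), with minimum r_or*r_su / (r_or + r_su).
    """
    alpha_star = r_or / (r_or + r_su)
    value = r_or * r_su / (r_or + r_su)
    return alpha_star, value


r_or, r_su = 0.2, 0.6  # hypothetical excess risks
alpha_star, best = optimal_alpha(r_or, r_su)

# Cross-check the closed form against a fine grid search.
grid = [k / 1000 for k in range(1001)]
grid_best = min(predicted_excess_risk(a, r_or, r_su) for a in grid)
print(alpha_star, best, grid_best)
```

Under this assumed form the minimum equals `r_or * r_su / (r_or + r_su)`, which never exceeds `r_or`, and the improvement is larger the smaller `r_su` is, i.e. the closer the surrogate distribution is to the original one.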
B.2 Proofs
Lemma B.3.
Proof.
Fix By Assumption 2., for some constant . Hence, using Assumption 1, for any
On the other hand , whence
which is strictly positive for . Hence the minimum must be achieved in (note that since , are lower semicontinuous, the minimum is achieved).
By Assumption 2., for sufficiently small, is strictly convex in and therefore the minimizer is unique. This proves point .
Point follows from a modification of Theorem 5.14 in [vdV00]. Namely, for a diverging sequence , we consider to , where . This function is lower semicontinuous on the compact set and converges almost surely to its expectation for every fixed in this set, and hence the argument of Theorem 5.14 [vdV00] applies here. ∎
Proof of Proposition B.1.
By a modification of Theorem 5.39 in [vdV00] (here is defined as in Lemma B.3)
(26) | ||||
(27) |
where . Note that in the present setting the error is of order because we assume the Hessian to be Lipschitz continuous.
The population minimizer solves
where . Denoting by the Lipschitz constant of the Hessian (in operator norm), and recalling that , we have
Recalling that, by Lemma B.3, as , this implies
(28) |
Substituting in Eq. (26), we get
(29) | ||||
(30) | ||||
(31) |
The claim follows by substituting the above in
(32) |
and using . ∎
Appendix C Gaussian sequence model: Proofs for Section 3.1
C.1 General ridge regression
We define , , and . We then have
(33) | ||||
(34) | ||||
(35) | ||||
(36) |
C.2 Proof of Theorem 1
Without loss of generality, we can assume with non-decreasing. A simple calculation gives the following general expression for the test error:
(37) | ||||
(38) | ||||
(39) | ||||
(40) |
We define (with if the condition is never verified)
(41) |
Note that
(42) | ||||
(43) |
We now estimate various sums by breaking them by the value of
and
since under the assumption , , we have .
Recalling the definitions in the theorem, and letting
we have
whence
Next we specialize to the case , . In this case we have , and therefore, by suitably adjusting the constant
We now bound . By Cauchy-Schwarz and monotonicity of ,
and further
(44) |
Therefore,
Proof of claim . The stated assumptions on imply that (possibly after adjusting the constant ):
We now select so that where (this is possible for all large enough under the assumption on ). A straightforward calculation yields:
which proves claim .
Proof of Claim . We choose , , , with . We will choose a sufficiently small numerical constant. Note that, for
where, for any , the first inequality holds with probability at least for all . We can therefore select the , so that for some .
Following the calculation at point , we decompose the bias term as
Note that . Therefore
By a similar calculation, we also obtain
and therefore
The proof is completed by minimizing over .
Appendix D Analysis of the nonparametric model: Proofs for Section 3.2
This appendix is devoted to proving Theorem 2. Recall that this is established within the white noise model of Eq. (14), which we copy here for the readers’ convenience
(45) |
The adaptation of the estimator (13) to this continuous setting is given explicitly below
(46) |
The proof of Theorem 2 is based on a reduction to a suitable ‘sequence model’ via the Fourier transform, defined as
(47) |
for , where . The inverse Fourier transform is defined as
(48) |
We let , , and respectively denote the Fourier transform of , , and .
The Fourier transforms of the observations are given by
(49) |
where and are i.i.d. standard Gaussian. It then follows that
(50) |
where we abuse the notation to define
(51) |
with . Minimizing (50) we get
(52) |
Taking the inverse Fourier transform and plugging it into the excess risk formula we get
(53) |
where
(54) |
The convexity of implies
(55) |
for and therefore we can upper bound the first sum in (53) by taking for any , which yields
(56) |
D.1 Proof of Theorem 2
We now upper bound the first sum above. We note that, defining via (with an abuse of notation ), whence for all :
where in we used the fact that . Letting be constants depending on , we have
For convergence we require , in which case
(57) |
Appendix E Analysis of high-dimensional regression: Proofs for Section 3.4
E.1 Auxiliary definition for Theorem 3
Our characterization is given in terms of a variational principle. For , define via
(58) | ||||
where are defined by
(59) | ||||
(60) |
and , , with solving the polynomial equation
(61) |
and is given by
(62) |
Theorem 3 states that the asymptotics of the test error is determined by the minimizer of .
E.2 Proof of Theorem 3
The proof is based on Gordon’s Gaussian comparison inequality [Gor85, Ver18], and follows a standard route, see e.g. [TOH15, TAH18, MM21]. We will limit ourselves to outlining the main steps of the calculation. Throughout, we consider the case , because the other one ( and ) is analogous and less interesting.
We begin by rewriting the ridge cost function in terms of a Lagrangian
(63) | ||||
(64) | ||||
Let , where are independent standard normal random variables, independent of . By Gordon’s inequality [Gor85], we can compare the Gaussian process to
(65) | ||||
Next we define the orthonormal vectors
(66) |
where is the projector orthogonal to . We then decompose
(67) |
where , and define . Defining , , Eq. (60) follows.
With these notations, and letting , , we get
(68) | ||||
(69) | ||||
We finally decompose where and , and similarly for , and define
(70) |
Defining via , we obtain
(71) | ||||
where is the contribution of the perpendicular components. Simple concentration estimates imply that for any there exist such that
(72) | ||||
(73) | ||||
(74) |
We can then estimate by
(75) | ||||
Differentiating with respect to and and setting the derivatives to yields , , with given by Eqs. (61), (62). By computing second derivatives, one obtains that this is a local maximum. Since as , the maximum over is either achieved at this point or at the boundary . By checking the signs of partial derivatives along this boundary, the only other possibility is .
For economy of notation, write . For any unit vector , the directional derivative is
By maximizing over the direction, we see that can be chosen so that . Hence cannot be the global maximum for .
Hence, we get
(76) |
We further note that, for fixed , the function is jointly strictly convex for . Hence is also strictly convex for . Therefore, it has a unique minimizer, which we denote by . Proceeding as in [MM21], we obtain the following result.
Proposition E.1.
Under the assumptions of Proposition 3, for any there exists such that, if (letting )
(77) |
In particular, the last proposition implies (a weaker form of) Theorem 3 whereby the supremum is taken over a finite net. Namely for , we define
Recalling that, in the present case, , we obtain (after adjusting the constant ):
(78) |
Finally, let be the matrix obtained by stacking and . Given constants , define the good event
(79) |
By a standard bound on eigenvalues of Wishart matrices [Ver18], for , we can choose such that
(80) |
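The kind of eigenvalue control invoked here can be checked numerically. The sketch below is not the event of Eq. (79) itself (whose exact form is not reproduced here); it illustrates the standard non-asymptotic bound for an n-by-d matrix with i.i.d. standard Gaussian entries, whose singular values lie in the interval from sqrt(n) - sqrt(d) - t to sqrt(n) + sqrt(d) + t with probability at least 1 - 2 exp(-t^2 / 2). The sizes `n`, `d` and the slack `t` are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, t = 400, 100, 5.0  # hypothetical sizes and slack

Z = rng.normal(size=(n, d))             # i.i.d. standard Gaussian entries
s = np.linalg.svd(Z, compute_uv=False)  # singular values of Z

lower = np.sqrt(n) - np.sqrt(d) - t
upper = np.sqrt(n) + np.sqrt(d) + t
fail_prob = 2 * np.exp(-t ** 2 / 2)     # bound on the failure probability

# Eigenvalues of the sample covariance Z^T Z / n are the squared singular
# values divided by n, so the same event bounds them away from 0 and infinity.
eig_min, eig_max = s.min() ** 2 / n, s.max() ** 2 / n
print(f"singular values in [{s.min():.2f}, {s.max():.2f}], "
      f"bound [{lower:.2f}, {upper:.2f}], failure prob <= {fail_prob:.1e}")
```

On such an event the regularized empirical Hessian is well conditioned, which is what makes the estimator bounded and Lipschitz in the weighting parameter.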
Further on , is bounded (in norm, and Lipschitz continuous in ). As a consequence, for a sufficiently large constant ,
(81) |
The claim follows by using this estimate together with Eq. (78).