Scaling laws for learning with real and surrogate data

Ayush Jain^∗, Andrea Montanari, Eren Sasoglu^∗

Granica Computing Inc. (granica.ai); Department of Statistics and Department of Mathematics, Stanford University

Abstract

Collecting large quantities of high-quality data can be prohibitively expensive or impractical, and is a bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources, e.g. data collected under different circumstances or synthesized by generative models. We refer to such data as ‘surrogate data.’ We introduce a weighted empirical risk minimization (ERM) approach for integrating surrogate data into training. We analyze this method mathematically under several classical statistical models, and validate our findings empirically on datasets from different domains. Our main findings are: $(i)$ Integrating surrogate data can significantly reduce the test error on the original distribution. Surprisingly, this can happen even when the surrogate data is unrelated to the original data. We trace this behavior back to the classical Stein’s paradox. $(ii)$ In order to reap the benefit of surrogate data, it is crucial to use optimally weighted ERM. $(iii)$ The test error of models trained on mixtures of real and surrogate data is approximately described by a scaling law. This scaling law can be used to predict the optimal weighting scheme, and to choose the amount of surrogate data to add.

1 Introduction and overview

1.1 Motivation and formulation

Consider a standard learning setting where we are given $n$ i.i.d. points ${\boldsymbol{z}}_{i}$ from a target distribution $\mathcal{D}$ . Given a family of rich parametric models governed by the parameter ${\boldsymbol{\theta}}$ , the goal is to find the parameter ${\boldsymbol{\theta}}$ that minimizes the expected test loss $R_{\mbox{\tiny\rm test}}({\boldsymbol{\theta}})$ between the model predictions for a chosen ${\boldsymbol{\theta}}$ and the data generated from the distribution $\mathcal{D}$ . In many application domains, the available data ${\boldsymbol{Z}}=({\boldsymbol{z}}_{i})_{i\leq n}$ from the target distribution, referred to as either real or original data, may be difficult or expensive to acquire. One may then attempt to supplement these data with a different, cheaper source. Examples of such cheaper sources are $(i)$ publicly available datasets; $(ii)$ datasets owned by the same research group or company but acquired in different circumstances, e.g. in a different location; $(iii)$ synthetic data produced by a generative model.

We will denote the data points obtained from this source by ${\boldsymbol{z}}^{s}_{i}$ , and assume we have $m$ of them. To formalize, assume the ‘surrogate’ data ${\boldsymbol{Z}}^{s}=({\boldsymbol{z}}_{i}^{s})_{i\leq m}$ is a collection of i.i.d. samples from some distribution $\mathcal{D}^{s}$ . In general, we will not assume the distribution $\mathcal{D}^{s}$ of surrogate data to be close to the original data distribution $\mathcal{D}$ ; however, we assume that these distributions are over the same domain. A number of questions arise: $(i)$ How should we use the surrogate data in training? $(ii)$ How many surrogate samples should we add to the original data? $(iii)$ Can we predict the improvement in test error achieved by adding surrogate samples to the training?

A natural approach would be to add the surrogate data to the original data in the usual training procedure, and indeed many authors have explored this approach (see Section 1.3). Namely, one attempts to minimize the overall empirical risk $\widehat{R}_{n+m}^{\mbox{\tiny\rm naive}}({\boldsymbol{\theta}})=\sum_{i=1}^{n}\ell({\boldsymbol{\theta}};{\boldsymbol{z}}_{i})+\sum_{i=1}^{m}\ell({\boldsymbol{\theta}};{\boldsymbol{z}}^{s}_{i})$ , where $\ell({\boldsymbol{\theta}};{\boldsymbol{z}})$ is a training loss function.

However, a moment of reflection reveals that this approach has serious shortcomings. Consider a simple mean estimation problem, whereby ${\boldsymbol{z}}_{i}\sim{\sf N}({\boldsymbol{\theta}}_{*},{\boldsymbol{I}}_{d})$ , ${\boldsymbol{z}}^{s}_{i}\sim{\sf N}({\boldsymbol{\theta}}^{s}_{*},{\boldsymbol{I}}_{d})$ , $\ell({\boldsymbol{\theta}};{\boldsymbol{z}})=\|{\boldsymbol{\theta}}-{\boldsymbol{z}}\|^{2}$ , and $R_{\mbox{\tiny\rm test}}({{\boldsymbol{\theta}}})=\|{\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{*}\|^{2}$ . A straightforward calculation yields that the (expected) test error of the empirical risk minimizer $\hat{\boldsymbol{\theta}}_{n+m}^{\mbox{\tiny\rm naive}}:=\arg\min\widehat{R}_{n+m}^{\mbox{\tiny\rm naive}}({\boldsymbol{\theta}})$ is

\displaystyle R_{\mbox{\tiny\rm test}}(\hat{\boldsymbol{\theta}}_{n+m}^{\mbox{\tiny\rm naive}})=\left(\frac{m}{n+m}\right)^{2}\|{\boldsymbol{\theta}}^{s}_{*}-{\boldsymbol{\theta}}_{*}\|^{2}+\frac{d}{n+m}\,\,.

(1)

As $m$ increases, the variance (the second term) decreases, but the bias due to the difference $\|{\boldsymbol{\theta}}^{s}_{*}-{\boldsymbol{\theta}}_{*}\|$ increases, and the error approaches $\|{\boldsymbol{\theta}}^{s}_{*}-{\boldsymbol{\theta}}_{*}\|^{2}$ , i.e. the model is only as good as one trained on surrogate data alone.

In order to overcome these limitations, we propose a weighted ERM approach, and will show that the weight plays a crucial role. Namely, we consider the following regularized empirical risk:

\displaystyle\widehat{R}_{n,m}({\boldsymbol{\theta}};\alpha):=\frac{1-\alpha}{n}\sum_{i=1}^{n}\ell({\boldsymbol{\theta}};{\boldsymbol{z}}_{i})+\frac{\alpha}{m}\sum_{i=1}^{m}\ell({\boldsymbol{\theta}};{\boldsymbol{z}}^{s}_{i})+\Omega({\boldsymbol{\theta}})\,,

(2)

where $\alpha\in[0,1]$ is the weight of the surrogate dataset and $\Omega:\mathbb{R}^{d}\to\mathbb{R}_{\geq 0}$ is a regularizer, e.g. a ridge $\Omega({\boldsymbol{\theta}})=\lambda\|{\boldsymbol{\theta}}\|^{2}_{2}$ . We denote by

\displaystyle\hat{\boldsymbol{\theta}}_{n,m}(\alpha):=\arg\min_{{\boldsymbol{\theta}}}\widehat{R}_{n,m}({\boldsymbol{\theta}};\alpha)

(3)

the corresponding empirical risk minimizer. The resulting test error is $R_{\mbox{\tiny\rm test}}(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))$ .
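As a minimal sketch of the weighted objective in Eq. (2), the snippet below evaluates the weighted empirical risk for a generic per-sample loss and gives the closed-form minimizer in the simplest case of a scalar parameter, squared loss and a ridge penalty. The function names and the scalar setting are illustrative assumptions, not part of the method's definition.

```python
from statistics import fmean


def weighted_risk(theta, real, surrogate, alpha, loss, reg=lambda t: 0.0):
    """Weighted empirical risk in the form of Eq. (2).

    (1 - alpha) * mean loss on real data + alpha * mean loss on surrogate
    data + regularizer. `loss(theta, z)` and `reg(theta)` are user-supplied.
    """
    real_term = fmean(loss(theta, z) for z in real)
    surrogate_term = fmean(loss(theta, z) for z in surrogate)
    return (1.0 - alpha) * real_term + alpha * surrogate_term + reg(theta)


def weighted_mean_minimizer(real, surrogate, alpha, lam=0.0):
    """Closed-form minimizer for scalar squared loss with penalty lam * theta**2.

    Setting the derivative of the objective to zero gives
    (1 + lam) * theta = (1 - alpha) * mean(real) + alpha * mean(surrogate).
    """
    return ((1.0 - alpha) * fmean(real) + alpha * fmean(surrogate)) / (1.0 + lam)
```

With `lam=0` the minimizer is the convex combination of the two sample means, which is the shrinkage form used in the Gaussian mean example of Section 2.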

For supervised learning tasks, a sample ${\boldsymbol{z}}$ is represented as ${\boldsymbol{z}}=(y,{\boldsymbol{x}})$ , where ${\boldsymbol{x}}\in\mathbb{R}^{d}$ is the covariate vector and $y\in\mathbb{R}$ is the response variable, and ${\boldsymbol{\theta}}$ parametrizes a family of models $f({\boldsymbol{x}};{\boldsymbol{\theta}})$ that predict the response $y$ given the covariate vector ${\boldsymbol{x}}$ . We consider losses of the form $\ell({\boldsymbol{\theta}};{\boldsymbol{z}})=L(y,f({\boldsymbol{x}};{\boldsymbol{\theta}}))$ and $R_{\mbox{\tiny\rm test}}({\boldsymbol{\theta}}):={\mathbb{E}}_{{\boldsymbol{z}}\sim\mathcal{D}}L_{\mbox{\tiny\rm test}}(y,f({\boldsymbol{x}};{\boldsymbol{\theta}}))$ for some functions $L$ and $L_{\mbox{\tiny\rm test}}$ . We allow the test loss $L_{\mbox{\tiny\rm test}}$ to differ from the training loss $L$ , but we will omit the subscript ‘test’ whenever clear from the context.

Figure 1: IMDB and Rotten Tomatoes data and neural networks. Test error when trained on mixtures of original and surrogate data. Black curves: prediction from Eq. (4).

Figure 1 provides a preview of our results for a sentiment analysis task (technical details are provided in Section 4 and Appendix A.2). Each frame corresponds to a different combination of $n$ and $m$ , and we report the test error of our approach as a function of the weight parameter $\alpha$ (red circles). Solid lines report the prediction of a scaling law that is one of the main results presented below.

We observe that the weighted ERM approach systematically achieves better test error than either training only on original data ( $\alpha\to 0$ ) or on surrogate data ( $\alpha\to 1$ ). Further, the error for optimal $\alpha$ is always monotonically decreasing in both $m$ and $n$ , and the approach outperforms the naive unweighted approach. Also, while scaling laws typically do not capture the dependence on hyperparameters, the scaling law presented below predicts the dependence on $\alpha$ reasonably well. This is particularly useful, because it can be used to tune $\alpha$ optimally and to predict the amount of surrogate data needed.

1.2 Summary of results

We study the method outlined above both mathematically and via numerical experiments. Our mathematical results are developed in four different settings: $(i)$ The Gaussian sequence model (Section 3.1); $(ii)$ A non-parametric function estimation setting (Section 3.2); $(iii)$ Low-dimensional empirical risk minimization (Section 3.3); $(iv)$ High-dimensional ridge regression (Section 3.4).

We carry out experiments with the following data sources. $(1)$ Simulated data from linear or Gaussian mixture models: this allows us to explicitly control the distribution shift between the original and surrogate datasets, as well as to check our theoretical results in a controlled setting. $(2)$ Real natural language processing (NLP) data for sentiment analysis, with the role of original dataset played by IMDB reviews and the role of surrogate datasets played respectively by Rotten Tomatoes reviews and Goodreads book reviews. $(3)$ Progression-free survival analysis using Lasso on the TCGA PanCancer dataset, with female patient data and male patient data as original and surrogate data, respectively. $(4)$ Real image classification data, with the CIFAR-10 and CIFAR-100 datasets respectively playing the role of original and surrogate data. Our results support the following conclusions:

Surrogate data improve test error. Including surrogate data in training generally improves the test error on the original data, even if the surrogate data distribution is far from the original one. In agreement with the interpretation of surrogate data as a regularizer (see also Section 2), the improvement is generally positive, although its size depends on the data distributions.

Tuning of $\alpha$ . The above conclusion holds under the condition that $\alpha$ can be tuned (nearly) optimally. For each of the theoretical settings already mentioned, we characterize this optimal value. We verify that nearly optimal $\alpha$ can be effectively selected by minimizing error on a validation split of the original data. An attractive alternative is to use the scaling law we discuss next.

Scaling law. We propose a scaling law that captures the behavior of the test error with $n,m,\alpha$ :

\displaystyle R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))-R_{*}\approx\alpha^{2}R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(\infty)+\big{[}\alpha^{2}\big{(}R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(m)-R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(\infty)\big{)}^{1/\beta}+(1-\alpha)^{2}R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny or}}(n)^{1/\beta}\big{]}^{\beta}\,.

(4)

Here $R_{*}$ is the minimal (Bayes) error, $R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(m):=R(\hat{\boldsymbol{\theta}}_{0,m}(1))-R_{*}$ is the excess test error when training on the surrogate data (and testing on original), $R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny or}}(n):=R(\hat{\boldsymbol{\theta}}_{n,0}(0))-R_{*}$ is the excess test error when training on original data (and testing on original), and $\beta$ is a scaling exponent as described in Section 4. (We assume here that $\lim_{n\to\infty}R(\hat{\boldsymbol{\theta}}_{n,0}(0))=R_{*}$ , i.e. that we achieve Bayes risk with infinitely many original samples; see Section 5.) The above scaling admits natural generalizations, see Section 5.

Practical uses of the scaling law. Given data $\{{\boldsymbol{z}}_{i}\}_{i\leq n}$ and a source of surrogate data, we would like to predict how much the test error can be decreased by including any number $m$ of surrogate samples in the mix. The scaling law (4) suggests a simple approach: $(1)$ Learn models on purely original data to extract the behavior of the test loss $R(\hat{\boldsymbol{\theta}}_{n,0}(0))$ . $(2)$ Learn models on purely surrogate data to extract the behavior of $R(\hat{\boldsymbol{\theta}}_{0,m}(1))$ . (A relatively small sample is sufficient for this step.) $(3)$ Use the minimum over $\alpha$ of Eq. (4) to predict the test error at any given pair $n,m$ .
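Step $(3)$ above can be sketched in a few lines, assuming the three excess-risk quantities and the exponent $\beta$ have already been estimated. The function names, the clipping of a negative gap to zero, and the uniform grid over $\alpha$ are illustrative assumptions.

```python
def predicted_excess_risk(alpha, r_su_m, r_su_inf, r_or_n, beta):
    """Right-hand side of the scaling law, Eq. (4).

    r_su_m   : excess risk when training on m surrogate samples
    r_su_inf : its limit for infinitely many surrogate samples
    r_or_n   : excess risk when training on n original samples
    """
    # Clip at zero so that noisy estimates cannot produce a negative base
    # (an assumption of this sketch, not part of Eq. (4)).
    var_su = max(r_su_m - r_su_inf, 0.0) ** (1.0 / beta)
    var_or = r_or_n ** (1.0 / beta)
    bias = alpha ** 2 * r_su_inf
    return bias + (alpha ** 2 * var_su + (1.0 - alpha) ** 2 * var_or) ** beta


def best_alpha(r_su_m, r_su_inf, r_or_n, beta, grid_size=1001):
    """Weight in [0, 1] minimizing the predicted excess risk over a uniform grid."""
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    return min(
        grid,
        key=lambda a: predicted_excess_risk(a, r_su_m, r_su_inf, r_or_n, beta),
    )
```

As a sanity check, the prediction reduces to `r_or_n` at `alpha=0` and to `r_su_m` at `alpha=1`. For `beta=1`, `r_su_inf=0.5`, `r_su_m=0.6` and `r_or_n=0.2`, the minimizing weight is 0.25 and the predicted excess risk there is 0.15.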

1.3 Related work

The use of surrogate data to enhance training has attracted increasing research effort, in part because of recent progress in generative modeling.

This line of work has largely focused on the techniques to generate synthetic data that are well suited for training. A broad variety of methods have been demonstrated to be useful to generate data for computer vision tasks, ranging from object classification to semantic segmentation [RSM⁺16, JRBM⁺17, AAMM⁺18, TPA⁺18, CLCG19, HSY⁺22, MPT⁺22, YCFB⁺22]. We refer to [SLW20] for a review. More recently, synthetic data have been used for training in natural language processing [HNK⁺22, MHZH22].

Scaling laws have been broadly successful in guiding the development of large machine learning models [HNA⁺17, RRBS19, HKK⁺20, KMH⁺20, TDR⁺21, HKHM21, HBM⁺22, ANZ22, MRB⁺23]. We expect them to be similarly useful in integrating heterogeneous data into training. The change of scaling laws when training on synthetic data was the object of a recent empirical study [FCK⁺23]. On the other hand, no systematic attempt was made at integrating real and synthetic data.

Our results are also relevant to problems in transfer learning. However, that literature focuses on learning shared data representations rather than training a model for the target distribution [MPRP16, TJJ20].

2 Regularization, Gaussian mean estimation, Stein paradox

The role of the parameter $\alpha$ can be understood by considering the limit $m\to\infty$ :

\displaystyle{\widehat{R}_{n,\infty}({\boldsymbol{\theta}};\alpha)=\frac{1-\alpha}{n}\sum_{i=1}^{n}\ell({\boldsymbol{\theta}};{\boldsymbol{z}}_{i})+\alpha\,R^{s}({\boldsymbol{\theta}})+\Omega({\boldsymbol{\theta}})\,},

where $R^{s}({\boldsymbol{\theta}})={\mathbb{E}}_{{\boldsymbol{z}}^{s}\sim\mathcal{D}^{s}}\ell({\boldsymbol{\theta}};{\boldsymbol{z}}^{s})$ is the population risk for surrogate data. This suggests thinking of the surrogate data as an additional (highly non-trivial) regularizer, with parameter $\alpha$ . This leads to a simple yet important insight: adding surrogate data to the original data is beneficial if $\alpha$ is chosen optimally, and large $m$ reduces statistical fluctuations in this regularizer. This contrasts with the unweighted approach, whose test error in general deteriorates for large $m$ .

As a toy example, reconsider the mean estimation problem mentioned in the introduction: ${\boldsymbol{z}}_{i}\sim{\sf N}({\boldsymbol{\theta}}_{*},{\boldsymbol{I}}_{d})$ and ${\boldsymbol{z}}^{s}_{i}\sim{\sf N}({\boldsymbol{\theta}}^{s}_{*},{\boldsymbol{I}}_{d})$ , $\ell({\boldsymbol{\theta}};{\boldsymbol{z}})=\|{\boldsymbol{\theta}}-{\boldsymbol{z}}\|^{2}$ and $R_{\mbox{\tiny\rm test}}({{\boldsymbol{\theta}}})=\|{\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{*}\|^{2}$ . We have $\hat{\boldsymbol{\theta}}_{n,m}(\alpha)=(1-\alpha)\sum_{i\leq n}{{\boldsymbol{z}}_{i}}/n+\alpha\sum_{i\leq m}{{\boldsymbol{z}}^{s}_{i}}/m$ . In other words, the weighted ERM shrinks the mean of the original data towards the mean of the surrogate data. For a given $\alpha$ , the resulting (expected) test error is

\displaystyle{R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))=\alpha^{2}R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(\infty)+\left(\frac{\alpha^{2}}{m}+\frac{(1-\alpha)^{2}}{n}\right)d\,,\;\;\;R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(\infty)=\|{\boldsymbol{\theta}}_{*}-{\boldsymbol{\theta}}_{*}^{s}\|^{2}\,},

(5)

and for the optimum value $\alpha_{*}=\arg\min_{\alpha}R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))$ , this yields

\displaystyle R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha_{*}))=\left(\frac{R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(\infty)+d/m}{R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(\infty)+d/m+d/n}\right)\cdot\frac{d}{n}\,.

(6)
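For completeness, Eq. (6) follows from Eq. (5) by a one-line optimization over $\alpha$ . The derivation below writes $A$ for $R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(\infty)$ ; the explicit expression for $\alpha_{*}$ is implied by Eq. (5) rather than stated in the text.

```latex
\begin{aligned}
R(\alpha) &= \alpha^{2}\Big(A+\frac{d}{m}\Big)+(1-\alpha)^{2}\,\frac{d}{n},
\qquad A := R^{\mathrm{ex}}_{\mathrm{su}}(\infty),\\
0 = R'(\alpha_{*}) &= 2\alpha_{*}\Big(A+\frac{d}{m}\Big)-2(1-\alpha_{*})\,\frac{d}{n}
\;\;\Longrightarrow\;\;
\alpha_{*}=\frac{d/n}{A+d/m+d/n},\\
R(\alpha_{*}) &= \frac{(A+d/m)\,(d/n)}{A+d/m+d/n}.
\end{aligned}
```

The last line is the product of the two variance-plus-bias terms divided by their sum, which is exactly the prefactor form of Eq. (6).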

Note that $d/n$ is the error of training only on original data and the prefactor is always strictly smaller than one. Hence, weighted ERM always achieves better error than training only on original data, regardless of the distance between original and surrogate data, although the improvement is larger for small $R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(\infty)$ . This might seem paradoxical at first. As mentioned above, we are shrinking towards an arbitrary point given by the empirical mean of the surrogate data: how can this help?

In fact, this is a disguised version of the celebrated Stein paradox [EM77, Ste81]: in estimating a Gaussian mean, a procedure that shrinks the empirical mean towards an arbitrary point by a carefully chosen amount outperforms the naive empirical mean. In our toy example, the naive empirical mean corresponds to estimation purely based on the original data, and we shrink it towards the mean of the surrogate data. Of course, the improvement over empirical mean is only possible if $\alpha$ is chosen optimally. Equation (6) assumes $\alpha=\alpha_{*}$ is chosen by an oracle that knows the value of $R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(\infty)$ . Stein’s analysis implies that in the Gaussian mean problem, $\alpha$ can be chosen empirically as long as the dimension of ${\boldsymbol{\theta}}$ is $d\geq 3$ . In the settings we are interested in, $\alpha$ can be chosen via cross-validation.
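The toy example can be checked numerically. The sketch below encodes Eq. (5), the oracle weight that minimizes it, and a small Monte Carlo estimate of the risk of the weighted mean estimator. All function names, the number of trials and the seed are illustrative assumptions.

```python
import random


def risk_eq5(alpha, n, m, d, gap_sq):
    """Eq. (5): expected test error of the weighted mean estimator.

    gap_sq is the squared distance between the original and surrogate means.
    """
    return alpha ** 2 * gap_sq + (alpha ** 2 / m + (1.0 - alpha) ** 2 / n) * d


def optimal_alpha(n, m, d, gap_sq):
    """Oracle weight obtained by setting the derivative of Eq. (5) to zero."""
    return (d / n) / (gap_sq + d / m + d / n)


def simulate_risk(alpha, n, m, theta, theta_s, trials=2000, seed=0):
    """Monte Carlo estimate of E||theta_hat - theta||^2 for identity covariance."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        err = 0.0
        for t, ts in zip(theta, theta_s):
            # The mean of n i.i.d. N(t, 1) draws is N(t, 1/n); sample it directly.
            z_bar = rng.gauss(t, n ** -0.5)
            zs_bar = rng.gauss(ts, m ** -0.5)
            estimate = (1.0 - alpha) * z_bar + alpha * zs_bar
            err += (estimate - t) ** 2
        total += err
    return total / trials
```

For $n=10$ , $m=50$ , $d=5$ and squared gap 1, the oracle weight is 0.3125 and Eq. (5) evaluates to 0.34375, below the value $d/n=0.5$ obtained from original data alone. Plugging the unweighted choice $\alpha=m/(n+m)$ into `risk_eq5` recovers the risk of the naive estimator from the introduction.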

3 Theoretical results

3.1 Gaussian sequence model

The sequence model captures the behavior of many models in non-parametric statistics while being simpler to analyze [Tsy09, GN21]. It is also known to approximate the behavior of overparametrized linear regression [CM22]. The unknown target is ${\boldsymbol{\theta}}_{*}\in\mathbb{R}^{d}$ (with potentially $d=\infty$ ), and we observe

\displaystyle{\boldsymbol{y}}_{i}={\boldsymbol{\theta}}_{*}+\sigma\,{\boldsymbol{g}}_{i},\;i\leq n\,,\;\;\;{\boldsymbol{y}}^{s}_{i}={\boldsymbol{\theta}}^{s}_{*}+\sigma_{s}\,{\boldsymbol{g}}^{s}_{i},\;i\leq m\,,

(7)

where ${\boldsymbol{\theta}}_{*}^{s}$ is also unknown, and ${\boldsymbol{g}}_{i},{\boldsymbol{g}}_{i}^{s}\sim{\sf N}(0,{\boldsymbol{I}}_{d})$ are i.i.d. We study the penalized estimator

\displaystyle\hat{\boldsymbol{\theta}}_{n,m}(\alpha):=\arg\min_{{\boldsymbol{\theta}}}\Big{\{}\frac{(1-\alpha)}{n}\sum_{i=1}^{n}\|{\boldsymbol{y}}_{i}-{\boldsymbol{\theta}}\|^{2}_{2}+\frac{\alpha}{m}\sum_{i=1}^{m}\|{\boldsymbol{y}}_{i}^{s}-{\boldsymbol{\theta}}\|^{2}_{2}+\lambda\|{\boldsymbol{\theta}}\|_{{\boldsymbol{\Omega}}}^{2}\Big{\}}\,,

(8)

where $\|{\boldsymbol{\theta}}\|_{{\boldsymbol{\Omega}}}^{2}=\langle{\boldsymbol{\theta}},{\boldsymbol{\Omega}}{\boldsymbol{\theta}}\rangle$ and ${\boldsymbol{\Omega}}\succeq{\boldsymbol{0}}$ is a regularization weight matrix. We will be concerned with the expected risk

\displaystyle{R_{n,m}(\alpha,\lambda)={\mathbb{E}}\Big{\{}\big{\|}\hat{\boldsymbol{\theta}}_{n,m}(\alpha)-{\boldsymbol{\theta}}_{*}\big{\|}^{2}\Big{\}}\,}.

(9)
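The estimator (8) has a closed form: setting the gradient to zero gives $({\boldsymbol{I}}+\lambda{\boldsymbol{\Omega}})\hat{\boldsymbol{\theta}}=(1-\alpha)\bar{\boldsymbol{y}}+\alpha\bar{\boldsymbol{y}}^{s}$ , where $\bar{\boldsymbol{y}}$ , $\bar{\boldsymbol{y}}^{s}$ are the two sample means. The sketch below works in the eigenbasis of ${\boldsymbol{\Omega}}$ , so that the regularizer is diagonal, and also evaluates the risk (9) exactly as a sum of squared bias and variance per coordinate. Function names and the finite-dimensional, diagonal setting are assumptions made for illustration.

```python
def sequence_estimator(y_bar, ys_bar, omega, alpha, lam):
    """Minimizer of Eq. (8) when Omega = diag(omega), given the sample means."""
    return [
        ((1.0 - alpha) * y + alpha * ys) / (1.0 + lam * w)
        for y, ys, w in zip(y_bar, ys_bar, omega)
    ]


def sequence_risk(theta, theta_s, omega, alpha, lam, n, m, sigma, sigma_s):
    """Exact expected risk, Eq. (9), of the estimator above.

    Each coordinate contributes squared bias plus variance; the noise level
    of the weighted mean is (1-alpha)^2 sigma^2 / n + alpha^2 sigma_s^2 / m.
    """
    noise = (1.0 - alpha) ** 2 * sigma ** 2 / n + alpha ** 2 * sigma_s ** 2 / m
    risk = 0.0
    for t, ts, w in zip(theta, theta_s, omega):
        shrink = 1.0 / (1.0 + lam * w)
        bias = shrink * ((1.0 - alpha) * t + alpha * ts) - t
        risk += bias ** 2 + shrink ** 2 * noise
    return risk
```

With `lam=0` and unit noise levels, `sequence_risk` reduces to the toy-example formula (5).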

The proof of the next result is presented in Appendix C.

Theorem 1.

Let $\omega_{1}\leq\omega_{2}\leq\cdots$ be the ordered eigenvalues of ${\boldsymbol{\Omega}}$ , and denote by ${\boldsymbol{v}}_{i}$ the corresponding eigenvectors. Further denote by ${\boldsymbol{\theta}}_{*,>k}$ , ${\boldsymbol{\theta}}^{s}_{*,>k}$ the projections of ${\boldsymbol{\theta}}_{*}$ , ${\boldsymbol{\theta}}^{s}_{*}$ onto ${\rm span}({\boldsymbol{v}}_{i}:i>k)$ , and similarly for ${\boldsymbol{\theta}}_{*,\leq k}$ , ${\boldsymbol{\theta}}^{s}_{*,\leq k}$ . Assume that $\omega_{k}\asymp k^{\mu}$ , $\mu>1/2$ , $\|{\boldsymbol{\theta}}_{*,>k}\|^{2}\leq C_{\theta}k^{-2\rho}$ , $\rho\neq\mu$ , and let $\Delta_{k}$ be such that (for all $k$ ): $\Delta_{k}:=\omega_{k}^{-1}|\langle{\boldsymbol{\theta}}_{*,\leq k}-{\boldsymbol{\theta}}^{s}_{*,\leq k},{\boldsymbol{\theta}}_{*,\leq k}\rangle_{{\boldsymbol{\Omega}}}|\leq C_{0}k^{-2(\mu\wedge\rho)}$ . Then the following hold:

(a)

There exists an explicit $\lambda_{*}(\alpha)$ such that, letting $\beta:=2(\mu\wedge\rho)/(1+2(\mu\wedge\rho))$ ,

\displaystyle R_{n,m}\big{(}\alpha,\lambda_{*}(\alpha)\big{)}\leq\alpha^{2}R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(\infty)+C\cdot\left[(1-\alpha)^{2}\frac{\sigma^{2}}{n}+\alpha^{2}\frac{\sigma^{2}_{s}}{m}\right]^{\beta}\,.

(10)

(b)

If $\mu>2\rho-1/2$ , there exist $C^{\prime}>0$ and ${\boldsymbol{\theta}}_{*},{\boldsymbol{\theta}}_{*}^{s}$ satisfying the assumptions in point $(a)$ , such that

\displaystyle\min_{\lambda}R_{n,m}\big{(}\alpha,\lambda\big{)}\geq\alpha^{2}R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(\infty)+C^{\prime}\cdot\left[(1-\alpha)^{2}\frac{\sigma^{2}}{n}+\alpha^{2}\frac{\sigma^{2}_{s}}{m}\right]^{\beta}\,.

(11)

Since the theorem also implies $R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(m)-R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(\infty)\asymp(\sigma_{s}^{2}/m)^{\beta}$ and $R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny or}}(n)\asymp(\sigma^{2}/n)^{\beta}$ , this result confirms the scaling law (4).

3.2 Non-parametric regression in Sobolev classes

In this section we consider the classic non-parametric regression model. We assume that $n=Q^{d}$ for some integer $Q\geq 2$ , and the original data $({\boldsymbol{x}}_{i},y_{i})_{i\leq n}$ are defined through

\displaystyle y_{i}=f_{*}({\boldsymbol{x}}_{i})+{\varepsilon}_{i}\,,\;\;\;{\varepsilon}_{i}\sim{\sf N}(0,\sigma^{2})\,,

(12)

where ${\varepsilon}_{i}$ are independent of ${\boldsymbol{x}}_{i}$ and of each other, and $\{{\boldsymbol{x}}_{i}\}_{i\leq n}$ are equally spaced grid points in the $d$ -dimensional unit cube, i.e. ${\mathcal{X}}_{n}=\{{\boldsymbol{q}}/Q:\;\;{\boldsymbol{q}}\in[Q]^{d}\}$ . Surrogate data have a similar distribution, with $m=Q_{s}^{d}$ equally spaced points ${\boldsymbol{x}}^{s}_{i}$ in the unit cube, and $y^{s}_{i}=f_{*,s}({\boldsymbol{x}}^{s}_{i})+{\varepsilon}^{s}_{i}$ , where ${\varepsilon}_{i}^{s}\sim{\sf N}(0,\sigma^{2}_{s})$ . We assume that $f_{*}$ has small Sobolev norm, that is,

\|f_{*}\|^{2}_{r,2}:=\int_{[0,1]^{d}}\big{(}|f_{*}(t)|^{2}+\|f_{*}^{(r)}(t)\|^{2}\big{)}{\rm d}t\leq 1\,.

Recall that $\|f\|^{2}_{r,2}$ is a special reproducing kernel Hilbert space (RKHS) norm: we expect some of the considerations below to generalize to other RKHS norms.

Following our general methodology, we use the estimator

\displaystyle\hat{f}_{n,m,\alpha}=\arg\min_{f}\bigg{\{}\frac{1-\alpha}{n}\sum_{i=1}^{n}\big{(}y_{i}-f({\boldsymbol{x}}_{i})\big{)}^{2}+\frac{\alpha}{m}\sum_{i=1}^{m}\big{(}y^{s}_{i}-f({\boldsymbol{x}}^{s}_{i})\big{)}^{2}+\lambda\|f\|_{p,2}^{2}\bigg{\}}\,.

(13)

We are interested in $R(f)={\mathbb{E}}\{(f({\boldsymbol{x}})-f_{*}({\boldsymbol{x}}))^{2}\}$ , which is the excess squared loss for a test point ${\boldsymbol{x}}\sim{\sf Unif}([0,1]^{d})$ .

In order to avoid technical burden we will carry out the analysis for a continuous model, the so-called white noise model, where we observe the function $f_{*}$ at all points ${\boldsymbol{x}}\in[0,1]^{d}$ , perturbed by $d$ -dimensional white noise:

\displaystyle{\rm d}Y=f_{*}({\boldsymbol{x}})\,{\rm d}{\boldsymbol{x}}+\frac{\sigma}{\sqrt{n}}{\rm d}B({\boldsymbol{x}})\,,

(14)

and similarly for $Y^{s}$ . We use an estimator that naturally generalizes (13) to the continuous case. Our results for the white noise model are as follows.

Theorem 2.

Let $\beta=(2p\wedge 4r)/(d+(2p\wedge 4r))$ . If $r>d/4$ and $\lambda=(\delta K_{n,m}\sigma^{2})^{2r/(d+(2p\wedge 4r))}$ , then for every $\delta\in(0,1)$ there exists a constant $C=C(d,\delta)$ such that

\displaystyle R(\hat{f}_{n,m,\alpha})\leq(1+\delta)\alpha^{2}R_{\mbox{\sf\tiny su}}^{\mbox{\sf\tiny ex}}(\infty)+C\left\{(1-\alpha)^{2}\cdot\frac{\sigma^{2}}{n}+\alpha^{2}\cdot\frac{\sigma_{s}^{2}}{m}\right\}^{\beta}

(15)

with high probability, where $K_{n,m}:=(1-\alpha)^{2}/n+\alpha^{2}/m$ .

Remark 3.1.

The white noise model (14) is known to be equivalent to the original model (12) (with deterministic equispaced designs) in the sense of Le Cam, for $r>d/2$ [BL96, Rei08]. While suggestive, this equivalence does not allow us to formally deduce results for the data (12), because it does not apply to the specific estimators of interest here.

With the given choice of $\lambda$ , $r$ , the derivation of (15) also implies $R_{\mbox{\sf\tiny su}}^{\mbox{\sf\tiny ex}}(m)-R_{\mbox{\sf\tiny su}}^{\mbox{\sf\tiny ex}}(\infty)\geq C^{\prime}(\sigma_{s}^{2}/m)^{\beta}$ , $R_{\mbox{\sf\tiny or}}^{\mbox{\sf\tiny ex}}(n)\geq C^{\prime}(\sigma^{2}/n)^{\beta}$ (for the least favorable $f$ [Tsy09]). Hence (15) is consistent with the scaling law (4).

3.3 Low-dimensional asymptotics

We study the estimator of Eqs. (2), (3) under the classical asymptotics $n,m\to\infty$ at $d$ fixed. Since this type of analysis is more standard, we defer it to Appendix B. The main result of this analysis is that the scaling law (4) holds in this setting, with the classical parametric exponent $\beta=1$ , for $\alpha\in[0,\alpha_{\max}]$ for a suitable $\alpha_{\max}\in(0,1)$ . Importantly, the interval $[0,\alpha_{\max}]$ includes the optimal choice of the weight $\alpha$ .

3.4 High-dimensional linear regression

In this section, we study ridge regression in the high-dimensional regime in which the number of samples is proportional to the number of parameters. Denoting the original data by $({\boldsymbol{y}},{\boldsymbol{X}})$ (with ${\boldsymbol{y}}\in\mathbb{R}^{n}$ the vector of responses and ${\boldsymbol{X}}\in\mathbb{R}^{n\times d}$ the matrix of covariates), and the surrogate data by $({\boldsymbol{y}}^{s},{\boldsymbol{X}}^{s})$ (with ${\boldsymbol{y}}^{s}\in\mathbb{R}^{m}$ and ${\boldsymbol{X}}^{s}\in\mathbb{R}^{m\times d}$ ), we minimize the regularized empirical risk

\displaystyle\widehat{R}_{n,m}({\boldsymbol{\theta}};\alpha)=\frac{1-\alpha}{2n}\|{\boldsymbol{y}}-{\boldsymbol{X}}{\boldsymbol{\theta}}\|^{2}_{2}+\frac{\alpha}{2m}\|{\boldsymbol{y}}^{s}-{\boldsymbol{X}}^{s}{\boldsymbol{\theta}}\|^{2}_{2}+\frac{\lambda}{2}\,\|{\boldsymbol{\theta}}\|^{2}_{2}\,.

(16)

We assume a simple distribution, whereby the rows of ${\boldsymbol{X}}$ , ${\boldsymbol{X}}^{s}$ (denoted by ${\boldsymbol{x}}_{i}$ , ${\boldsymbol{x}}^{s}_{i}$ ) are standard normal vectors and

\displaystyle{\boldsymbol{y}}={\boldsymbol{X}}{\boldsymbol{\theta}}_{*}+{\boldsymbol{{\varepsilon}}}\,,\;\;\;\;\;\;{\boldsymbol{y}}^{s}={\boldsymbol{X}}^{s}{\boldsymbol{\theta}}_{*}^{s}+{\boldsymbol{{\varepsilon}}}^{s}\,,

(17)

for ${\boldsymbol{{\varepsilon}}}\sim{\sf N}({\boldsymbol{0}},\sigma^{2}{\boldsymbol{I}}_{n})$ , ${\boldsymbol{{\varepsilon}}}^{s}\sim{\sf N}({\boldsymbol{0}},\sigma_{s}^{2}{\boldsymbol{I}}_{m})$ . Note that the two data distributions differ in the true coefficient vectors ${\boldsymbol{\theta}}_{*}$ versus ${\boldsymbol{\theta}}_{*}^{s}$ as well as in the noise variance. We will denote by $\hat{\boldsymbol{\theta}}_{n,m}(\alpha)$ the ridge estimator, $\hat{\boldsymbol{\theta}}_{n,m}(\alpha)=\arg\min_{{\boldsymbol{\theta}}\in\mathbb{R}^{d}}\widehat{R}_{n,m}({\boldsymbol{\theta}};\alpha)$ .

The excess test error (for square loss) is given by $R(\hat{\boldsymbol{\theta}}):={\mathbb{E}}\big{\{}\big{(}\langle{\boldsymbol{x}},{\boldsymbol{\theta}}_{*}\rangle-\langle{\boldsymbol{x}},\hat{\boldsymbol{\theta}}\rangle\big{)}^{2}\big{\}}=\|\hat{\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{*}\|^{2}$ . The next result characterizes this error in the proportional asymptotics.
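For concreteness, the ridge estimator minimizing (16) can be computed from its normal equations, $\big(\tfrac{1-\alpha}{n}{\boldsymbol{X}}^{\top}{\boldsymbol{X}}+\tfrac{\alpha}{m}{\boldsymbol{X}}^{s\top}{\boldsymbol{X}}^{s}+\lambda{\boldsymbol{I}}\big)\hat{\boldsymbol{\theta}}=\tfrac{1-\alpha}{n}{\boldsymbol{X}}^{\top}{\boldsymbol{y}}+\tfrac{\alpha}{m}{\boldsymbol{X}}^{s\top}{\boldsymbol{y}}^{s}$ . The sketch below assumes NumPy is available; the function names are illustrative.

```python
import numpy as np


def weighted_ridge(X, y, Xs, ys, alpha, lam):
    """Minimizer of the weighted ridge objective, Eq. (16), via normal equations."""
    n, d = X.shape
    m = Xs.shape[0]
    w_real = (1.0 - alpha) / n
    w_surr = alpha / m
    gram = w_real * (X.T @ X) + w_surr * (Xs.T @ Xs) + lam * np.eye(d)
    rhs = w_real * (X.T @ y) + w_surr * (Xs.T @ ys)
    return np.linalg.solve(gram, rhs)


def excess_risk(theta_hat, theta_star):
    """Excess test error for isotropic covariates: ||theta_hat - theta_star||^2."""
    return float(np.sum((theta_hat - theta_star) ** 2))
```

Evaluating `excess_risk(weighted_ridge(...), theta_star)` over a grid of `alpha` values on simulated data from model (17) gives an empirical risk curve that can be compared with theoretical predictions.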

Theorem 3.

Consider the ridge regression estimator $\hat{\boldsymbol{\theta}}_{n,m}(\alpha)$ . Let $r:=\|{\boldsymbol{\theta}}_{*}\|_{2}$ , $r_{s}:=\|{\boldsymbol{\theta}}_{*}^{s}\|_{2}$ and $\gamma:=\cos^{-1}(\langle{\boldsymbol{\theta}}_{*},{\boldsymbol{\theta}}_{*}^{s}\rangle/(\|{\boldsymbol{\theta}}_{*}\|_{2}\|{\boldsymbol{\theta}}_{*}^{s}\|_{2}))$ . Assume $n,m,d\to\infty$ such that $n/d\to\delta$ , $m/d\to\delta_{s}$ , with $\delta+\delta_{s}>1$ (the same proof, with some additional technical work, yields a characterization for $\delta+\delta_{s}\leq 1$ as well; we omit it here for brevity). For $\mathscrsfs{R}(\cdot)$ defined in Appendix E.1, let

\big{(}\xi^{*}(\alpha),\xi^{*}_{\perp}(\alpha),\omega^{*}(\alpha)\big{)}=\operatorname*{argmin}_{\xi,\xi_{\perp}\geq 0,\omega\geq 0}\mathscrsfs{R}(\xi,\xi_{\perp},\omega,\alpha,r,r_{s},\delta,\delta_{s},\lambda,\gamma,\sigma,\sigma_{s}),

be the unique minimizer. Then for any ${\varepsilon},{\varepsilon}_{0}>0$ , there exists $c>0$ such that, for all $n$ ,

\displaystyle{\mathbb{P}}\Big{(}\sup_{\alpha\in[{\varepsilon}_{0},1-{\varepsilon}_{0}]}\big{|}R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))-\mathscrsfs{R}_{\mbox{\tiny\rm test}}(\alpha)\big{|}\leq{\varepsilon}\Big{)}\geq 1-2\,e^{-cn}\,,

where $\mathscrsfs{R}_{\mbox{\tiny\rm test}}(\alpha):=(\xi^{*}(\alpha)-r)^{2}+(\xi_{\perp}^{*}(\alpha))^{2}+(\omega^{*}(\alpha))^{2}$ . Further, we can take ${\varepsilon}_{0}=0$ if $\delta,\delta_{s}>1$ .

Remark 3.2 (Optimizing $\alpha$ over the validation set).

Note that the concentration of $R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))$ around the theoretical prediction $\mathscrsfs{R}_{\mbox{\tiny\rm test}}(\alpha)$ in Theorem 3 is uniform over $\alpha\in[{\varepsilon}_{0},1-{\varepsilon}_{0}]$ . This means that we can find the optimal $\alpha$ by computing $\hat{\boldsymbol{\theta}}_{n,m}(\alpha)$ over a grid of $\alpha$ values, estimating $R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))$ over the validation set and choosing the $\alpha$ with the smallest estimated risk. The uniform guarantee ensures that this procedure will achieve risk $\min_{\alpha\in[0,1]}\mathscrsfs{R}_{\mbox{\tiny\rm test}}(\alpha)+o_{P}(1)$ .

Remark 3.3 (Relation to scaling laws).

An analysis of the equations for $(\xi^{*},\xi^{*}_{\perp},\omega^{*})$ reveals that, for large $\delta,\delta_{s}$ , the predicted excess risk behaves as $\mathscrsfs{R}_{\mbox{\tiny\rm test}}(\alpha)=\alpha^{2}\mathscrsfs{R}^{*}_{s,\infty}+\alpha^{2}C_{1}/\delta_{s}+(1-\alpha)^{2}C_{2}/\delta+o(1/\delta,1/\delta_{s})$ (for some constants $\mathscrsfs{R}^{*}_{s,\infty},C_{1},C_{2}$ ). This matches the low-dimensional asymptotics and our scaling law (4) with $\beta=1$ . In practice, we find that, for moderate $\delta,\delta_{s}$ , the behavior of $\mathscrsfs{R}_{\mbox{\tiny\rm test}}(\alpha)$ is better approximated by a different value of $\beta$ (see Appendix A).

4 Empirical results

In this section, we present experiments validating that the scaling law (4) is a good approximation for both simulated and real-world data. For simulated data, we select two different distributions for the original and surrogate datasets. The test and validation sets are generated from the same distribution as the original dataset. In the case of real-world data, we choose two different datasets as the original and surrogate datasets. We split the original dataset into train, test, and validation sets, while all examples in the surrogate datasets are allocated solely to the train split.

For each dataset and model discussed in this section, we carry out the same experiment: $(i)$ We use models trained on original data to fit the scaling curve $R(\hat{\boldsymbol{\theta}}_{n,0}(0))=A_{\mbox{\sf\tiny or}}+B_{\mbox{\sf\tiny or}}n^{-\beta_{\mbox{\sf\tiny or}}}$ and obtain $A_{\mbox{\sf\tiny or}}$ and $\beta_{\mbox{\sf\tiny or}}$ . $(ii)$ We use models trained on purely surrogate data to fit the scaling curve $R(\hat{\boldsymbol{\theta}}_{0,m}(1))=A_{\mbox{\sf\tiny su}}+B_{\mbox{\sf\tiny su}}m^{-\beta_{\mbox{\sf\tiny su}}}$ and obtain $A_{\mbox{\sf\tiny su}}$ and $\beta_{\mbox{\sf\tiny su}}$ . $(iii)$ Since we assume $R_{*}=R(\hat{\boldsymbol{\theta}}_{\infty,0}(0))$ , we let $R_{*}=A_{\mbox{\sf\tiny or}}$ and use the excess risk estimates $R_{\mbox{\sf\tiny or}}^{\mbox{\sf\tiny ex}}(n)=R(\hat{\boldsymbol{\theta}}_{n,0}(0))-A_{\mbox{\sf\tiny or}}$ , $R_{\mbox{\sf\tiny su}}^{\mbox{\sf\tiny ex}}(m)=R(\hat{\boldsymbol{\theta}}_{0,m}(1))-A_{\mbox{\sf\tiny or}}$ and $R_{\mbox{\sf\tiny su}}^{\mbox{\sf\tiny ex}}(\infty)=A_{\mbox{\sf\tiny su}}-A_{\mbox{\sf\tiny or}}$ ; we use $\beta=\beta_{\mbox{\sf\tiny or}}$ , the exponent fitted on original data. $(iv)$ For each combination of $n,m$ , we use our estimates of $R_{\mbox{\sf\tiny su}}^{\mbox{\sf\tiny ex}}(m)$ , $R_{\mbox{\sf\tiny or}}^{\mbox{\sf\tiny ex}}(n)$ (as measured empirically on the test set), $\beta$ , $R_{\mbox{\sf\tiny su}}^{\mbox{\sf\tiny ex}}(\infty)$ , and $R_{*}$ to plot the predicted $R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))$ as a function of $\alpha$ using the scaling law (4). $(v)$ We then train the model using $n$ original and $m$ surrogate examples with weights $(1-\alpha)$ and $\alpha$ for the two datasets, respectively. We average the results of 10 independent runs and compare them against those predicted by the scaling law. For ridge regression, we also compare with the exact high-dimensional asymptotics from Theorem 3.
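Steps $(i)$ and $(ii)$ require fitting a curve of the form $A+Bn^{-\beta}$ to measured test errors. The text does not specify the fitting procedure; the sketch below is one simple possibility, which scans a grid of exponents and solves the remaining two-parameter least-squares problem in closed form. The function name and the grid are assumptions.

```python
def fit_power_law(sizes, risks, betas=None):
    """Least-squares fit of risk ~ A + B * size**(-beta).

    For each candidate beta the model is linear in (A, B), so the best
    (A, B) is given by simple linear regression of risk on size**(-beta).
    Returns the (A, B, beta) with the smallest sum of squared errors.
    """
    if betas is None:
        betas = [k / 100 for k in range(5, 201)]  # 0.05, 0.06, ..., 2.00
    count = len(sizes)
    r_mean = sum(risks) / count
    best = None
    for beta in betas:
        u = [s ** (-beta) for s in sizes]
        u_mean = sum(u) / count
        var_u = sum((ui - u_mean) ** 2 for ui in u)
        if var_u == 0.0:
            continue  # degenerate design, e.g. all sizes equal
        cov = sum((ui - u_mean) * (ri - r_mean) for ui, ri in zip(u, risks))
        slope = cov / var_u
        intercept = r_mean - slope * u_mean
        sse = sum((intercept + slope * ui - ri) ** 2 for ui, ri in zip(u, risks))
        if best is None or sse < best[0]:
            best = (sse, intercept, slope, beta)
    _, A, B, beta = best
    return A, B, beta
```

The fitted `A` from original data then plays the role of $R_{*}$ in step $(iii)$ . With noisy measurements an unconstrained fit can return a negative `A`, so in practice one may want to restrict the fit or inspect the result.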

Let us emphasize that these plots probe the dependence on the hyperparameter $\alpha$. These are much more demanding tests than the usual ones in scaling laws. We generally observe that the scaling law captures well the behavior of the test error for data mixtures.
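To make the $\alpha$-dependence concrete, the sketch below evaluates a predicted excess risk curve and locates its minimizer on a grid. Scaling law (4) is stated earlier in the paper and is not reproduced in this section, so the functional form used here is an assumption: it is chosen to reduce to the $\beta=1$ expression given in Appendix B.1 and to return $R_{\mbox{\sf\tiny or}}^{\mbox{\sf\tiny ex}}(n)$ at $\alpha=0$ and $R_{\mbox{\sf\tiny su}}^{\mbox{\sf\tiny ex}}(m)$ at $\alpha=1$. The function names and the grid resolution are also hypothetical.

```python
def predicted_excess_risk(alpha, r_or_n, r_su_m, r_su_inf, beta):
    """Predicted excess risk of the alpha-weighted estimator.

    ASSUMED form of scaling law (4), not copied from the paper:
        alpha^2 * r_su_inf
        + [alpha^2 * (r_su_m - r_su_inf)^(1/beta)
           + (1 - alpha)^2 * r_or_n^(1/beta)]^beta
    """
    gap = max(r_su_m - r_su_inf, 0.0)  # guard against noisy negative estimates
    inner = (alpha ** 2) * gap ** (1.0 / beta) \
        + ((1.0 - alpha) ** 2) * r_or_n ** (1.0 / beta)
    return (alpha ** 2) * r_su_inf + inner ** beta


def best_alpha(r_or_n, r_su_m, r_su_inf, beta, grid_size=1001):
    """Grid minimizer of the predicted excess risk over alpha in [0, 1]."""
    grid = [k / (grid_size - 1) for k in range(grid_size)]
    return min(
        grid,
        key=lambda a: predicted_excess_risk(a, r_or_n, r_su_m, r_su_inf, beta),
    )


# Usage with made-up excess risk estimates (not results from the paper).
alpha_hat = best_alpha(r_or_n=0.2, r_su_m=0.5, r_su_inf=0.3, beta=1.0)
```

Under this assumed form, the predicted curve depends on the data only through the four quantities estimated in steps $(i)$ to $(iii)$, which is what allows the optimal weight to be predicted before any joint training is run.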

Binary classification with Gaussian mixture data

This is a simple simulated setting. The original dataset consists of independent and identically distributed examples $(y_{i},{\boldsymbol{x}}_{i})\in\mathbb{R}\times\mathbb{R}^{d}$, $d=200$, where $y_{i}$ is uniform over $\{+1,-1\}$, and ${\boldsymbol{x}}_{i}\big|_{y_{i}}\sim{\sf N}(y_{i}{\boldsymbol{\theta}}_{*},{\boldsymbol{I}}_{d})$, where ${\boldsymbol{\theta}}_{*}\in\mathbb{R}^{d}$, $\|{\boldsymbol{\theta}}_{*}\|=1$. Surrogate data have the same distribution, with a different unit vector ${\boldsymbol{\theta}}_{*,s}$. This data distribution is parametrized by $d$ and the angle $\gamma$ between the original and surrogate parameters, $\cos\gamma:=\langle{\boldsymbol{\theta}}_{*},{\boldsymbol{\theta}}_{*,s}\rangle$. We use $\gamma=\pi/10$ in our experiments. For each $(n,m,\alpha)$, we average the results over 10 independent runs.
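The data model above can be sampled in a few lines. The following is a minimal sketch using only the standard library; the particular choice of ${\boldsymbol{\theta}}_{*}$ along the first coordinate axis, the sample sizes, and the function names are assumptions for illustration (since the noise is isotropic, only the angle $\gamma$ matters).

```python
import math
import random


def make_directions(d, gamma):
    """Return unit vectors theta_star, theta_star_s in R^d at angle gamma.

    Placing theta_star on the first axis is an arbitrary illustrative choice.
    """
    theta = [0.0] * d
    theta_s = [0.0] * d
    theta[0] = 1.0
    theta_s[0] = math.cos(gamma)
    theta_s[1] = math.sin(gamma)
    return theta, theta_s


def sample_mixture(num, theta, rng):
    """Draw (y, x) with y uniform on {+1, -1} and x | y ~ N(y * theta, I_d)."""
    data = []
    for _ in range(num):
        y = rng.choice([1, -1])
        x = [y * t + rng.gauss(0.0, 1.0) for t in theta]
        data.append((y, x))
    return data


rng = random.Random(0)
theta_star, theta_star_s = make_directions(d=200, gamma=math.pi / 10)
original = sample_mixture(100, theta_star, rng)     # n original examples
surrogate = sample_mixture(300, theta_star_s, rng)  # m surrogate examples
```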

We use two different models for classification: $(1)$ logistic regression; $(2)$ a one-hidden-layer neural network with 32 hidden ReLU neurons. Results for both models are presented in Appendix A.1.

Sentiment analysis in movie reviews

As original data, we use the IMDB dataset (link) which has 25k reviews for training, each labeled as positive or negative. For validation and testing, we split the IMDB test dataset of 25k reviews into a validation set of 10k reviews and test set of 15k reviews.

We experiment with two different surrogate datasets: 1) the Rotten Tomatoes dataset of movie reviews (link): these are data with a different distribution but within the same domain. This dataset contains movie reviews and the corresponding sentiments. 2) Goodreads book reviews (link): these are data from a substantially different domain. This dataset has reviews and their ratings. We choose 10k reviews with a rating of 5 and 10k reviews with a rating of 1, and label them as positive and negative, respectively.

We convert reviews into feature vectors with $d=884$ dimensions as explained in Appendix A.2. We use logistic regression and neural network models with the same set of parameters as in the Gaussian mixture experiments (except for the input dimension).

Results with neural nets and Rotten Tomatoes as the surrogate dataset are presented in Figure 1, and the remaining results are in Appendix A.2.

Image classification with CIFAR10 and CIFAR100

We use 50,000 CIFAR10 training images as original data, its 10 classes for the classification task, and test on the 10,000 CIFAR10 test images. We use 50,000 CIFAR100 training images as surrogate data. We train a 9-layer ResNet model for classification. Appendix A.3 presents details on the data pre-processing and mapping of labels. Results are shown in Figure 2. Note that the CIFAR10 and CIFAR100 datasets are quite different from each other, as they have no overlap either in the images or in their label sets. Yet, the test error when training on their mixture is well predicted by the scaling law (4).

Lasso-based Cox regression on TCGA PanCancer dataset

We use the public domain TCGA pancancer dataset [GCH⁺20] (link), with gene expressions as covariates and progression-free survival (PFS) as response. After filtering and feature selection, we are left with 3580 female patients, which we use as original data, and 3640 male patients, which we use as surrogate data. We fit a CoxPHFitter model (link) with 500 selected genes and use “1-concordance score” as our loss function. The results are shown in Figure 3. The details of pre-processing and experiment parameters are in Appendix A.4.³

³ We observe that training at $\alpha=1$ yields a somewhat singular behavior: we use $\alpha=0.95$ as a proxy for $\alpha=1$; see the appendices.

High-dimensional ridge regression

We simulate the data distribution in Section 3.4, i.e., $y_{i}=\langle{\boldsymbol{\theta}}_{*},{\boldsymbol{x}}_{i}\rangle+{\varepsilon}_{i}$, $i\leq n$; $y^{s}_{i}=\langle{\boldsymbol{\theta}}_{*,s},{\boldsymbol{x}}^{s}_{i}\rangle+{\varepsilon}^{s}_{i}$, $i\leq m$; with ${\boldsymbol{x}}_{i},{\boldsymbol{x}}_{i}^{s}\sim{\sf N}({\boldsymbol{0}},{\boldsymbol{I}}_{d})$, ${\varepsilon}_{i}\sim{\sf N}(0,\sigma^{2})$, ${\varepsilon}^{s}_{i}\sim{\sf N}(0,\sigma_{s}^{2})$, and fit a simple linear model using ridge regression. The results are shown in Figure 4. In our experiments, we use $d=500$, $\sigma^{2}=\sigma_{s}^{2}=1$, $\|{\boldsymbol{\theta}}_{*}\|=\|{\boldsymbol{\theta}}_{*,s}\|=1$ and regularization parameter $\lambda=2^{-10}$. Under these settings, the model is parametrized by the angle $\gamma$ between ${\boldsymbol{\theta}}_{*}$ and ${\boldsymbol{\theta}}_{*,s}$, where $\cos\gamma:=\langle{\boldsymbol{\theta}}_{*},{\boldsymbol{\theta}}_{*,s}\rangle$. We used $\gamma=\pi/6$ and $\pi/2$ in our experiments.⁴

⁴ For ridge regression simulations, we directly plot the excess test risks, as the parameter ${\boldsymbol{\theta}}_{*}$ for original data is known. For any $\hat{\boldsymbol{\theta}}$ the excess test risk in this model is simply $\|{\boldsymbol{\theta}}_{*}-\hat{\boldsymbol{\theta}}\|^{2}$.

The theoretical predictions of Theorem 3 for these curves in the high-dimensional asymptotics $n,m,d\to\infty$, with $n/d\to\delta$, $m/d\to\delta_{s}$, are reported as blue lines, and match the empirical data remarkably well. The simple scaling law (4) nevertheless provides a good approximation of these (more complicated) theoretical formulas.

Note in particular that in the top row of Figure 4, we have $\langle{\boldsymbol{\theta}}_{*},{\boldsymbol{\theta}}_{*,s}\rangle=0$ , i.e. the surrogate data are as far as possible from the original ones. Nevertheless, the induced regularization effect leads to smaller test error on the original distribution.
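For concreteness, the weighted ridge estimator fitted in these simulations admits a closed form. The display below is a sketch under an assumed normalization (squared loss without a factor $1/2$, penalty $\lambda\|{\boldsymbol{\theta}}\|_{2}^{2}$, and weights $(1-\alpha)/n$ and $\alpha/m$ on the two empirical risks); the actual convention is the one fixed by Eqs. (2), (3) and Section 3.4, and may differ by constant factors in $\lambda$. The matrices ${\boldsymbol{X}}\in\mathbb{R}^{n\times d}$, ${\boldsymbol{X}}_{s}\in\mathbb{R}^{m\times d}$ and vectors ${\boldsymbol{y}}$, ${\boldsymbol{y}}_{s}$, which stack the original and surrogate data, are notation introduced only for this illustration.

$$\begin{aligned}
\hat{\boldsymbol{\theta}}_{n,m}(\alpha)
&=\arg\min_{{\boldsymbol{\theta}}}\Big\{\frac{1-\alpha}{n}\|{\boldsymbol{y}}-{\boldsymbol{X}}{\boldsymbol{\theta}}\|_{2}^{2}+\frac{\alpha}{m}\|{\boldsymbol{y}}_{s}-{\boldsymbol{X}}_{s}{\boldsymbol{\theta}}\|_{2}^{2}+\lambda\|{\boldsymbol{\theta}}\|_{2}^{2}\Big\}\\
&=\Big(\frac{1-\alpha}{n}{\boldsymbol{X}}^{\sf T}{\boldsymbol{X}}+\frac{\alpha}{m}{\boldsymbol{X}}_{s}^{\sf T}{\boldsymbol{X}}_{s}+\lambda{\boldsymbol{I}}_{d}\Big)^{-1}\Big(\frac{1-\alpha}{n}{\boldsymbol{X}}^{\sf T}{\boldsymbol{y}}+\frac{\alpha}{m}{\boldsymbol{X}}_{s}^{\sf T}{\boldsymbol{y}}_{s}\Big)\,.
\end{aligned}$$

Heuristically, when $m\gg d$ we have ${\boldsymbol{X}}_{s}^{\sf T}{\boldsymbol{X}}_{s}/m\approx{\boldsymbol{I}}_{d}$, so the surrogate block acts roughly like an extra ridge penalty of strength $\alpha$ combined with a pull toward ${\boldsymbol{\theta}}_{*,s}$. This is one way to read the induced regularization effect mentioned above.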

We observe that the proposed scaling law (4) predicts well the behavior of the experiments, across all of the datasets above, and for most combinations of original and surrogate examples we have tested.

Finally, we emphasize that the scaling law is only an empirical approximation of reality. This is clearly illustrated by the example of ridge regression: in this case, we use Theorem 3 to precisely quantify the discrepancy between the exact asymptotics and the scaling law; see Appendix A.5.

5 Discussion

We conclude by discussing two possible generalizations of the scaling law (4), and its applicability. First, throughout this paper we assumed that $R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny or}}(\infty)=0$, namely that we can achieve the Bayes error by training on infinitely many original samples. In practice this will not hold because of the limited model complexity. Following standard scaling laws [KMH⁺20, HBM⁺22], this effect can be accounted for by an additional term $C\cdot N^{-\omega}$, where $N$ is the model size (number of parameters). Second, the scaling law (4) implies as special cases that $R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny or}}(n)\approx A_{\mbox{\sf\tiny or}}n^{-\beta}$, $R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(m)\approx R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(\infty)+A_{\mbox{\sf\tiny su}}m^{-\beta}$. In particular, the exponent $\beta$ is the same when training on real or surrogate data. In practice, we often observe two somewhat different exponents $\beta_{\mbox{\sf\tiny or}}\neq\beta_{\mbox{\sf\tiny su}}$. In these cases, we set $\beta=\beta_{\mbox{\sf\tiny or}}$, and this appears to work reasonably well. However, we can imagine cases in which the difference between $\beta_{\mbox{\sf\tiny or}}$ and $\beta_{\mbox{\sf\tiny su}}$ is significant enough that (4) stops being accurate.

References

[AAMM⁺18] Hassan Abu Alhaija, Siva Karthik Mustikovela, Lars Mescheder, Andreas Geiger, and Carsten Rother, Augmented reality meets computer vision: Efficient data generation for urban driving scenes, International Journal of Computer Vision 126 (2018), 961–972.
[ANZ22] Ibrahim M Alabdulmohsin, Behnam Neyshabur, and Xiaohua Zhai, Revisiting neural scaling laws in language and vision, Advances in Neural Information Processing Systems 35 (2022), 22300–22312.
[Bir06] Steven Bird, Nltk: the natural language toolkit, Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, 2006, pp. 69–72.
[BL96] Lawrence D Brown and Mark G Low, Asymptotic equivalence of nonparametric regression and white noise, The Annals of Statistics 24 (1996), no. 6, 2384–2398.
[CLCG19] Yuhua Chen, Wen Li, Xiaoran Chen, and Luc Van Gool, Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1841–1850.
[CM22] Chen Cheng and Andrea Montanari, Dimension free ridge regression, arXiv:2210.08571 (2022).
[EM77] Bradley Efron and Carl Morris, Stein’s paradox in statistics, Scientific American 236 (1977), no. 5, 119–127.
[FCK⁺23] Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina Katabi, Phillip Isola, and Yonglong Tian, Scaling laws of synthetic images for model training… for now, arXiv preprint arXiv:2312.04567 (2023).
[GCH⁺20] Mary J Goldman, Brian Craft, Mim Hastie, Kristupas Repečka, Fran McDade, Akhil Kamath, Ayan Banerjee, Yunhai Luo, Dave Rogers, Angela N Brooks, et al., Visualizing and interpreting cancer genomics data via the xena platform, Nature biotechnology 38 (2020), no. 6, 675–678.
[GN21] Evarist Giné and Richard Nickl, Mathematical foundations of infinite-dimensional statistical models, Cambridge University Press, 2021.
[Gor85] Yehoram Gordon, Some inequalities for gaussian processes and applications, Israel Journal of Mathematics 50 (1985), no. 4, 265–289.
[HBM⁺22] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al., Training compute-optimal large language models, arXiv preprint arXiv:2203.15556 (2022).
[HKHM21] Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish, Scaling laws for transfer, arXiv preprint arXiv:2102.01293 (2021).
[HKK⁺20] Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al., Scaling laws for autoregressive generative modeling, arXiv preprint arXiv:2010.14701 (2020).
[HNA⁺17] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou, Deep learning scaling is predictable, empirically, arXiv preprint arXiv:1712.00409 (2017).
[HNK⁺22] Xuanli He, Islam Nassar, Jamie Kiros, Gholamreza Haffari, and Mohammad Norouzi, Generate, annotate, and learn: Nlp with synthetic text, Transactions of the Association for Computational Linguistics 10 (2022), 826–842.
[HSY⁺22] Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing Zhang, Philip Torr, Song Bai, and Xiaojuan Qi, Is synthetic data from generative models ready for image recognition?, arXiv preprint arXiv:2210.07574 (2022).
[JRBM⁺17] Matthew Johnson-Roberson, Charles Barto, Rounak Mehta, Sharath Nittur Sridhar, Karl Rosaen, and Ram Vasudevan, Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks?, 2017 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2017, pp. 746–753.
[KMH⁺20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei, Scaling laws for neural language models, arXiv preprint arXiv:2001.08361 (2020).
[MHZH22] Yu Meng, Jiaxin Huang, Yu Zhang, and Jiawei Han, Generating training data with language models: Towards zero-shot language understanding, Advances in Neural Information Processing Systems 35 (2022), 462–477.
[MM21] Léo Miolane and Andrea Montanari, The distribution of the lasso: Uniform control over sparse balls and adaptive parameter tuning, The Annals of Statistics 49 (2021), no. 4, 2313–2335.
[MPRP16] Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes, The benefit of multitask representation learning, Journal of Machine Learning Research 17 (2016), no. 81, 1–32.
[MPT⁺22] Arthur Moreau, Nathan Piasco, Dzmitry Tsishkou, Bogdan Stanciulescu, and Arnaud de La Fortelle, Lens: Localization enhanced by nerf synthesis, Conference on Robot Learning, PMLR, 2022, pp. 1347–1356.
[MRB⁺23] Niklas Muennighoff, Alexander M Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel, Scaling data-constrained language models, arXiv preprint arXiv:2305.16264 (2023).
[Rei08] Markus Reiß, Asymptotic equivalence for nonparametric regression with multivariate and random design, The Annals of Statistics (2008), 1957–1982.
[RRBS19] Jonathan S Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit, A constructive prediction of the generalization error across scales, International Conference on Learning Representations, 2019.
[RSM⁺16] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez, The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes, Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3234–3243.
[SLW20] Viktor Seib, Benjamin Lange, and Stefan Wirtz, Mixing real and synthetic data to enhance neural network training–a review of current approaches, arXiv preprint arXiv:2007.08781 (2020).
[Ste81] Charles M Stein, Estimation of the mean of a multivariate normal distribution, The annals of Statistics (1981), 1135–1151.
[TAH18] Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi, Precise error analysis of regularized $m$ -estimators in high dimensions, IEEE Transactions on Information Theory 64 (2018), no. 8, 5592–5628.
[TDR⁺21] Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, and Donald Metzler, Scale efficiently: Insights from pretraining and finetuning transformers, International Conference on Learning Representations, 2021.
[TJJ20] Nilesh Tripuraneni, Michael Jordan, and Chi Jin, On the theory of transfer learning: The importance of task diversity, Advances in neural information processing systems 33 (2020), 7852–7862.
[TOH15] Christos Thrampoulidis, Samet Oymak, and Babak Hassibi, Regularized linear regression: A precise analysis of the estimation error, Proceedings of Machine Learning Research 40 (2015), 1683–1709.
[TPA⁺18] Jonathan Tremblay, Aayush Prakash, David Acuna, Mark Brophy, Varun Jampani, Cem Anil, Thang To, Eric Cameracci, Shaad Boochoon, and Stan Birchfield, Training deep networks with synthetic data: Bridging the reality gap by domain randomization, Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 969–977.
[Tsy09] Alexandre B. Tsybakov, Introduction to nonparametric estimation, Springer, 2009.
[vdV00] Aad W van der Vaart, Asymptotic statistics, Cambridge University Press, 2000.
[Ver18] Roman Vershynin, High-dimensional probability: An introduction with applications in data science, vol. 47, Cambridge university press, 2018.
[YCFB⁺22] Lin Yen-Chen, Pete Florence, Jonathan T Barron, Tsung-Yi Lin, Alberto Rodriguez, and Phillip Isola, Nerf-supervision: Learning dense object descriptors from neural radiance fields, 2022 International Conference on Robotics and Automation (ICRA), IEEE, 2022, pp. 6496–6503.

Appendix A Details of empirical results

A.1 Binary classification with Gaussian mixture data

We provide details for the models used in the Gaussian mixture simulations of Section 4.

Logistic regression: We use the scikit-learn implementation with the lbfgs solver, fitting the intercept, with maximum iterations set to 10k. For each run of each $(n,m,\alpha)$ combination, we set the $\ell_{2}$ penalty (the parameter C in scikit-learn, which is the inverse of the regularization strength) to $2^{i}$, $i=-8,\ldots,8$, and $10^{i}$, $i=-6,-5,-4,-3,3,4,5,6$, and only report the test result for the value that achieves the best validation error. The results of the individual scaling law estimates and the comparison of joint training results with the scaling law predictions are shown in Figures 5 and 6.
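The selection procedure just described can be sketched as follows. This is an illustration rather than the code used for the experiments: `fit`, `val_error` and `test_error` are hypothetical callables standing in for model training and evaluation, and only the grid of candidate values is taken from the text.

```python
def penalty_grid():
    """Candidate values: 2^i for i = -8..8 and 10^i for i in {-6..-3, 3..6}."""
    powers_of_two = [2.0 ** i for i in range(-8, 9)]
    powers_of_ten = [10.0 ** i for i in (-6, -5, -4, -3, 3, 4, 5, 6)]
    return sorted(set(powers_of_two + powers_of_ten))


def select_by_validation(candidates, fit, val_error, test_error):
    """Fit one model per candidate value and report the test error of the
    model with the smallest validation error.

    fit, val_error and test_error are placeholders supplied by the caller.
    """
    best_value, best_model, best_val = None, None, float("inf")
    for value in candidates:
        model = fit(value)
        err = val_error(model)
        if err < best_val:
            best_value, best_model, best_val = value, model, err
    return best_value, test_error(best_model)


grid = penalty_grid()
```

Selecting on the validation split and reporting on the test split keeps the reported error free of selection bias, which matters here because the selection is repeated for every $(n,m,\alpha)$ combination.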

Neural network: The network has one hidden layer with 32 ReLU neurons, and an output neuron using sigmoid. For training, we use the binary cross entropy loss, a constant learning rate of 0.05, and batch size 64. We train the network for 1,000 epochs. Similar to the procedure in logistic regression, we use $\ell_{2}$ regularization (weight decay) and use the validation set to choose the best regularization parameter from the set $\{0,10^{-5},10^{-4},10^{-3},2\cdot 10^{-3},4\cdot 10^{-3},10^{-2},2\cdot 10^{-2},4\cdot 10^{-2},10^{-1},2\cdot 10^{-1},4\cdot 10^{-1}\}$. The results of the individual scaling law estimates and the comparison of joint training results with the scaling law predictions are shown in Figures 7 and 8.

A.2 Sentiment analysis in movie reviews

To convert the movie reviews and book reviews to vectors, we use a combination of two different embeddings. First, we take all the reviews in the training data and use the nltk tagger [Bir06] to find the 500 most frequent adjectives appearing in the samples used for training. We then apply a tf-idf vectorizer (scikit-learn’s implementation) with these 500 adjectives as vocabulary. This gives a 500-dimensional vector for each review. Second, we apply the “Paraphrase-MiniLM-L6-v2” sentence transformer, which is based on BERT with 6 Transformer encoder layers and returns a 384-dimensional vector representation of each review. For each review, we concatenate the outputs of the tf-idf vectorizer and the sentence transformer to get an 884-dimensional representation that we use as our input vector.

We use logistic regression and neural networks with the same set of parameters as in the Gaussian mixture experiments (except for the input dimension). We plot the average loss over 10 independent runs.

Results omitted from the main text are presented in Figures 9–13.

A.3 Image classification with CIFAR10 and CIFAR100

We largely use the model and the training procedure described at https://jovian.ml/aakashns/05b-cifar10-resnet. We normalize the images for mean and standard deviation. We train a 9-layer ResNet model for classification, using Adam for optimization, weight decay, and gradient clipping, trained over 16 epochs with a one-cycle learning rate scheduling policy, minimizing cross entropy loss. For each combination of $m$, $n$, and $\alpha$, we report the average test error over $10$ runs. Since there is no overlap between the label sets of CIFAR10 and CIFAR100, the latter dataset needs to be relabeled. We do this by training a separate 9-layer ResNet model on 10,000 randomly chosen CIFAR10 images from the training set of 50,000 examples (without creating a separate split for them), and use its predictions on CIFAR100 images as labels.

Scaling curves are presented in Figures 14 and 2.

A.4 Lasso on TCGA PanCancer dataset

We used the public domain TCGA pancancer dataset. After filtering out samples with incomplete values, we are left with 9220 patients, each having 20,531 gene expression values, with PFS (progression-free survival) as the outcome. Out of these, we used a group of 2000 patients, split into train and test sets of 1000 each, to select the 500 genes having the largest absolute Cox PH score. We also used the mean and standard deviation of the gene expression values of these 2000 patients to normalize the gene expression columns for the remaining 7220 patients. Among the remaining 7220 patients, 3580 were female. We treated the female patients' data as original data, and split them into train (50$\%$), test (25$\%$) and validation (25$\%$) splits. The remaining 3640 patients' data were used as the surrogate dataset. We fit a CoxPHFitter model (link) with 500 selected genes and use “1-concordance score” as our loss function. We used the validation split to choose the best value of the $\ell_{1}$ penalty parameter from $2^{i}$, $i=2,0,-2,-4,-6,-8,-10,-12,-14,-16$, in the model. We observed a discontinuity at $\alpha=1$. To avoid this discontinuity, we approximated $R(\hat{\boldsymbol{\theta}}_{n,m}(1))$ by $R(\hat{\boldsymbol{\theta}}_{n,m}(1-\epsilon))$ if $n>0$ and by $R(\hat{\boldsymbol{\theta}}_{m/2,m}(1-\epsilon))$ if $n=0$, where we choose $\epsilon=0.05$. We plot the average loss over 10 independent runs. The results are presented in Figures 15 and 3.
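To clarify what the “1-concordance score” loss measures, the sketch below computes one minus Harrell's concordance index for right-censored data in plain Python. It is a stand-in for illustration, not the routine used in the experiments: the function name, the convention that larger scores mean earlier expected progression, and the simple handling of ties are all assumptions.

```python
def one_minus_concordance(times, events, risk_scores):
    """Return 1 - C, where C is Harrell's concordance index.

    times:       observed times (progression or censoring)
    events:      1 if progression was observed, 0 if censored
    risk_scores: model outputs; ASSUMED convention: larger score means
                 earlier expected progression
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0   # ordering of scores matches the outcome
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5   # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return 1.0 - concordant / comparable


# Usage on a toy example: a perfectly ordered risk score gives loss 0.
loss = one_minus_concordance([1, 2, 3, 4], [1, 1, 1, 1], [4.0, 3.0, 2.0, 1.0])
```

A loss of 0.5 corresponds to a score that ranks patients no better than chance, so values reported in this experiment are naturally compared against that baseline.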

A.5 High-dimensional ridge regression

We present additional ridge regression experiments here in Figs. 16–27. We plot the average loss over 10 independent runs. In these experiments, as in the main paper, we set $d=500$ , $\sigma^{2}=\sigma_{s}^{2}=1$ , $\|{\boldsymbol{\theta}}_{*}\|=1$ , $\|{\boldsymbol{\theta}}_{*,s}\|=1$ , except for the last four Figs. 24–27, where we use $\|{\boldsymbol{\theta}}_{*,s}\|=1/2$ . We used angle $\gamma=\pi/6$ and $\pi/2$ in our experiments.

We consider two methods: $(1)$ fix $\lambda$ to a very small value, $2^{-10}$; and $(2)$ for each random draw of the datasets, select the $\lambda$ that achieves the best validation performance. For the latter method, we try $\lambda=2^{i}$, where $i=-10,-8,-6,\ldots,8,10$. For ridge regression simulations, we directly plot the excess test risks, as the parameter ${\boldsymbol{\theta}}_{*}$ for original data is known and for any $\hat{\boldsymbol{\theta}}$ the excess test risk in this model is $\|{\boldsymbol{\theta}}_{*}-\hat{\boldsymbol{\theta}}\|^{2}$.

Appendix B Low-dimensional asymptotics

B.1 Formal statements

In this appendix, we present our results on the estimator of Eqs. (2), (3) under the classical asymptotics $n,m\to\infty$ at $d$ fixed. For simplicity, we assume no regularizer is used in this regime.

Beyond the classical regularity assumptions of low-dimensional asymptotics, in this section we will make the following assumption, which guarantees that the original and surrogate distributions are ‘not arbitrarily far.’ Recall that $R^{s}({\boldsymbol{\theta}})$ denotes the population error on surrogate data.

Assumption 1 (Distribution shift for low- $d$ asymptotics).

There exists a constant $K_{*}$ such that for all ${\boldsymbol{\theta}}\in\mathbb{R}^{d}$ ,

$$\big|R^{s}({\boldsymbol{\theta}})-R({\boldsymbol{\theta}})\big|\leq K_{*}\big(1+R({\boldsymbol{\theta}})\big)\,. \tag{18}$$

The regularity conditions are similar to the ones in [vdV00]. Here and in the following ${\sf B}({\boldsymbol{\theta}}_{*},r)$ is the ball of radius $r$ centered at ${\boldsymbol{\theta}}_{*}$ .

Assumption 2 (‘Classical’ regularity).

$(a)$

The original population risk $R({\boldsymbol{\theta}})$ is uniquely minimized at a point ${\boldsymbol{\theta}}_{*}$ .

$(b)$

${\boldsymbol{\theta}}\mapsto\ell({\boldsymbol{\theta}};{\boldsymbol{z}})$ is non-negative lower semicontinuous. Further, define the following limit in $[0,\infty]$ for ${\boldsymbol{u}}\in{\mathbb{S}}^{d-1}$ :

$$\ell_{\infty}({\boldsymbol{u}};{\boldsymbol{z}}):=\liminf_{\begin{subarray}{c}{\boldsymbol{\theta}}\to\infty\\ {\boldsymbol{\theta}}/\|{\boldsymbol{\theta}}\|_{2}\to{\boldsymbol{u}}\end{subarray}}\ell({\boldsymbol{\theta}};{\boldsymbol{z}})\,. \tag{19}$$

Then we assume $\inf_{{\boldsymbol{u}}\in{\mathbb{S}}^{d-1}}{\mathbb{E}}\ell_{\infty}({\boldsymbol{u}};{\boldsymbol{z}})\geq R({\boldsymbol{\theta}}_{*})+c$ for some $c>0$.

$(c)$

${\boldsymbol{\theta}}\mapsto\ell({\boldsymbol{\theta}};{\boldsymbol{z}})$ is differentiable at ${\boldsymbol{\theta}}_{*}$ almost surely, both under ${\boldsymbol{z}}\sim{\mathbb{P}}$ and under ${\boldsymbol{z}}\sim{\mathbb{P}}^{s}$ . Further, there exists $r>0$ such that, letting ${\sf B}:={\sf B}({\boldsymbol{\theta}}_{*},r)$ , the following holds for a constant $C$ :

$${\mathbb{E}}\sup_{{\boldsymbol{\theta}}_{1}\neq{\boldsymbol{\theta}}_{2}\in{\sf B}}\Big\{\frac{|\ell({\boldsymbol{\theta}}_{1};{\boldsymbol{z}})-\ell({\boldsymbol{\theta}}_{2};{\boldsymbol{z}})|^{2}}{\|{\boldsymbol{\theta}}_{1}-{\boldsymbol{\theta}}_{2}\|^{2}_{2}}\Big\}\leq C<\infty\,. \tag{20}$$

$(d)$

The functions ${\boldsymbol{\theta}}\mapsto R({\boldsymbol{\theta}})$ , ${\boldsymbol{\theta}}\mapsto R^{s}({\boldsymbol{\theta}})$ , are twice differentiable in a neighborhood of ${\boldsymbol{\theta}}_{*}$ , with Lipschitz continuous Hessian. Further $\nabla^{2}R({\boldsymbol{\theta}}_{*})\succ{\boldsymbol{0}}$ (strictly positive definite).

Proposition B.1.

Under Assumption 1 and Assumption 2, define the following $d\times d$ matrices

$${\boldsymbol{H}}:=\nabla^{2}R({\boldsymbol{\theta}}_{*})={\mathbb{E}}[\nabla^{2}\ell({\boldsymbol{\theta}}_{*};{\boldsymbol{z}})]\,, \tag{21}$$

$${\boldsymbol{K}}:={\rm Cov}\big(\nabla\ell({\boldsymbol{\theta}}_{*};{\boldsymbol{z}});\nabla\ell({\boldsymbol{\theta}}_{*};{\boldsymbol{z}})\big)\,, \tag{22}$$

$${\boldsymbol{K}}_{s}:={\rm Cov}_{s}\big(\nabla\ell({\boldsymbol{\theta}}_{*};{\boldsymbol{z}}^{s});\nabla\ell({\boldsymbol{\theta}}_{*};{\boldsymbol{z}}^{s})\big)\,, \tag{23}$$

where ${\rm Cov}$ , ${\rm Cov}_{s}$ denote the covariances, respectively, with respect to the original data (i.e., with respect to ${\boldsymbol{z}}\sim{\mathbb{P}}$ ), and with respect to the surrogate data (i.e., with respect to ${\boldsymbol{z}}^{s}\sim{\mathbb{P}}_{s}$ ). Further define the $d$ -dimensional vector

$${\boldsymbol{g}}^{s}:=\nabla R^{s}({\boldsymbol{\theta}}_{*})-\nabla R({\boldsymbol{\theta}}_{*})\,. \tag{24}$$

Then there exists $\alpha_{\max}\in(0,1]$ (depending only on the constants in the assumptions) such that, for all $\alpha\in[0,\alpha_{\max}]$ , the excess risk of the estimator $\hat{\boldsymbol{\theta}}_{n,m}(\alpha)$ satisfies (for $D:=\|{\boldsymbol{g}}^{s}\|$ bounded by a constant)

$$\begin{aligned}
R\big(\hat{\boldsymbol{\theta}}_{n,m}(\alpha)\big)-R\big({\boldsymbol{\theta}}_{*}\big)
&=\;\alpha^{2}\langle{\boldsymbol{g}}^{s},{\boldsymbol{H}}^{-1}{\boldsymbol{g}}^{s}\rangle+\frac{(1-\alpha)^{2}}{n}\cdot\mathrm{Tr}\big({\boldsymbol{H}}^{-1}{\boldsymbol{K}}\big)\\
&\quad+\frac{\alpha^{2}}{m}\cdot\mathrm{Tr}\big({\boldsymbol{H}}^{-1}{\boldsymbol{K}}_{s}\big)+O\Big(\Big(\frac{1}{m\vee n}+D\alpha^{2}\Big)\Big(\frac{1}{(m\vee n)^{1/2}}+D\alpha\Big)\Big)\,.
\end{aligned} \tag{25}$$

(Here the big $O$ hides dependence on the constants in Assumptions 1 and 2.)

Remark B.1.

For economy of notation we stated Proposition B.1 in the case in which the excess risk is measured by using the same loss as for training, i.e. $\ell_{\mbox{\tiny\rm test}}=\ell$. However, the same result Eq. (25) applies with minor modifications to the case $\ell_{\mbox{\tiny\rm test}}\neq\ell$ (and thus, with $R$ replaced by $R^{\mbox{\tiny\rm test}}$), provided $R^{\mbox{\tiny\rm test}}$ is also twice differentiable with Lipschitz Hessian, and $\nabla R^{\mbox{\tiny\rm test}}({\boldsymbol{\theta}}_{*})={\boldsymbol{0}}$. In this case, (25) has to be modified replacing ${\boldsymbol{H}}^{-1}$ by ${\boldsymbol{H}}^{-1}\nabla^{2}R^{\mbox{\tiny\rm test}}({\boldsymbol{\theta}}_{*}){\boldsymbol{H}}^{-1}$.

Remark B.2.

The error terms in Eq. (25) are negligible under two conditions: $(i)$ $m$ and $n$ are large, which is the classical condition for low-dimensional asymptotics to hold; $(ii)$ $\|{\boldsymbol{g}}^{s}\|_{2}\,\alpha=\|\nabla R^{s}({\boldsymbol{\theta}}_{*})\|_{2}\,\alpha$ is small. In particular, the latter condition will hold in two cases. First, when $\|\nabla R^{s}({\boldsymbol{\theta}}_{*})\|_{2}$ is of order one (i.e. the distribution shift is large), but $\alpha$ is small (surrogate data are downweighted). Note that, when the distribution shift is large, and the sample size $n$ is large enough, we expect small $\alpha$ to be optimal and therefore Eq. (25) covers the ‘interesting’ regime.

Second, when $\|\nabla R^{s}({\boldsymbol{\theta}}_{*})\|_{2}$ is small (i.e. the shift is small) and $\alpha$ is of order one. If in addition we have $\nabla^{2}R^{s}({\boldsymbol{\theta}}_{*})\approx\nabla^{2}R({\boldsymbol{\theta}}_{*})$, it can be shown that the range of validity of Eq. (25) covers the whole interval $\alpha\in[0,1]$.

Remark B.3.

Note that the distribution shift is measured in Eq. (25) by the first term $\langle{\boldsymbol{g}}^{s},{\boldsymbol{H}}^{-1}{\boldsymbol{g}}^{s}\rangle$ . The original and surrogate distribution can be very different in other metrics (e.g. in total variation or transportation distance), but as long as ${\boldsymbol{g}}^{s}$ is small (as measured in the norm defined by ${\boldsymbol{H}}^{-1}$ ), surrogate data will reduce test error.

Note that, within the setting of Proposition B.1, the excess error of training only on original data is $R_{\mbox{\sf\tiny or}}^{\mbox{\sf\tiny ex}}(n):=R(\hat{\boldsymbol{\theta}}_{n,0}(0))-R({\boldsymbol{\theta}}_{*})=\mathrm{Tr}\big({\boldsymbol{H}}^{-1}{\boldsymbol{K}}\big)/n+o(1/n)$, while the excess error of training only on surrogate data is $R_{\mbox{\sf\tiny su}}^{\mbox{\sf\tiny ex}}(m):=R(\hat{\boldsymbol{\theta}}_{0,m}(1))-R({\boldsymbol{\theta}}_{*})=\langle{\boldsymbol{g}}^{s},{\boldsymbol{H}}^{-1}{\boldsymbol{g}}^{s}\rangle+\mathrm{Tr}\big({\boldsymbol{H}}^{-1}{\boldsymbol{K}}_{s}\big)/m+o(1/m)$, so that $R_{\mbox{\sf\tiny su}}^{\mbox{\sf\tiny ex}}(\infty)=\langle{\boldsymbol{g}}^{s},{\boldsymbol{H}}^{-1}{\boldsymbol{g}}^{s}\rangle$. Hence Eq. (25) can be recast in the form of our general scaling law (4), namely:

$$R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))-R\big({\boldsymbol{\theta}}_{*}\big)\approx\alpha^{2}R_{\mbox{\sf\tiny su}}^{\mbox{\sf\tiny ex}}(\infty)+\Big[\alpha^{2}\big(R_{\mbox{\sf\tiny su}}^{\mbox{\sf\tiny ex}}(m)-R_{\mbox{\sf\tiny su}}^{\mbox{\sf\tiny ex}}(\infty)\big)+(1-\alpha)^{2}R_{\mbox{\sf\tiny or}}^{\mbox{\sf\tiny ex}}(n)\Big]\,,$$

which (as expected) corresponds to the parametric scaling exponent $\beta=1$ .

An immediate consequence of Proposition B.1 is that surrogate data do not hurt, and will help if their distribution is close enough to the original one (under the assumption of optimally chosen $\alpha$ ).

Corollary B.2.

Under the assumptions of Proposition B.1, let $\overline{R}_{\mbox{\sf\tiny or}}(n):=\mathrm{Tr}\big({\boldsymbol{H}}^{-1}{\boldsymbol{K}}\big)/n$, and $\overline{R}_{\mbox{\sf\tiny su}}(m):=\langle{\boldsymbol{g}}^{s},{\boldsymbol{H}}^{-1}{\boldsymbol{g}}^{s}\rangle+\mathrm{Tr}\big({\boldsymbol{H}}^{-1}{\boldsymbol{K}}_{s}\big)/m$. For $\alpha^{*}_{n,m}=\overline{R}_{\mbox{\sf\tiny or}}(n)/(\overline{R}_{\mbox{\sf\tiny su}}(m)+\overline{R}_{\mbox{\sf\tiny or}}(n))$, we have

$$R\big(\hat{\boldsymbol{\theta}}_{n,m}(\alpha^{*}_{n,m})\big)-R_{*}=\;\big(\overline{R}_{\mbox{\sf\tiny or}}(n)^{-1}+\overline{R}_{\mbox{\sf\tiny su}}(m)^{-1}\big)^{-1}+\Delta_{n,m}\,,$$

with $\Delta_{n,m}$ of the same order as the error in Prop. B.1.
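The choice of $\alpha^{*}_{n,m}$ can be checked by a short calculation, sketched here with the error term of Eq. (25) dropped and with the shorthand $S:=\overline{R}_{\mbox{\sf\tiny su}}(m)$, $O:=\overline{R}_{\mbox{\sf\tiny or}}(n)$ (introduced only for this computation). The leading-order excess risk is the quadratic $F(\alpha)=\alpha^{2}S+(1-\alpha)^{2}O$, and

$$\begin{aligned}
F'(\alpha)&=2\alpha S-2(1-\alpha)O=0\quad\Longrightarrow\quad\alpha^{*}=\frac{O}{S+O}\,,\\
F(\alpha^{*})&=\frac{O^{2}S}{(S+O)^{2}}+\frac{S^{2}O}{(S+O)^{2}}=\frac{SO}{S+O}=\big(O^{-1}+S^{-1}\big)^{-1}\,.
\end{aligned}$$

Since $\big(O^{-1}+S^{-1}\big)^{-1}<\min(O,S)$ whenever both are positive, the leading term is smaller than the excess risk of training on either dataset alone. This reading applies when $\alpha^{*}_{n,m}$ falls in the range $[0,\alpha_{\max}]$ covered by Proposition B.1.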

B.2 Proofs

Lemma B.3.

Under the assumptions of Proposition B.1 (Assumption 1 and Assumption 2) there exists $\alpha_{\max}\in(0,1]$, depending only on the constants appearing there, such that the following holds:

$(i)$

The function ${\boldsymbol{\theta}}\mapsto R({\boldsymbol{\theta}};\alpha):=(1-\alpha)\,R({\boldsymbol{\theta}})+\alpha\,R^{s}({\boldsymbol{\theta}})$ has a unique minimizer ${\boldsymbol{\theta}}_{*}(\alpha)\in\mathbb{R}^{d}$. Further ${\boldsymbol{\theta}}_{*}(\alpha)\in{\sf B}({\boldsymbol{\theta}}_{*},r)$, and ${\boldsymbol{\theta}}_{*}(\alpha)\to{\boldsymbol{\theta}}_{*}$ as $\alpha\downarrow 0$.
$(ii)$

We have $\hat{\boldsymbol{\theta}}_{n,m}(\alpha)\to{\boldsymbol{\theta}}_{*}$ in probability as $n,m\to\infty$ .

Proof.

Fix $r_{0}\in(0,r]$. By Assumption 2.$(a)$, $\inf_{{\boldsymbol{\theta}}\not\in{\sf B}({\boldsymbol{\theta}}_{*};r_{0})}R({\boldsymbol{\theta}})>R({\boldsymbol{\theta}}_{*})+\delta_{0}$ for some constant $\delta_{0}>0$. Hence, using Assumption 1, for any ${\boldsymbol{\theta}}\not\in{\sf B}({\boldsymbol{\theta}}_{*};r_{0})$

	$\displaystyle R({\boldsymbol{\theta}};\alpha)$	$\displaystyle\geq R({\boldsymbol{\theta}})-K_{*}\alpha\big{[}1+R({\boldsymbol{\theta}})\big{]}$
		$\displaystyle\geq(1-K_{*}\alpha)R({\boldsymbol{\theta}})-K_{*}\alpha$
		$\displaystyle\geq(1-K_{*}\alpha)(R({\boldsymbol{\theta}}_{*})+\delta_{0})-K_{*}\alpha\,.$

On the other hand, $R({\boldsymbol{\theta}}_{*};\alpha)\leq(1+K_{*}\alpha)R({\boldsymbol{\theta}}_{*})+K_{*}\alpha$ , whence

	$\displaystyle R({\boldsymbol{\theta}};\alpha)-R({\boldsymbol{\theta}}_{*};\alpha)$	$\displaystyle\geq(1-K_{*}\alpha)\delta_{0}-2K_{*}\alpha R({\boldsymbol{\theta}}_{*})$
		$\displaystyle-2K_{*}\alpha\,,$

which is strictly positive for $\alpha<\alpha_{\max}(r_{0}):=\delta_{0}/\big{(}4K_{*}(1+R({\boldsymbol{\theta}}_{*}))\big{)}$ . Hence the minimum must be achieved in ${\sf B}({\boldsymbol{\theta}}_{*};r_{0})$ (note that since $R({\boldsymbol{\theta}})$ , $R_{s}({\boldsymbol{\theta}})$ are lower semicontinuous, the minimum is achieved).

By Assumption 2. $(d)$ , for $r_{0}$ sufficiently small, ${\boldsymbol{\theta}}\mapsto R({\boldsymbol{\theta}};\alpha)$ is strictly convex in ${\sf B}({\boldsymbol{\theta}}_{*};r_{0})$ and therefore the minimizer is unique. This proves point $(i)$ .

Point $(ii)$ follows from a modification of Theorem 5.14 in [vdV00]. Namely, for a diverging sequence $\{(n(k),m(k)):k\in\mathbb{N}\}$ , we consider $\widehat{R}_{*,k}({\boldsymbol{u}}):=\widehat{R}_{n(k),m(k)}(c({\boldsymbol{u}}){\boldsymbol{u}};\alpha)$ , where $c({\boldsymbol{u}}):=(1+\|{\boldsymbol{u}}\|^{2})^{-1/2}$ . This function is lower semicontinuous on the compact set ${\sf B}({\boldsymbol{0}};1)$ and converges almost surely to its expectation for every fixed ${\boldsymbol{u}}$ in this set, and hence the argument of Theorem 5.14 in [vdV00] applies here. ∎

Proof of Proposition B.1.

By a modification of Theorem 5.39 in [vdV00] (here ${\boldsymbol{\theta}}_{*}(\alpha)$ is defined as in Lemma B.3)

	$\displaystyle\hat{\boldsymbol{\theta}}_{n,m}(\alpha)=$	$\displaystyle\;{\boldsymbol{\theta}}_{*}(\alpha)+\frac{1-\alpha}{n}{\boldsymbol{H}}(\alpha)^{-1}\sum_{i=1}^{n}\big{[}\nabla\ell({\boldsymbol{\theta}}_{*}(\alpha);{\boldsymbol{z}}_{i})-{\mathbb{E}}\nabla\ell({\boldsymbol{\theta}};{\boldsymbol{z}})\big{]}$		(26)
		$\displaystyle\;\;+\frac{\alpha}{m}{\boldsymbol{H}}(\alpha)^{-1}\sum_{i=1}^{m}% \big{[}\nabla\ell({\boldsymbol{\theta}}_{*}(\alpha);{\boldsymbol{z}}^{c}_{i})-% {\mathbb{E}}_{s}\nabla\ell({\boldsymbol{\theta}};{\boldsymbol{z}})\big{]}+O_{P% }(m^{-1}+n^{-1})\,,$		(27)

where ${\boldsymbol{H}}(\alpha):=(1-\alpha)\nabla^{2}R({\boldsymbol{\theta}}_{*}(% \alpha))+\alpha\nabla^{2}R_{s}({\boldsymbol{\theta}}_{*}(\alpha))$ . Note that in the present setting the error is of order $m^{-1}+n^{-1}$ because we assume the Hessian to be Lipschitz continuous.

The population minimizer ${\boldsymbol{\theta}}_{*}(\alpha)$ solves

	$\displaystyle{\boldsymbol{0}}$	$\displaystyle=\nabla R({\boldsymbol{\theta}}_{*}(\alpha);\alpha)$
		$\displaystyle=\nabla R({\boldsymbol{\theta}}_{*};\alpha)+\nabla^{2}R({\boldsymbol{\theta}}_{*};\alpha)({\boldsymbol{\theta}}_{*}(\alpha)-{\boldsymbol{\theta}}_{*})+\int_{0}^{1}\big{[}\nabla^{2}R({\boldsymbol{\theta}}_{t};\alpha)-\nabla^{2}R({\boldsymbol{\theta}}_{*};\alpha)\big{]}({\boldsymbol{\theta}}_{*}(\alpha)-{\boldsymbol{\theta}}_{*})\,{\rm d}t\,,$

where ${\boldsymbol{\theta}}_{t}=t\,{\boldsymbol{\theta}}_{*}(\alpha)+(1-t)\,{% \boldsymbol{\theta}}_{*}$ . Denoting by $L_{2}$ the Lipschitz constant of the Hessian (in operator norm), and recalling that $\nabla R({\boldsymbol{\theta}}_{*})={\boldsymbol{0}}$ , we have

	$\displaystyle\nabla^{2}R({\boldsymbol{\theta}}_{*};\alpha)({\boldsymbol{\theta}}_{*}(\alpha)-{\boldsymbol{\theta}}_{*})$	$\displaystyle=-\alpha\nabla R_{s}({\boldsymbol{\theta}}_{*})+{\boldsymbol{u}}\,,$
	$\displaystyle\|{\boldsymbol{u}}\|_{2}$	$\displaystyle\leq L_{2}\|{\boldsymbol{\theta}}_{*}(\alpha)-{\boldsymbol{\theta}}_{*}\|^{2}\,.$

Recalling that, by Lemma B.3, ${\boldsymbol{\theta}}_{*}(\alpha)\to{\boldsymbol{\theta}}_{*}$ as $\alpha\to 0$ , this implies

\displaystyle{\boldsymbol{\theta}}_{*}(\alpha)-{\boldsymbol{\theta}}_{*}=-{\boldsymbol{H}}^{-1}\nabla R_{s}({\boldsymbol{\theta}}_{*})\alpha+O\big{(}\big{(}\|\nabla R_{s}({\boldsymbol{\theta}}_{*})\|_{2}\vee\|\nabla R_{s}({\boldsymbol{\theta}}_{*})\|_{2}^{2}\big{)}\alpha^{2}\big{)}\,.

(28)

Substituting in Eq. (26), we get

$\displaystyle\hat{\boldsymbol{\theta}}_{n,m}(\alpha)-{\boldsymbol{\theta}}_{*}=$	$\displaystyle-{\boldsymbol{H}}^{-1}\nabla R_{s}({\boldsymbol{\theta}}_{*})\alpha+\frac{1-\alpha}{n}{\boldsymbol{H}}(\alpha)^{-1}\sum_{i=1}^{n}\big{[}\nabla\ell({\boldsymbol{\theta}}_{*}(\alpha);{\boldsymbol{z}}_{i})-{\mathbb{E}}\nabla\ell({\boldsymbol{\theta}};{\boldsymbol{z}})\big{]}$	(29)
	$\displaystyle\;\;+\frac{\alpha}{m}{\boldsymbol{H}}(\alpha)^{-1}\sum_{i=1}^{m}\big{[}\nabla\ell({\boldsymbol{\theta}}_{*}(\alpha);{\boldsymbol{z}}^{c}_{i})-{\mathbb{E}}_{s}\nabla\ell({\boldsymbol{\theta}};{\boldsymbol{z}})\big{]}+{\boldsymbol{\Delta}}\,,$	(30)
$\displaystyle\|{\boldsymbol{\Delta}}\|\leq$	$\displaystyle C\Big{(}\|\nabla R_{s}({\boldsymbol{\theta}}_{*})\|_{2}\vee\|\nabla R_{s}({\boldsymbol{\theta}}_{*})\|_{2}^{2}\Big{)}\alpha^{2}+\frac{C}{m\wedge n}\,.$	(31)

The claim follows by substituting the above in

\displaystyle{\mathbb{E}}R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))-R({\boldsymbol{\theta}}_{*})={\mathbb{E}}\langle\hat{\boldsymbol{\theta}}_{n,m}(\alpha)-{\boldsymbol{\theta}}_{*},{\boldsymbol{H}}(\hat{\boldsymbol{\theta}}_{n,m}(\alpha)-{\boldsymbol{\theta}}_{*})\rangle+O\Big{(}{\mathbb{E}}\|\hat{\boldsymbol{\theta}}_{n,m}(\alpha)-{\boldsymbol{\theta}}_{*}\|^{3}\Big{)}

(32)

and using ${\boldsymbol{H}}(\alpha)={\boldsymbol{H}}+O(\alpha)$ . ∎
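The first-order expansion in Eq. (28) can be illustrated on a toy problem. The sketch below assumes quadratic population risks $R({\boldsymbol{\theta}})=\frac{1}{2}\langle{\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{*},{\boldsymbol{H}}({\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{*})\rangle$ and $R_{s}({\boldsymbol{\theta}})=\frac{1}{2}\langle{\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{s},{\boldsymbol{H}}_{s}({\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{s})\rangle$ , for which the minimizer ${\boldsymbol{\theta}}_{*}(\alpha)$ is available in closed form. The matrices and vectors are hypothetical and chosen only for illustration.

```python
import numpy as np

# Hypothetical quadratic risks (not from the paper):
#   R(theta)   = 0.5 (theta - theta_star)^T H   (theta - theta_star)
#   R_s(theta) = 0.5 (theta - theta_s)^T    H_s (theta - theta_s)
H = np.array([[2.0, 0.0], [0.0, 1.0]])
H_s = np.array([[1.0, 0.5], [0.5, 3.0]])
theta_star = np.zeros(2)
theta_s = np.array([1.0, -1.0])

def mixed_minimizer(alpha):
    """Exact minimizer of (1-alpha) R + alpha R_s for the quadratic toy model."""
    lhs = (1.0 - alpha) * H + alpha * H_s
    rhs = (1.0 - alpha) * H @ theta_star + alpha * H_s @ theta_s
    return np.linalg.solve(lhs, rhs)

def first_order_shift(alpha):
    """Leading term of Eq. (28): -alpha * H^{-1} grad R_s(theta_star)."""
    grad_rs = H_s @ (theta_star - theta_s)
    return -alpha * np.linalg.solve(H, grad_rs)

def expansion_error(alpha):
    """Norm of the remainder theta_*(alpha) - theta_* - (first-order term)."""
    return np.linalg.norm(mixed_minimizer(alpha) - theta_star - first_order_shift(alpha))

# Halving alpha should divide the remainder by roughly 4 if it is O(alpha^2).
ratio = expansion_error(0.01) / expansion_error(0.005)
print("remainder ratio:", ratio)
```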

Appendix C Gaussian sequence model: Proofs for Section 3.1

C.1 General ridge regression

We define $\hat{\boldsymbol{\Sigma}}={\boldsymbol{X}}^{{\sf T}}{\boldsymbol{X}}/n$ , $\hat{\boldsymbol{\Sigma}}_{s}={\boldsymbol{X}}^{{\sf T}}_{s}{\boldsymbol{X}}_{% s}/m$ , and $\hat{\boldsymbol{\Sigma}}_{\alpha}=(1-\alpha)\hat{\boldsymbol{\Sigma}}+\alpha% \hat{\boldsymbol{\Sigma}}_{s}$ . We then have

$\displaystyle R_{n,m}(\alpha,\lambda)=$	$\displaystyle B_{n,m}(\alpha,\lambda)+\frac{(1-\alpha)^{2}\sigma^{2}}{n}\cdot V_{n,m}(\alpha,\lambda)+\frac{\alpha^{2}\sigma_{s}^{2}}{m}\cdot V^{s}_{n,m}(\alpha,\lambda)\,,$	(33)
$\displaystyle B_{n,m}(\alpha,\lambda)$	$\displaystyle:=\Big{\|}{\boldsymbol{\Sigma}}^{1/2}({\boldsymbol{\Omega}}+\hat{\boldsymbol{\Sigma}}_{\alpha})^{-1}\big{(}{\boldsymbol{\Omega}}{\boldsymbol{\theta}}_{*}-\alpha\hat{\boldsymbol{\Sigma}}_{s}({\boldsymbol{\theta}}_{*}^{s}-{\boldsymbol{\theta}}_{*})\big{)}\Big{\|}^{2}\,,$	(34)
$\displaystyle V_{n,m}(\alpha,\lambda)$	$\displaystyle:=\mathrm{Tr}\Big{(}({\boldsymbol{\Omega}}+\hat{\boldsymbol{% \Sigma}}_{\alpha})^{-1}\hat{\boldsymbol{\Sigma}}({\boldsymbol{\Omega}}+\hat{% \boldsymbol{\Sigma}}_{\alpha})^{-1}{\boldsymbol{\Sigma}}\Big{)}\,,$	(35)
$\displaystyle V^{s}_{n,m}(\alpha,\lambda)$	$\displaystyle:=\mathrm{Tr}\Big{(}({\boldsymbol{\Omega}}+\hat{\boldsymbol{% \Sigma}}_{\alpha})^{-1}\hat{\boldsymbol{\Sigma}}_{s}({\boldsymbol{\Omega}}+% \hat{\boldsymbol{\Sigma}}_{\alpha})^{-1}{\boldsymbol{\Sigma}}\Big{)}$	(36)

C.2 Proof of Theorem 1

Without loss of generality, we can assume ${\boldsymbol{\Omega}}={\rm diag}((\omega_{k})_{k\geq 1})$ with $\omega_{k}$ non-decreasing. A simple calculation gives the following general expression for the test error:

$\displaystyle R_{n,m}(\alpha,\lambda)=$	$\displaystyle B_{n,m}(\alpha,\lambda)+s_{n,m}(\alpha)\cdot V_{n,m}(\alpha,% \lambda)\,,$	(37)
$\displaystyle B_{n,m}(\alpha,\lambda)$	$\displaystyle:=\sum_{k=1}^{\infty}\Big{(}\frac{1}{1+\lambda\omega_{k}}\Big{)}^{2}\big{[}(\alpha+\lambda\omega_{k})\theta_{*,k}-\alpha\theta_{*,k}^{s}\big{]}^{2}\,,$	(38)
$\displaystyle V_{n,m}(\alpha,\lambda)$	$\displaystyle:=\sum_{k=1}^{\infty}\Big{(}\frac{1}{1+\lambda\omega_{k}}\Big{)}^% {2}\,,$	(39)
$\displaystyle s_{n,m}(\alpha)$	$\displaystyle:=(1-\alpha)^{2}\frac{\sigma^{2}}{n}+\alpha^{2}\frac{\sigma^{2}_{% s}}{m}\,.$	(40)
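For concreteness, Eqs. (37)–(40) can be evaluated directly on a finite truncation of the sequence. The sketch below is only illustrative: the truncation level `K`, the power-law choices for $\omega_{k}$ , $\theta_{*,k}$ , $\theta^{s}_{*,k}$ , and all numerical values are assumptions made here, not quantities from the paper.

```python
import numpy as np

def sequence_model_risk(theta, theta_s, omega, alpha, lam, sigma, sigma_s, n, m):
    """Evaluate Eqs. (37)-(40) on a finite truncation of the sequence.

    Returns (risk, bias, variance_factor, noise_level).
    """
    shrink = 1.0 / (1.0 + lam * omega) ** 2
    # Eq. (38): bias term B_{n,m}.
    bias = np.sum(shrink * ((alpha + lam * omega) * theta - alpha * theta_s) ** 2)
    # Eq. (39): variance factor V_{n,m}.
    variance = np.sum(shrink)
    # Eq. (40): effective noise level s_{n,m}(alpha).
    noise = (1.0 - alpha) ** 2 * sigma ** 2 / n + alpha ** 2 * sigma_s ** 2 / m
    # Eq. (37): total risk.
    return bias + noise * variance, bias, variance, noise

# Hypothetical power-law instance, truncated at K coordinates.
K = 2000
k = np.arange(1, K + 1, dtype=float)
omega = k ** 1.0
theta = k ** -1.5
theta_s = theta + 0.1 * k ** -1.5

risk, bias, variance, noise = sequence_model_risk(
    theta, theta_s, omega, alpha=0.3, lam=1e-2, sigma=1.0, sigma_s=1.0, n=500, m=5000
)
print("risk:", risk, " bias:", bias, " V:", variance, " s:", noise)
```

With $\alpha=0$ and $\lambda=0$ the bias vanishes and the risk reduces to $K\sigma^{2}/n$ , the risk of the unregularized estimator on $K$ coordinates.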

We define (with $k_{1}=0$ if the condition is never verified)

\displaystyle k_{1}

\displaystyle:=\max\big{\{}k:\,\lambda\omega_{k}\leq 1\big{\}}\,.

(41)

Note that

	$\displaystyle 0<k\leq k_{1}$	$\displaystyle\;\;\Rightarrow\;\;0<\lambda\omega_{k}\leq 1\,,$		(42)
	$\displaystyle k_{1}<k$	$\displaystyle\;\;\Rightarrow\;\;1<\lambda\omega_{k}\,.$		(43)

We now estimate the various sums by splitting them according to the value of $k$ :

	$\displaystyle B_{n,m}$	$\displaystyle\leq\sum_{k=1}^{k_{1}}\big{[}(\alpha+\lambda\omega_{k})\theta_{*,k}-\alpha\theta_{*,k}^{s}\big{]}^{2}+\sum_{k=k_{1}+1}^{\infty}\frac{1}{(\lambda\omega_{k})^{2}}\big{[}(\alpha+\lambda\omega_{k})\theta_{*,k}-\alpha\theta_{*,k}^{s}\big{]}^{2}$
		$\displaystyle\leq\sum_{k=1}^{k_{1}}\big{[}\alpha^{2}(\theta_{*,k}-\theta_{*,k}^{s})^{2}+2\alpha(\theta_{*,k}-\theta_{*,k}^{s})\lambda\omega_{k}\theta_{*,k}+(\lambda\omega_{k})^{2}\theta^{2}_{*,k}\big{]}$
		$\displaystyle\phantom{AA}+\sum_{k=k_{1}+1}^{\infty}\Big{[}\frac{\alpha^{2}}{(\lambda\omega_{k})^{2}}(\theta_{*,k}-\theta_{*,k}^{s})^{2}+\frac{2\alpha}{\lambda\omega_{k}}(\theta_{*,k}-\theta_{*,k}^{s})\theta_{*,k}+\theta^{2}_{*,k}\Big{]}$
		$\displaystyle\leq\alpha^{2}\|{\boldsymbol{\theta}}_{*,\leq k_{1}}-{\boldsymbol{\theta}}^{s}_{*,\leq k_{1}}\|^{2}+\frac{2\alpha}{\omega_{k_{1}}}\big{|}\langle{\boldsymbol{\theta}}_{*,\leq k_{1}}-{\boldsymbol{\theta}}^{s}_{*,\leq k_{1}},{\boldsymbol{\theta}}_{*,\leq k_{1}}\rangle_{{\boldsymbol{\Omega}}}\big{|}+\frac{1}{\omega_{k_{1}}^{2}}\|{\boldsymbol{\theta}}_{*,\leq k_{1}}\|_{{\boldsymbol{\Omega}}^{2}}^{2}$
		$\displaystyle\phantom{AA}+\alpha^{2}\omega_{k_{1}+1}^{2}\|{\boldsymbol{\theta}}_{*,>k_{1}}-{\boldsymbol{\theta}}^{s}_{*,>k_{1}}\|_{{\boldsymbol{\Omega}}^{-2}}^{2}+2\alpha\omega_{k_{1}+1}\big{|}\langle{\boldsymbol{\theta}}_{*,>k_{1}}-{\boldsymbol{\theta}}^{s}_{*,>k_{1}},{\boldsymbol{\theta}}_{*,>k_{1}}\rangle_{{\boldsymbol{\Omega}}^{-1}}\big{|}+\|{\boldsymbol{\theta}}_{*,>k_{1}}\|^{2}\,,$

and

\displaystyle V_{n,m}\leq k_{1}+\sum_{k>k_{1}}\frac{\omega_{k_{1}+1}^{2}}{% \omega_{k}^{2}}\leq(k_{1}+c_{\#})\,,

since under the assumption $\omega_{k}\asymp k^{\mu}$ , $\mu>1/2$ , we have $\sum_{k>k_{1}}(\omega_{k_{1}+1}/\omega_{k})^{2}\leq c_{\#}$ .

Recalling the definitions in the theorem, and letting

\displaystyle\delta_{k}

\displaystyle:=\max\Big{(}\omega_{k+1}\big{|}\langle{\boldsymbol{\theta}}_{*,>% k}-{\boldsymbol{\theta}}^{s}_{*,>k},{\boldsymbol{\theta}}_{*,>k}\rangle_{{% \boldsymbol{\Omega}}^{-1}}\big{|};\;\omega_{k+1}^{2}\|{\boldsymbol{\theta}}_{*% ,>k}-{\boldsymbol{\theta}}^{s}_{*,>k}\|_{{\boldsymbol{\Omega}}^{-2}}^{2}\Big{)% }\,,

we have

\displaystyle B_{n,m}\leq\alpha^{2}\|{\boldsymbol{\theta}}_{*}-{\boldsymbol{\theta}}^{s}_{*}\|^{2}+\|{\boldsymbol{\theta}}_{*,>k_{1}}\|^{2}+\frac{1}{\omega_{k_{1}}^{2}}\|{\boldsymbol{\theta}}_{*,\leq k_{1}}\|_{{\boldsymbol{\Omega}}^{2}}^{2}+3\delta_{k_{1}}+2\Delta_{k_{1}}\,,

whence

	$\displaystyle R_{n,m}(\alpha,\lambda)$	$\displaystyle\leq\alpha^{2}\|{\boldsymbol{\theta}}_{*}-{\boldsymbol{\theta}}^{s}_{*}\|^{2}+\|{\boldsymbol{\theta}}_{*,>k_{1}}\|^{2}+\frac{1}{\omega_{k_{1}}^{2}}\|{\boldsymbol{\theta}}_{*,\leq k_{1}}\|_{{\boldsymbol{\Omega}}^{2}}^{2}+(k_{1}+c_{\#})\cdot s_{n,m}(\alpha)+3\delta_{k_{1}}+2\Delta_{k_{1}}$
		$\displaystyle=\alpha^{2}R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(\infty)+\|{\boldsymbol{\theta}}_{*,>k_{1}}\|^{2}+\frac{1}{\omega_{k_{1}}^{2}}\|{\boldsymbol{\theta}}_{*,\leq k_{1}}\|_{{\boldsymbol{\Omega}}^{2}}^{2}+(k_{1}+c_{\#})\cdot s_{n,m}(\alpha)+3\delta_{k_{1}}+2\Delta_{k_{1}}\,.$

Next we specialize to the case $\|{\boldsymbol{\theta}}_{*,>k}\|^{2}\leq C_{\theta}k^{-2\rho}$ , $\omega_{k}\asymp k^{\mu}$ , $\mu\neq\rho$ . In this case we have $\omega_{k}^{-2}\|{\boldsymbol{\theta}}_{*,\leq k}\|_{{\boldsymbol{\Omega}}^{2}}^{2}\leq Ck^{-2(\mu\wedge\rho)}$ , and therefore, by suitably adjusting the constant $C$ ,

\displaystyle R_{n,m}(\alpha,\lambda)

\displaystyle\leq\alpha^{2}R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(\infty)% +Ck_{1}^{-2(\mu\wedge\rho)}+(k_{1}+c_{\#})\cdot s_{n,m}(\alpha)+3\delta_{k_{1}% }+2\Delta_{k_{1}}\,.

We now bound $\delta_{k}$ . By Cauchy-Schwarz and monotonicity of $\omega$ ,

\displaystyle\omega_{k+1}\big{|}\langle{\boldsymbol{\theta}}_{*,>k}-{% \boldsymbol{\theta}}^{s}_{*,>k},{\boldsymbol{\theta}}_{*,>k}\rangle_{{% \boldsymbol{\Omega}}^{-1}}\big{|}\leq\|{\boldsymbol{\theta}}_{*,>k}-{% \boldsymbol{\theta}}^{s}_{*,>k}\|_{2}\|{\boldsymbol{\theta}}_{*,>k}\|_{2}\leq 2% C_{\theta}k^{-2\rho}\,,

and further

\displaystyle\omega_{k+1}^{2}\|{\boldsymbol{\theta}}_{*,>k}-{\boldsymbol{% \theta}}^{s}_{*,>k}\|_{{\boldsymbol{\Omega}}^{-2}}^{2}\leq 2\|{\boldsymbol{% \theta}}_{*,>k}\|^{2}+2\|{\boldsymbol{\theta}}^{s}_{*,>k}\|^{2}\leq 4C_{\theta% }k^{-2\rho}\,.

(44)

Therefore,

\displaystyle R_{n,m}(\alpha,\lambda)

\displaystyle\leq\alpha^{2}R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(\infty)% +Ck_{1}^{-2(\mu\wedge\rho)}+(k_{1}+c_{\#})\cdot s_{n,m}(\alpha)+2\Delta_{k_{1}% }\,.

Proof of claim $(a)$ . The stated assumptions on $\Delta_{k}$ imply that (possibly after adjusting the constant $C$ ):

\displaystyle R_{n,m}(\alpha,\lambda)

\displaystyle\leq\alpha^{2}R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(\infty)% +Ck_{1}^{-2(\mu\wedge\rho)}+(k_{1}+c_{\#})\cdot s_{n,m}(\alpha)\,.

We now select $\lambda_{*}(\alpha)$ so that $k_{1}\asymp s_{n,m}(\alpha)^{-1+\beta}$ , where $\beta=2(\mu\wedge\rho)/(1+2(\mu\wedge\rho))$ (this is possible for all $n,m$ large enough under the assumption on $\omega_{k}$ ). A straightforward calculation yields:

\displaystyle R_{n,m}(\alpha,\lambda_{*}(\alpha))

\displaystyle\leq\alpha^{2}R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(\infty)% +C\cdot s_{n,m}(\alpha)^{\beta}\,,

which proves claim $(a)$ .
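The choice of $k_{1}$ above balances the two $k_{1}$ -dependent terms of the bound. The following sketch checks this numerically with all constants set to one and the additive constant $c_{\#}$ dropped (both simplifications are assumptions made here for illustration); `nu` stands for $\mu\wedge\rho$ and its value is arbitrary.

```python
def balanced_cutoff(s, nu):
    """Return k_1 = s^{-(1-beta)} and beta = 2 nu / (1 + 2 nu)."""
    beta = 2.0 * nu / (1.0 + 2.0 * nu)
    return s ** (-(1.0 - beta)), beta

def bound_terms(k1, s, nu):
    """The two k_1-dependent terms of the bound, with constants set to 1."""
    approximation_term = k1 ** (-2.0 * nu)   # stands for C k_1^{-2 (mu ∧ rho)}
    variance_term = k1 * s                   # stands for k_1 * s_{n,m}(alpha)
    return approximation_term, variance_term

nu = 0.75  # illustrative value of mu ∧ rho
for s in (1e-2, 1e-3, 1e-4):
    k1, beta = balanced_cutoff(s, nu)
    approximation_term, variance_term = bound_terms(k1, s, nu)
    # Both terms should coincide with s^beta at the balanced cutoff.
    print(s, k1, approximation_term, variance_term, s ** beta)
```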

Proof of Claim $(b)$ . We choose $\omega_{k}=k^{\mu}$ , $\theta_{*,k}=k^{-\rho^{\prime}-1/2}$ , $\theta_{*,k}^{s}=\theta_{*,k}+a_{k}k^{-\rho-1/2}$ , with $a_{k}\sim{\sf Unif}(\{-A,+A\})$ . We will choose $A\leq 1$ a sufficiently small numerical constant. Note that, for $\mu>2\rho+1/2$

\displaystyle\Delta_{k}

\displaystyle=k^{-\mu}\left|\sum_{\ell=1}^{k}a_{\ell}\ell^{\mu-2\rho-1}\right|% \leq CAk^{-\mu+{\varepsilon}}\left|\sum_{\ell=1}^{k}\ell^{2\mu-4\rho-2}\right|% ^{1/2}\leq CAk^{-2\rho-1/2+{\varepsilon}^{\prime}}\,,

where, for any ${\varepsilon}>0$ , the first inequality holds with probability at least $1/2$ for all $k>k_{0}({\varepsilon})$ . We can therefore select the $a_{\ell}$ , so that $\Delta_{k}\leq C^{\prime\prime}Ak^{-2\rho-{\varepsilon}}$ for some $C^{\prime\prime}<\infty$ .

Following the calculation at point $(a)$ , we decompose the bias term as

	$\displaystyle B_{n,m}$	$\displaystyle=\sum_{k=1}^{\infty}\Big{(}\frac{1}{1+\lambda\omega_{k}}\Big{)}^{2}\big{[}\alpha^{2}(\theta_{*,k}-\theta_{*,k}^{s})^{2}+(\lambda\omega_{k})^{2}\theta^{2}_{*,k}\big{]}+2\alpha E_{n,m}\,,$
	$\displaystyle E_{n,m}$	$\displaystyle:=\sum_{k=1}^{\infty}\Big{(}\frac{1}{1+\lambda\omega_{k}}\Big{)}^{2}(\theta_{*,k}-\theta_{*,k}^{s})\lambda\omega_{k}\theta_{*,k}\,.$

Note that $|E_{n,m}|\leq\delta_{k_{1}}+\Delta_{k_{1}}\leq CAk_{1}^{-2(\mu\wedge\rho)}$ . Therefore

	$\displaystyle B_{n,m}-\alpha^{2}\|{\boldsymbol{\theta}}_{*}-{\boldsymbol{\theta}}_{*}^{s}\|^{2}$
	$\displaystyle\geq\sum_{k=1}^{\infty}\Big{(}\frac{\lambda\omega_{k}}{1+\lambda\omega_{k}}\Big{)}^{2}\theta^{2}_{*,k}-\alpha^{2}\sum_{k=1}^{\infty}\left[1-\Big{(}\frac{1}{1+\lambda\omega_{k}}\Big{)}^{2}\right](\theta_{*,k}-\theta_{*,k}^{s})^{2}-CAk_{1}^{-2(\mu\wedge\rho)}$
	$\displaystyle\geq\frac{1}{4\omega^{2}_{k_{1}+1}}\|{\boldsymbol{\theta}}_{*,\leq k_{1}}\|^{2}_{{\boldsymbol{\Omega}}^{2}}+\frac{1}{4}\|{\boldsymbol{\theta}}_{*,>k_{1}}\|^{2}-\frac{A}{4\omega_{k_{1}+1}}\|{\boldsymbol{\theta}}_{*,\leq k_{1}}\|^{2}_{{\boldsymbol{\Omega}}}-\frac{A}{4}\|{\boldsymbol{\theta}}_{*,>k_{1}}\|^{2}-CAk_{1}^{-2(\mu\wedge\rho)}$
	$\displaystyle\geq C\,k_{1}^{-2(\mu\wedge\rho)}\,.$

By a similar calculation, we also obtain

\displaystyle V_{n,m}\geq C\,k_{1}\,,

and therefore

\displaystyle R_{n,m}(\alpha,\lambda)

\displaystyle\geq\alpha^{2}R^{\mbox{\sf\tiny ex}}_{\mbox{\sf\tiny su}}(\infty)% +Ck_{1}^{-2(\mu\wedge\rho)}+Ck_{1}\cdot s_{n,m}(\alpha)\,.

The proof is completed by minimizing over $k_{1}$ .

Appendix D Analysis of the nonparametric model: Proofs for Section 3.2

This appendix is devoted to proving Theorem 2. Recall that this is established within the white noise model of Eq. (14), which we copy here for the readers’ convenience

\displaystyle{\rm d}Y=f_{*}({\boldsymbol{x}})\,{\rm d}{\boldsymbol{x}}+\frac{% \sigma}{\sqrt{n}}{\rm d}B({\boldsymbol{x}})\,,

(45)

The adaptation of the estimator (13) to this continuous setting is given explicitly below

\displaystyle\hat{f}_{n,m,\alpha}

\displaystyle=\arg\min_{f}\Big{\{}(1-\alpha)\|Y-f\big{\|}_{2}^{2}+\alpha\|Y_{s% }-f\big{\|}_{2}^{2}+\lambda\|f\|_{p,2}^{2}\Big{\}}\,.

(46)

The proof of Theorem 2 is based on a reduction to a suitable ‘sequence model’ via the Fourier transform, defined as

\displaystyle\theta({\boldsymbol{q}}):=\int_{[0,1]^{d}}f({\boldsymbol{x}})\,e^% {-\iota\langle{\boldsymbol{q}},{\boldsymbol{x}}\rangle}\,{\rm d}{\boldsymbol{x% }}\,,

(47)

for ${\boldsymbol{q}}\in{\mathcal{Q}}_{d}:=\{2\pi{\boldsymbol{q}}\;:\;{\boldsymbol{% q}}\in{\mathbb{Z}}^{d}\}$ , where $\iota=\sqrt{-1}$ . The inverse Fourier transform is defined as

\displaystyle f({\boldsymbol{x}})=\frac{1}{(2\pi)^{d}}\sum_{{\boldsymbol{q}}% \in{\mathcal{Q}}_{d}}\theta({\boldsymbol{q}})\,e^{\iota\langle{\boldsymbol{q}}% ,{\boldsymbol{x}}\rangle}\,.

(48)

We let $\theta_{*}$ , $\theta_{*,s}$ , and $\hat{\theta}_{\lambda,p,n,m,\alpha}$ respectively denote the Fourier transform of $f_{*}$ , $f_{*,s}$ , and $\hat{f}_{\lambda,p,n,m,\alpha}$ .

The Fourier transforms of the observations are given by

\displaystyle\hat{Y}({\boldsymbol{q}})=\theta_{*}({\boldsymbol{q}})+\frac{% \sigma}{\sqrt{n}}\,G({\boldsymbol{q}})\,,\;\;\;\;\;\;\hat{Y}_{s}({\boldsymbol{% q}})=\theta_{*,s}({\boldsymbol{q}})+\frac{\sigma_{s}}{\sqrt{m}}\,G_{s}({% \boldsymbol{q}})\,,

(49)

where $G({\boldsymbol{q}})$ and $G_{s}({\boldsymbol{q}})$ are i.i.d. standard Gaussian. It then follows that

\displaystyle\hat{\boldsymbol{\theta}}_{n,m}(\alpha)=\arg\min_{\boldsymbol{% \theta}}\left\{(1-\alpha)\|\hat{\boldsymbol{Y}}-{\boldsymbol{\theta}}\big{\|}_% {2}^{2}+\alpha\|\hat{\boldsymbol{Y}}_{s}-{\boldsymbol{\theta}}\big{\|}_{2}^{2}% +\lambda\|{\boldsymbol{\theta}}\|_{p,2}^{2}\right\}\,.

(50)

where we abuse the notation to define

\displaystyle\|{\boldsymbol{\theta}}\|_{p,2}^{2}:=\sum_{{\boldsymbol{q}}\in{% \mathcal{Q}}_{d}}c_{p,{\boldsymbol{q}}}\,|\theta({\boldsymbol{q}})|^{2}\,.

(51)

with $c_{p,{\boldsymbol{q}}}:=1+\|{\boldsymbol{q}}\|^{2r}$ . Minimizing (50) we get

\displaystyle\hat{\theta}_{n,m}({\boldsymbol{q}};\alpha)=\frac{1}{1+\lambda c_% {p,{\boldsymbol{q}}}}\big{[}(1-\alpha)\,\hat{Y}({\boldsymbol{q}})+\alpha\,\hat% {Y}_{s}({\boldsymbol{q}})\big{]}\,.

(52)
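The closed form (52) can be verified coordinate by coordinate, since the objective in (50) is separable across frequencies. The sketch below treats each coordinate as real-valued (a simplification, since Fourier coefficients are complex in general) and checks that the gradient of the per-coordinate objective vanishes at the claimed minimizer. The random inputs and parameter values are hypothetical.

```python
import numpy as np

def objective_gradient(theta, y, y_s, alpha, lam, c):
    """Gradient in theta of (1-alpha)(y-theta)^2 + alpha (y_s-theta)^2 + lam c theta^2."""
    return (-2.0 * (1.0 - alpha) * (y - theta)
            - 2.0 * alpha * (y_s - theta)
            + 2.0 * lam * c * theta)

def closed_form(y, y_s, alpha, lam, c):
    """Coordinate-wise minimizer, as in Eq. (52)."""
    return ((1.0 - alpha) * y + alpha * y_s) / (1.0 + lam * c)

# Hypothetical observations on five frequencies, with weights c = 1 + q^2.
rng = np.random.default_rng(0)
y, y_s = rng.normal(size=5), rng.normal(size=5)
c = 1.0 + np.arange(5) ** 2.0

theta_hat = closed_form(y, y_s, alpha=0.3, lam=0.1, c=c)
max_gradient = np.max(np.abs(objective_gradient(theta_hat, y, y_s, 0.3, 0.1, c)))
print("largest gradient entry at the closed form:", max_gradient)
```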

Taking the inverse Fourier transform and plugging it into the excess risk formula we get

	$\displaystyle R(\hat{f}_{n,m,\alpha})$	$\displaystyle=\sum_{{\boldsymbol{q}}\in{\mathcal{Q}}_{d}}\frac{1}{(1+\lambda c% _{p,{\boldsymbol{q}}})^{2}}\big{[}\alpha(\theta_{,s}-\theta_{})({\boldsymbol% {q}})$
		$\displaystyle+\lambda c_{p,{\boldsymbol{q}}}\theta_{*}({\boldsymbol{q}})\big{]% }^{2}+V_{n,m}\sum_{{\boldsymbol{q}}\in{\mathcal{Q}}_{d}}\frac{1}{(1+\lambda c_% {p,{\boldsymbol{q}}})^{2}}\,,$		(53)

where

\displaystyle V_{n,m}

\displaystyle:=(1-\alpha)^{2}\frac{\sigma^{2}}{n}+\alpha^{2}\frac{\sigma_{s}^{% 2}}{m}\,.

(54)

The convexity of $x\to x^{2}$ implies

\displaystyle(a+b)^{2}=\left(\gamma\frac{a}{\gamma}+(1-\gamma)\frac{b}{1-% \gamma}\right)^{2}\leq\frac{a^{2}}{\gamma}+\frac{b^{2}}{1-\gamma}

(55)

for $\gamma\in(0,1)$ and therefore we can upper bound the first sum in (53) by taking $\gamma=1/(1+\delta)$ for any $\delta>0$ , which yields

R(f_{n,m,\alpha})\leq(1+\delta)\alpha^{2}\|{\boldsymbol{\theta}}_{*,s}-{% \boldsymbol{\theta}}_{*}\|_{2}^{2}+\frac{1+\delta}{\delta}\sum_{{\boldsymbol{q% }}\in{\mathcal{Q}}_{d}}\left(\frac{\lambda c_{p,{\boldsymbol{q}}}}{1+\lambda c% _{p,{\boldsymbol{q}}}}\right)^{2}|\theta_{*}({\boldsymbol{q}})|^{2}+V_{n,m}% \sum_{{\boldsymbol{q}}\in{\mathcal{Q}}_{d}}\frac{1}{(1+\lambda c_{p,{% \boldsymbol{q}}})^{2}}\,.

(56)

D.1 Proof of Theorem 2

We now upper bound the first sum above. Define $q_{0}$ via $\lambda c_{r}(q_{0})=1$ (with an abuse of notation, $c_{r}(t)=1+t^{2r}$ ), whence $q_{0}\geq(2\lambda)^{-1/2r}$ for all $\lambda\leq 1/2$ . Then

	$\displaystyle\sum_{{\boldsymbol{q}}\in{\mathcal{Q}}_{d}}$	$\displaystyle\left(\frac{\lambda c_{r,{\boldsymbol{q}}}}{1+\lambda c_{r,{\boldsymbol{q}}}}\right)^{2}\cdot|\theta_{*}({\boldsymbol{q}})|^{2}\leq\sum_{{\boldsymbol{q}}\in{\mathcal{Q}}_{d},\|{\boldsymbol{q}}\|_{2}\leq q_{0}}\lambda^{2}c_{r}({\boldsymbol{q}})^{2}|\theta_{*}({\boldsymbol{q}})|^{2}+\sum_{{\boldsymbol{q}}\in{\mathcal{Q}}_{d},\|{\boldsymbol{q}}\|_{2}>q_{0}}|\theta_{*}({\boldsymbol{q}})|^{2}$
		$\displaystyle\leq\lambda^{2}\max_{\|{\boldsymbol{q}}\|_{2}\leq q_{0}}\frac{c_{r}({\boldsymbol{q}})^{2}}{c_{s}({\boldsymbol{q}})}\sum_{{\boldsymbol{q}}\in{\mathcal{Q}}_{d},\|{\boldsymbol{q}}\|_{2}\leq q_{0}}c_{s}({\boldsymbol{q}})|\theta_{*}({\boldsymbol{q}})|^{2}+\max_{\|{\boldsymbol{q}}\|_{2}>q_{0}}\frac{1}{c_{s}({\boldsymbol{q}})}\sum_{{\boldsymbol{q}}\in{\mathcal{Q}}_{d},\|{\boldsymbol{q}}\|_{2}>q_{0}}c_{s}({\boldsymbol{q}})|\theta_{*}({\boldsymbol{q}})|^{2}$
		$\displaystyle\stackrel{{\scriptstyle(a)}}{{\leq}}\lambda^{2}\max_{\|{\boldsymbol{q}}\|_{2}\leq q_{0}}\frac{c_{r}({\boldsymbol{q}})^{2}}{c_{s}({\boldsymbol{q}})}+\max_{\|{\boldsymbol{q}}\|_{2}>q_{0}}\frac{1}{c_{s}({\boldsymbol{q}})}$
		$\displaystyle\leq\lambda^{2}\max\Big{(}1,\,\frac{c_{r}(q_{0})^{2}}{c_{s}(q_{0}% )}\Big{)}+\frac{1}{c_{s}(q_{0})}$
		$\displaystyle\leq C\max(\lambda^{2},\lambda^{p/r})+C\lambda^{p/r}\leq C\lambda% ^{2\wedge(p/r)}\,,$

where in $(a)$ we used the fact that $\|f_{*}\|_{2,p}^{2}=\sum_{{\boldsymbol{q}}}c_{s}({\boldsymbol{q}})|\theta_{*}({\boldsymbol{q}})|^{2}$ . Letting $C_{i}(d)$ be constants depending on $d$ , we have

	$\displaystyle\sum_{{\boldsymbol{q}}\in{\mathcal{Q}}_{d}}\frac{1}{(1+\lambda c_% {r,{\boldsymbol{q}}})^{2}}$	$\displaystyle\leq C_{1}(d)\int_{\mathbb{R}^{d}}\frac{1}{(1+\lambda c_{r,{% \boldsymbol{q}}})^{2}}\,{\rm d}{\boldsymbol{q}}$
		$\displaystyle\leq C_{1}(d)\int_{\mathbb{R}^{d}}\frac{1}{(1+\lambda\|{\boldsymbol{q}}\|^{2r})^{2}}\,{\rm d}{\boldsymbol{q}}$
		$\displaystyle\leq C_{2}(d)\int_{0}^{\infty}\frac{t^{d-1}}{(1+\lambda t^{2r})^{% 2}}{\rm d}t$
		$\displaystyle\leq C_{2}(d)\int_{0}^{\lambda^{-1/2r}}t^{d-1}\,{\rm d}t+C_{2}(d)% \lambda^{-2}\int_{\lambda^{-1/2r}}^{\infty}t^{d-1-4r}\,{\rm d}t\,.$

For convergence we require $r>d/4$ , in which case

\displaystyle\sum_{{\boldsymbol{q}}\in{\mathcal{Q}}_{d}}\frac{1}{(1+\lambda c_% {r,{\boldsymbol{q}}})^{2}}\leq C_{4}(d)\lambda^{-d/2r}\,.

(57)
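The scaling in Eq. (57) can be illustrated numerically in dimension $d=1$ . The sketch below evaluates a truncated version of the sum over ${\mathcal{Q}}_{1}=2\pi{\mathbb{Z}}$ with $c_{r}(t)=1+t^{2r}$ and rescales it by $\lambda^{d/2r}$ ; the truncation level `q_max` and the values of $r$ and $\lambda$ are choices made here for illustration, with $r>d/4$ .

```python
import numpy as np

def variance_sum_1d(lam, r, q_max=200000):
    """Truncated sum over q in 2*pi*Z of 1 / (1 + lam * (1 + |q|^{2r}))^2, for d = 1."""
    q = 2.0 * np.pi * np.arange(-q_max, q_max + 1)
    c = 1.0 + np.abs(q) ** (2.0 * r)
    return np.sum(1.0 / (1.0 + lam * c) ** 2)

r = 1.0  # satisfies r > d/4 for d = 1
rescaled = {}
for lam in (1e-2, 1e-3, 1e-4):
    total = variance_sum_1d(lam, r)
    # If the sum is O(lam^{-d/2r}), this rescaled value stays bounded as lam -> 0.
    rescaled[lam] = total * lam ** (1.0 / (2.0 * r))
    print(lam, total, rescaled[lam])
```

The rescaled values remaining of order one as $\lambda$ decreases is consistent with a bound of the form $C_{4}(d)\lambda^{-d/2r}$ .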

Appendix E Analysis of high-dimensional regression: Proofs for Section 3.4

E.1 Auxiliary definition for Theorem 3

Our characterization is given in terms of a variational principle. For $\delta,\delta_{s}\in(0,\infty)$ , define $\mathscrsfs{R}:\mathbb{R}_{\geq 0}^{3}\to\mathbb{R}$ via

	$\displaystyle\mathscrsfs{R}(\xi,\xi_{\perp},\omega,\alpha,r,r_{s},\delta,% \delta_{s},\lambda,\gamma,\sigma,\sigma_{s})$	$\displaystyle:=-\omega\sqrt{\rho^{2}+\rho_{s}^{2}}+\rho\sqrt{\delta(\tau^{2}+% \sigma^{2})}+\rho_{s}\sqrt{\delta_{s}(\tau_{s}^{2}+\sigma_{s}^{2})}$		(58)
		$\displaystyle\qquad-\frac{\delta\rho^{2}}{2(1-\alpha)}-\frac{\delta_{s}\rho_{s% }^{2}}{2\alpha}+\frac{\lambda}{2}\big{(}\xi^{2}+\xi_{\perp}^{2}+\omega^{2}\big% {)}\,,$

where $\tau,\tau_{s}$ are defined by

	$\displaystyle\tau^{2}$	$\displaystyle:=(\xi-r)^{2}+\xi_{\perp}^{2}+\omega^{2}\,,$		(59)
	$\displaystyle\tau_{s}^{2}$	$\displaystyle:=(\xi-r_{s}\cos\gamma)^{2}+(\xi_{\perp}-r_{s}\sin\gamma)^{2}+% \omega^{2}\,,$		(60)

and $\rho={\overline{\rho}}/\sqrt{1+t^{2}}$ , $\rho_{s}={\overline{\rho}}t/\sqrt{1+t^{2}}$ , with ${\overline{\rho}}$ solving the polynomial equation

\displaystyle{\overline{\rho}}^{2}=\frac{\delta(\tau^{2}+\sigma^{2})}{\big{(}% \delta/(1-\alpha)+\omega/{\overline{\rho}}\big{)}^{2}}+\frac{\delta_{s}(\tau_{% s}^{2}+\sigma_{s}^{2})}{\big{(}\delta_{s}/\alpha+\omega/{\overline{\rho}}\big{% )}^{2}}\,,

(61)

and $t$ is given by

\displaystyle t=\frac{\omega+\delta{\overline{\rho}}/(1-\alpha)}{\omega+\delta% _{s}{\overline{\rho}}/\alpha}\cdot\sqrt{\frac{\delta_{s}(\tau^{2}_{s}+\sigma^{% 2}_{s})}{\delta(\tau^{2}+\sigma^{2})}}\,.

(62)

Theorem 3 states that the asymptotics of the test error is determined by the minimizer of $\mathscrsfs{R}$ .
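To make the definitions (59)–(62) concrete, the following sketch solves Eq. (61) for ${\overline{\rho}}$ by bisection, computes $t$ from Eq. (62), and checks that the resulting pair $(\rho,\rho_{s})$ is a stationary point in $(\rho,\rho_{s})$ of the right-hand side of Eq. (58). It is an illustration under assumptions: the bisection solver, the function names, and the parameter values are choices made here, and the sketch assumes $\omega>0$ and $\delta+\delta_{s}>1$ so that a positive root exists.

```python
import math

def solve_rho(xi, xi_perp, omega, alpha, r, r_s, delta, delta_s, gamma, sigma, sigma_s):
    """Solve Eq. (61) for rho_bar by bisection; return rho, rho_s and auxiliary constants."""
    tau2 = (xi - r) ** 2 + xi_perp ** 2 + omega ** 2                      # Eq. (59)
    tau_s2 = ((xi - r_s * math.cos(gamma)) ** 2
              + (xi_perp - r_s * math.sin(gamma)) ** 2 + omega ** 2)      # Eq. (60)
    A = delta * (tau2 + sigma ** 2)
    B = delta_s * (tau_s2 + sigma_s ** 2)
    a, b = delta / (1.0 - alpha), delta_s / alpha

    def g(rho_bar):
        # Eq. (61) divided by rho_bar^2; g is decreasing in rho_bar > 0.
        return A / (a * rho_bar + omega) ** 2 + B / (b * rho_bar + omega) ** 2 - 1.0

    lo, hi = 0.0, 1.0
    while g(hi) > 0.0:          # expand the bracket until g changes sign
        hi *= 2.0
    for _ in range(200):        # plain bisection
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    rho_bar = 0.5 * (lo + hi)

    t = (omega + a * rho_bar) / (omega + b * rho_bar) * math.sqrt(B / A)  # Eq. (62)
    rho = rho_bar / math.sqrt(1.0 + t ** 2)
    rho_s = rho_bar * t / math.sqrt(1.0 + t ** 2)
    return rho, rho_s, A, B, a, b

def gradient_in_rho(rho, rho_s, omega, A, B, a, b):
    """Partial derivatives in (rho, rho_s) of the right-hand side of Eq. (58)."""
    rho_bar = math.hypot(rho, rho_s)
    d_rho = -omega * rho / rho_bar + math.sqrt(A) - a * rho
    d_rho_s = -omega * rho_s / rho_bar + math.sqrt(B) - b * rho_s
    return d_rho, d_rho_s

# Hypothetical parameter values with delta + delta_s > 1.
omega = 0.5
rho, rho_s, A, B, a, b = solve_rho(xi=0.8, xi_perp=0.1, omega=omega, alpha=0.4, r=1.0,
                                   r_s=1.0, delta=0.8, delta_s=1.5, gamma=0.3,
                                   sigma=0.5, sigma_s=0.5)
grad = gradient_in_rho(rho, rho_s, omega, A, B, a, b)
print("rho, rho_s:", rho, rho_s, " gradient:", grad)
```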

E.2 Proof of Theorem 3

The proof is based on Gordon’s Gaussian comparison inequality [Gor85, Ver18], and follows a standard route, see e.g. [TOH15, TAH18, MM21]. We will limit ourselves to outlining the main steps of the calculation. Throughout, we consider the case ${\varepsilon}_{0}>0$ , $\delta+\delta_{s}>1$ , because the other one ( ${\varepsilon}_{0}=0$ and $\delta,\delta_{s}>1$ ) is analogous and less interesting.

We begin by rewriting the ridge cost function in terms of a Lagrangian

$\displaystyle\widehat{R}_{n,m}({\boldsymbol{\theta}};\alpha)=$	$\displaystyle\max_{{\boldsymbol{u}}\in\mathbb{R}^{n}}\max_{{\boldsymbol{u}}^{s% }\in\mathbb{R}^{m}}\widehat{L}_{n,m}({\boldsymbol{\theta}},{\boldsymbol{u}},{% \boldsymbol{u}}^{s};\alpha)\,,$	(63)
$\displaystyle\widehat{L}_{n,m}({\boldsymbol{\theta}},{\boldsymbol{u}},{\boldsymbol{u}}^{s};\alpha):=$	$\displaystyle\langle{\boldsymbol{u}},{\boldsymbol{X}}({\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{*})\rangle+\langle{\boldsymbol{u}}^{s},{\boldsymbol{X}}^{s}({\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{*,s})\rangle-\langle{\boldsymbol{u}},{\boldsymbol{{\varepsilon}}}\rangle-\langle{\boldsymbol{u}}^{s},{\boldsymbol{{\varepsilon}}}^{s}\rangle$	(64)
	$\displaystyle-\frac{n\|{\boldsymbol{u}}\|_{2}^{2}}{2(1-\alpha)}-\frac{m\|{\boldsymbol{u}}^{s}\|_{2}^{2}}{2\alpha}+\frac{\lambda}{2}\,\|{\boldsymbol{\theta}}\|^{2}_{2}\,.$

Let $\Delta({\boldsymbol{\theta}},{\boldsymbol{u}},{\boldsymbol{u}}^{s}):=\|{\boldsymbol{u}}\|_{2}\|{\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{*}\|_{2}G+\|{\boldsymbol{u}}^{s}\|_{2}\|{\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{*,s}\|_{2}G_{s}$ , where $G,G_{s}$ are independent standard normal random variables, independent of ${\boldsymbol{X}},{\boldsymbol{X}}^{s}$ . By Gordon’s inequality [Gor85], we can compare the Gaussian process $\widehat{L}_{n,m}({\boldsymbol{\theta}},{\boldsymbol{u}},{\boldsymbol{u}}^{s};\alpha)+\Delta({\boldsymbol{\theta}},{\boldsymbol{u}},{\boldsymbol{u}}^{s})$ to

	$\displaystyle\widehat{L}^{G}_{n,m}({\boldsymbol{\theta}},{\boldsymbol{u}},{\boldsymbol{u}}^{s};\alpha):=$	$\displaystyle\|{\boldsymbol{u}}\|\langle{\boldsymbol{g}},{\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{*}\rangle+\|{\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{*}\|\langle{\boldsymbol{h}},{\boldsymbol{u}}\rangle+\|{\boldsymbol{u}}^{s}\|\langle{\boldsymbol{g}}^{s},{\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{*,s}\rangle+\|{\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{*,s}\|\langle{\boldsymbol{h}}^{s},{\boldsymbol{u}}^{s}\rangle$		(65)
		$\displaystyle-\langle{\boldsymbol{u}},{\boldsymbol{{\varepsilon}}}\rangle-\langle{\boldsymbol{u}}^{s},{\boldsymbol{{\varepsilon}}}^{s}\rangle-\frac{n\|{\boldsymbol{u}}\|_{2}^{2}}{2(1-\alpha)}-\frac{m\|{\boldsymbol{u}}^{s}\|_{2}^{2}}{2\alpha}+\frac{\lambda}{2}\,\|{\boldsymbol{\theta}}\|^{2}_{2}\,.$

Next we define the orthonormal vectors

\displaystyle{\boldsymbol{v}}_{*}:=\frac{{\boldsymbol{\theta}}_{*}}{\|{% \boldsymbol{\theta}}_{*}\|_{2}}\,,\;\;\;\;\;\;{\boldsymbol{v}}^{\perp}_{*}:=% \frac{{\boldsymbol{P}}_{{\boldsymbol{\theta}}_{*}}^{\perp}{\boldsymbol{\theta}% }_{*,s}}{\|{\boldsymbol{P}}_{{\boldsymbol{\theta}}_{*}}^{\perp}{\boldsymbol{% \theta}}_{*,s}\|_{2}}\,,

(66)

where ${\boldsymbol{P}}_{{\boldsymbol{\theta}}_{*}}^{\perp}={\boldsymbol{I}}-{% \boldsymbol{P}}_{{\boldsymbol{\theta}}_{*}}:={\boldsymbol{I}}-{\boldsymbol{v}}% _{*}{\boldsymbol{v}}_{*}^{{\sf T}}$ is the projector orthogonal to ${\boldsymbol{\theta}}_{*}$ . We then decompose

\displaystyle{\boldsymbol{\theta}}=\xi{\boldsymbol{v}}_{*}+\xi_{\perp}\,{% \boldsymbol{v}}_{*}^{\perp}+{\boldsymbol{\theta}}^{\perp}\,,

(67)

where $\langle{\boldsymbol{v}}_{*},{\boldsymbol{\theta}}^{\perp}\rangle=\langle{\boldsymbol{v}}^{\perp}_{*},{\boldsymbol{\theta}}^{\perp}\rangle=0$ , and define $\omega:=\|{\boldsymbol{\theta}}^{\perp}\|_{2}$ . Defining $\tau^{2}=\|{\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{*}\|_{2}^{2}$ , $\tau^{2}_{s}=\|{\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{*,s}\|_{2}^{2}$ , Eqs. (59) and (60) follow.

With these notations, and letting $\hat{\sigma}^{2}=\|\tau{\boldsymbol{h}}+{\boldsymbol{{\varepsilon}}}\|^{2}_{2}% /n-\tau^{2}$ , $\hat{\sigma}_{s}^{2}=\|\tau_{s}{\boldsymbol{h}}^{s}+{\boldsymbol{{\varepsilon}% }}^{s}\|^{2}_{2}/m-\tau_{s}^{2}$ , we get

$\displaystyle\widehat{\mathscrsfs{L}}^{G}_{n,m}({\boldsymbol{\theta}},\rho,\rho_{s};\alpha)$	$\displaystyle:=\max_{{\boldsymbol{u}},{\boldsymbol{u}}^{s}}\Big{\{}\widehat{L}^{G}_{n,m}({\boldsymbol{\theta}},{\boldsymbol{u}},{\boldsymbol{u}}^{s};\alpha):\;\|{\boldsymbol{u}}\|=\frac{\rho}{\sqrt{d}},\,\|{\boldsymbol{u}}^{s}\|=\frac{\rho_{s}}{\sqrt{d}}\Big{\}}\,,$	(68)
$\displaystyle\widehat{\mathscrsfs{L}}^{G}_{n,m}({\boldsymbol{\theta}},\rho,\rho_{s};\alpha)$	$\displaystyle=\frac{\rho}{\sqrt{d}}\langle{\boldsymbol{g}},{\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{*}\rangle+\frac{\rho_{s}}{\sqrt{d}}\langle{\boldsymbol{g}}^{s},{\boldsymbol{\theta}}-{\boldsymbol{\theta}}_{*,s}\rangle+\rho\sqrt{\delta(\tau^{2}+\hat{\sigma}^{2})}+\rho_{s}\sqrt{\delta_{s}(\tau_{s}^{2}+\hat{\sigma}_{s}^{2})}$	(69)
	$\displaystyle-\frac{\delta\rho^{2}}{2(1-\alpha)}-\frac{\delta_{s}\rho_{s}^{2}}{2\alpha}+\frac{\lambda}{2}\,\big{(}\xi^{2}+\xi_{\perp}^{2}+\omega^{2}\big{)}\,.$

We finally decompose ${\boldsymbol{g}}={\boldsymbol{g}}_{\parallel}+{\boldsymbol{g}}_{\perp}$ where ${\boldsymbol{g}}_{\parallel}\in{\sf span}({\boldsymbol{v}}_{*},{\boldsymbol{v}}_{*}^{\perp})$ and ${\boldsymbol{g}}_{\perp}\perp{\sf span}({\boldsymbol{v}}_{*},{\boldsymbol{v}}_{*}^{\perp})$ , and similarly for ${\boldsymbol{g}}_{s}$ , and define

\displaystyle\mathscrsfs{L}^{G}_{n,m}(\xi,\xi_{\perp},\omega,\rho,\rho_{s};% \alpha):=\min_{{\boldsymbol{\theta}}}\Big{\{}\widehat{\mathscrsfs{L}}^{G}_{n,m% }({\boldsymbol{\theta}},\rho,\rho_{s};\alpha):\;{\boldsymbol{\theta}}=\xi{% \boldsymbol{v}}_{*}+\xi_{\perp}\,{\boldsymbol{v}}_{*}^{\perp}+{\boldsymbol{% \theta}}^{\perp}\,,\;\;\|{\boldsymbol{\theta}}^{\perp}\|=\omega\Big{\}}\,.

(70)

Defining $\iota$ via $\|\rho{\boldsymbol{g}}_{\perp}/\sqrt{n}+\rho_{s}{\boldsymbol{g}}_{s,\perp}/% \sqrt{m}\|=(1+\iota)\sqrt{\rho^{2}+\rho_{s}^{2}}$ , we obtain

	$\displaystyle\mathscrsfs{L}^{G}_{n,m}(\xi,\xi_{\perp},\omega,\rho,\rho_{s};% \alpha)=$	$\displaystyle-(1+\iota)\sqrt{\rho^{2}+\rho_{s}^{2}}\cdot\omega+\Delta+\rho% \sqrt{\delta(\tau^{2}+\hat{\sigma}^{2})}+\rho_{s}\sqrt{\delta_{s}(\tau_{s}^{2}% +\hat{\sigma}_{s}^{2})}$		(71)
		$\displaystyle-\frac{\delta\rho^{2}}{2(1-\alpha)}-\frac{\delta_{s}\rho_{s}^{2}}% {2\alpha}+\frac{\lambda}{2}\,\big{(}\xi^{2}+\xi_{\perp}^{2}+\omega^{2}\big{)}\,,$

where $\Delta$ is the contribution of the perpendicular components. Simple concentration estimates imply that for any ${\varepsilon}>0$ there exists $c({\varepsilon})>0$ such that

$\displaystyle{\mathbb{P}}\big{(}|\hat{\sigma}-\sigma|\leq{\varepsilon}\sqrt{\tau^{2}+\sigma^{2}},|\hat{\sigma}_{s}-\sigma_{s}|\leq{\varepsilon}\sqrt{\tau_{s}^{2}+\sigma_{s}^{2}}\big{)}$	$\displaystyle\geq 1-e^{-c({\varepsilon})n}\,,$	(72)
$\displaystyle{\mathbb{P}}\big{(}|\Delta|\leq\sqrt{(\rho^{2}+\rho_{s}^{2})(\xi^{2}+\xi_{\perp}^{2})}\big{)}$	$\displaystyle\geq 1-e^{-c({\varepsilon})n}\,,$	(73)
$\displaystyle{\mathbb{P}}\big{(}|\iota|\leq{\varepsilon}\big{)}$	$\displaystyle\geq 1-e^{-c({\varepsilon})n}\,.$	(74)

We can then estimate $\mathscrsfs{L}^{G}_{n,m}(\xi,\xi_{\perp},\omega,\rho,\rho_{s};\alpha)$ by

	$\displaystyle\mathscrsfs{L}^{G}(\xi,\xi_{\perp},\omega,\rho,\rho_{s};\alpha)=$	$\displaystyle-\sqrt{\rho^{2}+\rho_{s}^{2}}\cdot\omega+\rho\sqrt{\delta(\tau^{2% }+\sigma^{2})}+\rho_{s}\sqrt{\delta_{s}(\tau_{s}^{2}+\sigma_{s}^{2})}$		(75)
		$\displaystyle-\frac{\delta\rho^{2}}{2(1-\alpha)}-\frac{\delta_{s}\rho_{s}^{2}}{2\alpha}+\frac{\lambda}{2}\,\big{(}\xi^{2}+\xi_{\perp}^{2}+\omega^{2}\big{)}\,.$

Differentiating with respect to $\rho$ and $\rho_{s}$ and setting the derivatives to $0$ yields $\rho={\overline{\rho}}/\sqrt{1+t^{2}}$ , $\rho_{s}={\overline{\rho}}t/\sqrt{1+t^{2}}$ , with ${\overline{\rho}},t$ given by Eqs. (61), (62). By computing second derivatives, one obtains that this is a local maximum. Since $\mathscrsfs{L}^{G}(\xi,\xi_{\perp},\omega,\rho,\rho_{s};\alpha)\to-\infty$ as $\rho^{2}+\rho_{s}^{2}\to\infty$ , the maximum over $\rho,\rho_{s}$ is either achieved at this point or at the boundary $\{\rho=0\}\cup\{\rho_{s}=0\}$ . By checking the signs of partial derivatives along this boundary, the only other possibility is $\rho=\rho_{s}=0$ .

For economy of notation, write $F(\rho,\rho_{s}):=\mathscrsfs{L}^{G}(\xi,\xi_{\perp},\omega,\rho,\rho_{s};\alpha)$ . For any unit vector ${\boldsymbol{v}}=(v_{1},v_{2})\geq 0$ , the directional derivative is

	$\displaystyle\nabla_{{\boldsymbol{v}}}F({\boldsymbol{r}})\big{|}_{{\boldsymbol{r}}=0}$	$\displaystyle=-\omega+v_{1}\sqrt{\delta(\tau^{2}+\sigma^{2})}+v_{2}\sqrt{\delta_{s}(\tau_{s}^{2}+\sigma_{s}^{2})}$
		$\displaystyle\geq\omega\big{[}-1+v_{1}\sqrt{\delta}+v_{2}\sqrt{\delta_{s}}\big% {]}\,.$

By maximizing over the direction, i.e. taking ${\boldsymbol{v}}\propto(\sqrt{\delta},\sqrt{\delta_{s}})$ , we see that ${\boldsymbol{v}}$ can be chosen so that $\nabla_{{\boldsymbol{v}}}F({\boldsymbol{0}})\geq\omega\big{[}-1+\sqrt{\delta+\delta_{s}}\big{]}$ . Hence $\rho=\rho_{s}=0$ cannot be the global maximum for $\delta+\delta_{s}>1$ .
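The maximization over the direction is a Cauchy-Schwarz step, and it can be checked numerically. The following is a minimal sketch, not part of the original proof: the function names and the values of $\delta$, $\delta_{s}$, $\omega$ are illustrative assumptions, and the code only examines the lower bound $\omega[-1+v_{1}\sqrt{\delta}+v_{2}\sqrt{\delta_{s}}]$ over nonnegative unit vectors.

```python
import numpy as np

def best_direction(delta, delta_s):
    """Nonnegative unit vector maximizing v1*sqrt(delta) + v2*sqrt(delta_s)."""
    w = np.array([np.sqrt(delta), np.sqrt(delta_s)])
    return w / np.linalg.norm(w)

def slope_lower_bound(v, omega, delta, delta_s):
    """Lower bound on the directional derivative at the origin:
    omega * (-1 + v1*sqrt(delta) + v2*sqrt(delta_s))."""
    return omega * (-1.0 + v[0] * np.sqrt(delta) + v[1] * np.sqrt(delta_s))

# Illustrative (assumed) values with delta + delta_s > 1.
delta, delta_s, omega = 0.4, 0.9, 1.3

v_star = best_direction(delta, delta_s)
best = slope_lower_bound(v_star, omega, delta, delta_s)

# Brute-force comparison over the quarter circle of nonnegative unit vectors.
angles = np.linspace(0.0, np.pi / 2, 2001)
grid = np.array([
    slope_lower_bound((np.cos(a), np.sin(a)), omega, delta, delta_s)
    for a in angles
])

print("closed form :", omega * (-1.0 + np.sqrt(delta + delta_s)))
print("at v_star   :", best)
print("grid maximum:", grid.max())
```

In this sketch the bound at `v_star` coincides with $\omega[-1+\sqrt{\delta+\delta_{s}}]$ and is positive exactly when $\delta+\delta_{s}>1$ (for $\omega>0$), which is the condition used in the text.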

Hence, we get

\displaystyle\mathscrsfs{R}(\xi,\xi_{\perp},\omega)=\max_{\rho,\rho_{s}\geq 0}% \mathscrsfs{L}^{G}(\xi,\xi_{\perp},\omega,\rho,\rho_{s};\alpha)\,.

(76)
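For intuition, the maximization in Eq. (76) can be carried out numerically. The sketch below is illustrative and rests on assumptions: the quantities $a=\sqrt{\delta(\tau^{2}+\sigma^{2})}$ and $b=\sqrt{\delta_{s}(\tau_{s}^{2}+\sigma_{s}^{2})}$ are passed as plain numbers (their dependence on $(\xi,\xi_{\perp},\omega)$ is defined elsewhere in the paper and is not reproduced here), the additive constant $\frac{\lambda}{2}(\xi^{2}+\xi_{\perp}^{2}+\omega^{2})$ is dropped, and all names and parameter values are hypothetical. Writing $(\rho,\rho_{s})=r\,(\cos\phi,\sin\phi)$, the objective is quadratic in $r$ for each fixed angle, so the radial maximization is explicit and only a one-dimensional search over $\phi$ remains.

```python
import numpy as np

def F(rho, rho_s, omega, a, b, delta, delta_s, alpha):
    """L^G as a function of (rho, rho_s), without the additive lambda-term."""
    return (-omega * np.hypot(rho, rho_s) + a * rho + b * rho_s
            - delta * rho**2 / (2.0 * (1.0 - alpha))
            - delta_s * rho_s**2 / (2.0 * alpha))

def max_over_rho(omega, a, b, delta, delta_s, alpha, n_angles=20001):
    """Maximize F over rho, rho_s >= 0 using polar coordinates.

    Along the ray r*(cos(phi), sin(phi)) we have
    F = r*slope(phi) - 0.5*curv(phi)*r**2, which is maximized at
    r = max(slope, 0)/curv. The angle is then found by grid search.
    """
    phi = np.linspace(0.0, np.pi / 2, n_angles)
    v1, v2 = np.cos(phi), np.sin(phi)
    slope = -omega + a * v1 + b * v2
    curv = delta * v1**2 / (1.0 - alpha) + delta_s * v2**2 / alpha
    r = np.maximum(slope, 0.0) / curv
    value = r * slope - 0.5 * curv * r**2
    i = int(np.argmax(value))
    return value[i], r[i] * v1[i], r[i] * v2[i]

# Hypothetical parameter values, chosen only for illustration.
params = dict(omega=1.0, a=1.2, b=0.9, delta=0.8, delta_s=0.7, alpha=0.4)
best_value, rho_opt, rho_s_opt = max_over_rho(**params)
print(best_value, rho_opt, rho_s_opt)
```

The maximum is strictly positive, and hence attained away from $\rho=\rho_{s}=0$, exactly when some direction has positive slope at the origin.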

We further note that, for fixed $\rho,\rho_{s}>0$ , the function $(\xi,\xi_{\perp},\omega)\mapsto\mathscrsfs{L}^{G}(\xi,\xi_{\perp},\omega,\rho,% \rho_{s};\alpha)$ is jointly strictly convex for $\lambda>0$ . Hence $(\xi,\xi_{\perp},\omega)\mapsto\mathscrsfs{R}(\xi,\xi_{\perp},\omega)$ is also strictly convex for $\lambda>0$ . Therefore, it has a unique minimizer, which we denote by $(\xi^{*},\xi_{\perp}^{*},\omega^{*})$ . Proceeding as in [MM21], we obtain the following result.

Proposition E.1.

Under the assumptions of Proposition 3, for any ${\varepsilon},{\varepsilon}_{0}>0$ there exists $c=c({\varepsilon},{\varepsilon}_{0})>0$ such that, if $\alpha\in[{\varepsilon}_{0},1-{\varepsilon}_{0}]$ (letting ${\boldsymbol{P}}^{\perp}:={\boldsymbol{I}}-{\boldsymbol{v}}_{*}{\boldsymbol{v}% }_{*}^{{\sf T}}-{\boldsymbol{v}}^{\perp}_{*}({\boldsymbol{v}}_{*}^{\perp})^{{% \sf T}}$ )

\displaystyle{\mathbb{P}}\Big{\{}\big{|}\langle{\boldsymbol{v}}_{*},\hat{\boldsymbol{\theta}}_{n,m}\rangle-\xi^{*}\big{|}\leq{\varepsilon},\,\big{|}\langle{\boldsymbol{v}}^{\perp}_{*},\hat{\boldsymbol{\theta}}_{n,m}\rangle-\xi^{*}_{\perp}\big{|}\leq{\varepsilon},\,\big{|}\|{\boldsymbol{P}}^{\perp}\hat{\boldsymbol{\theta}}_{n,m}\|-\omega^{*}\big{|}\leq{\varepsilon}\,\Big{\}}\geq 1-2\,e^{-cn}\,.

(77)

In particular, the last proposition implies (a weaker form of) Theorem 3 whereby the supremum is taken over a finite net. Namely for $\eta>0$ , we define

\displaystyle N({\varepsilon}_{0},\eta):=[{\varepsilon}_{0},1-{\varepsilon}_{0% }]\cap\eta{\mathbb{Z}}\,.

Recalling that, in the present case, $R(\hat{\boldsymbol{\theta}})=\|\hat{\boldsymbol{\theta}}-{\boldsymbol{\theta}}\|^{2}_{2}$ , we therefore obtain (after adjusting the constant $c$ ):

\displaystyle{\mathbb{P}}\Big{(}\max_{\alpha\in N({\varepsilon}_{0},\eta)}\big{|}R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))-\mathscrsfs{R}^{*}(\alpha)\big{|}\leq{\varepsilon}\Big{)}\geq 1-2\,e^{-cn}\,.

(78)

Finally, let ${\boldsymbol{X}}_{+}\in\mathbb{R}^{(m+n)\times d}$ be the matrix obtained by stacking ${\boldsymbol{X}}$ and ${\boldsymbol{X}}_{s}$ . Given constants $C_{1},C_{2},C_{3}$ , define the good event

\displaystyle{\mathcal{G}}:=\Big{\{}C_{1}n\leq\lambda_{\min}({\boldsymbol{X}}_{+}^{{\sf T}}{\boldsymbol{X}}_{+})\leq\lambda_{\max}({\boldsymbol{X}}_{+}^{{\sf T}}{\boldsymbol{X}}_{+})\leq C_{2}n\,;\;\|{\boldsymbol{X}}^{{\sf T}}{\boldsymbol{y}}\|\leq C_{3}n\,,\;\|{\boldsymbol{X}}_{s}^{{\sf T}}{\boldsymbol{y}}_{s}\|\leq C_{3}n\Big{\}}\,.

(79)

By a standard bound on eigenvalues of Wishart matrices [Ver18], for $\delta+\delta_{s}>1$ , we can choose $C_{1},C_{2},C_{3}$ such that

\displaystyle{\mathbb{P}}({\mathcal{G}})\geq 1-2e^{-cn}\,.

(80)
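The eigenvalue part of the good event can be illustrated by simulation. The sketch below is only indicative and makes assumptions not stated in this section: the rows of ${\boldsymbol{X}}$ and ${\boldsymbol{X}}_{s}$ are drawn as standard isotropic Gaussians, $\delta=n/d$ and $\delta_{s}=m/d$ (so that $\delta+\delta_{s}>1$ means $n+m>d$), and the dimensions are arbitrary. It compares the extreme eigenvalues of ${\boldsymbol{X}}_{+}^{{\sf T}}{\boldsymbol{X}}_{+}$ with the classical Gaussian bound $\sqrt{N}-\sqrt{d}-t\leq s_{\min}\leq s_{\max}\leq\sqrt{N}+\sqrt{d}+t$ , $N=n+m$ , which holds with probability at least $1-2e^{-t^{2}/2}$ .

```python
import numpy as np

def stacked_gram_extremes(n, m, d, seed=0):
    """Smallest and largest eigenvalues of X_+^T X_+ for stacked Gaussian data.

    Assumption: X (n x d) and X_s (m x d) have i.i.d. N(0, 1) entries.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X_s = rng.standard_normal((m, d))
    X_plus = np.vstack([X, X_s])                   # (n + m) x d
    s = np.linalg.svd(X_plus, compute_uv=False)    # singular values, descending
    return s[-1] ** 2, s[0] ** 2

# Arbitrary sizes with n < d but n + m > d: the real data alone would give a
# singular Gram matrix, while the stacked matrix is well conditioned.
n, m, d = 300, 500, 400
lam_min, lam_max = stacked_gram_extremes(n, m, d)

N, t = n + m, 5.0
lower = (np.sqrt(N) - np.sqrt(d) - t) ** 2
upper = (np.sqrt(N) + np.sqrt(d) + t) ** 2
print("lambda_min / n:", lam_min / n, " lower bound / n:", lower / n)
print("lambda_max / n:", lam_max / n, " upper bound / n:", upper / n)
```

Under these assumptions both ratios stay of order one, which is the content of the constants $C_{1},C_{2}$ in the definition of ${\mathcal{G}}$ .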

Further, on ${\mathcal{G}}$ , $\hat{\boldsymbol{\theta}}_{n,m}(\alpha)$ is bounded in $\ell_{2}$ norm and Lipschitz continuous in $\alpha$ . As a consequence, for a sufficiently large constant $L$ ,

\displaystyle{\mathbb{P}}\Big{(}\big{|}R(\hat{\boldsymbol{\theta}}_{n,m}(% \alpha_{1}))-R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha_{2}))\big{|}\leq L|% \alpha_{1}-\alpha_{2}|\forall\alpha_{1},\alpha_{2}\in[{\varepsilon}_{0},1-{% \varepsilon}_{0}]\Big{)}\geq 1-2e^{-cn}\,.

(81)

The claim follows by using this estimate together with Eq. (78).
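The passage from the finite net to the whole interval is a standard interpolation step, which the following toy sketch illustrates. It is not the paper's argument verbatim: it assumes that both the empirical curve and its limit are $L$-Lipschitz on $[{\varepsilon}_{0},1-{\varepsilon}_{0}]$ (the text states this only for the empirical risk on ${\mathcal{G}}$ ), and the functions `f`, `g` and all constants are hypothetical stand-ins. If the two curves differ by at most ${\varepsilon}$ on the net and every $\alpha$ lies within $\eta$ of a net point, the triangle inequality gives a gap of at most ${\varepsilon}+2L\eta$ everywhere.

```python
import numpy as np

def alpha_net(eps0, eta):
    """Points of eta*Z lying in [eps0, 1 - eps0] (small guard for rounding)."""
    k_lo = int(np.ceil(eps0 / eta - 1e-12))
    k_hi = int(np.floor((1.0 - eps0) / eta + 1e-12))
    return eta * np.arange(k_lo, k_hi + 1)

def uniform_bound(gap_on_net, lip, eta):
    """Bound on sup |f - g| over the interval, given the gap on the net and a
    common Lipschitz constant (valid when the interval is longer than eta)."""
    return gap_on_net + 2.0 * lip * eta

def f(a):
    """Toy stand-in for the empirical risk curve."""
    return np.sin(3.0 * a)

def g(a):
    """Toy stand-in for the limiting risk curve."""
    return np.sin(3.0 * a) + 0.01 * np.cos(5.0 * a)

lip = 3.05                  # |f'| <= 3 and |g'| <= 3.05
eps0, eta = 0.1, 0.05

grid = alpha_net(eps0, eta)
gap_on_net = np.max(np.abs(f(grid) - g(grid)))

fine = np.linspace(eps0, 1.0 - eps0, 10001)
gap_everywhere = np.max(np.abs(f(fine) - g(fine)))
bound = uniform_bound(gap_on_net, lip, eta)
print(gap_on_net, gap_everywhere, bound)
```

Taking $\eta$ of order ${\varepsilon}/L$ makes the extra term comparable to ${\varepsilon}$ , while the net has only $O(1/\eta)$ points, so the union bound over the net costs only a constant factor in the failure probability.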

	$\displaystyle\nabla^{2}R({\boldsymbol{\theta}}_{*};\alpha)({\boldsymbol{\theta}}_{*}(\alpha)-{\boldsymbol{\theta}}_{*})$	$\displaystyle=-\alpha\nabla R_{s}({\boldsymbol{\theta}}_{*})+{\boldsymbol{u}}\,,$
	$\displaystyle\|{\boldsymbol{u}}\|_{2}$	$\displaystyle\leq L_{2}\|{\boldsymbol{\theta}}_{*}(\alpha)-{\boldsymbol{\theta}}_{*}\|^{2}\,.$

	$\displaystyle B_{n,m}$	$\displaystyle\leq\sum_{k=1}^{k_{1}}\big{[}(\alpha+\lambda\omega_{k})\theta_{*,k}-\alpha\theta_{*,k}^{s}\big{]}^{2}+\sum_{k=k_{1}+1}^{\infty}\frac{1}{(\lambda\omega_{k})^{2}}\big{[}(\alpha+\lambda\omega_{k})\theta_{*,k}-\alpha\theta_{*,k}^{s}\big{]}^{2}$
		$\displaystyle\leq\sum_{k=1}^{k_{1}}\big{[}\alpha^{2}(\theta_{*,k}-\theta_{*,k}^{s})^{2}+2\alpha(\theta_{*,k}-\theta_{*,k}^{s})\lambda\omega_{k}\theta_{*,k}+(\lambda\omega_{k})^{2}\theta^{2}_{*,k}\big{]}$
		$\displaystyle\phantom{AA}+\sum_{k=k_{1}+1}^{\infty}\Big{[}\frac{\alpha^{2}}{(\lambda\omega_{k})^{2}}(\theta_{*,k}-\theta_{*,k}^{s})^{2}+\frac{2\alpha}{\lambda\omega_{k}}(\theta_{*,k}-\theta_{*,k}^{s})\theta_{*,k}+\theta^{2}_{*,k}\Big{]}$
		$\displaystyle\leq\alpha^{2}\|{\boldsymbol{\theta}}_{*,\leq k_{1}}-{\boldsymbol{\theta}}^{s}_{*,\leq k_{1}}\|^{2}+\frac{2\alpha}{\omega_{k_{1}}}\big{|}\langle{\boldsymbol{\theta}}_{*,\leq k_{1}}-{\boldsymbol{\theta}}^{s}_{*,\leq k_{1}},{\boldsymbol{\theta}}_{*,\leq k_{1}}\rangle_{{\boldsymbol{\Omega}}}\big{|}+\frac{1}{\omega_{k_{1}}^{2}}\|{\boldsymbol{\theta}}_{*,\leq k_{1}}\|_{{\boldsymbol{\Omega}}^{2}}^{2}$
		$\displaystyle\phantom{AA}+\alpha^{2}\omega_{k_{1}+1}^{2}\|{\boldsymbol{\theta}}_{*,>k_{1}}-{\boldsymbol{\theta}}^{s}_{*,>k_{1}}\|_{{\boldsymbol{\Omega}}^{-2}}^{2}+2\alpha\omega_{k_{1}+1}\big{|}\langle{\boldsymbol{\theta}}_{*,>k_{1}}-{\boldsymbol{\theta}}^{s}_{*,>k_{1}},{\boldsymbol{\theta}}_{*,>k_{1}}\rangle_{{\boldsymbol{\Omega}}^{-1}}\big{|}+\|{\boldsymbol{\theta}}_{*,>k_{1}}\|^{2}\,,$

	$\displaystyle B_{n,m}-\alpha^{2}\|{\boldsymbol{\theta}}_{*}-{\boldsymbol{\theta}}_{*}^{s}\|^{2}$
	$\displaystyle\geq\sum_{k=1}^{\infty}\Big{(}\frac{\lambda\omega_{k}}{1+\lambda\omega_{k}}\Big{)}^{2}\theta^{2}_{*,k}-\alpha^{2}\sum_{k=1}^{\infty}\left[1-\Big{(}\frac{1}{1+\lambda\omega_{k}}\Big{)}^{2}\right](\theta_{*,k}-\theta_{*,k}^{s})^{2}-CAk_{1}^{-2(\mu\wedge\rho)}$
	$\displaystyle\geq\frac{1}{4\omega^{2}_{k_{1}+1}}\|{\boldsymbol{\theta}}_{*,\leq k_{1}}\|^{2}_{{\boldsymbol{\Omega}}^{2}}+\frac{1}{4}\|{\boldsymbol{\theta}}_{*,>k_{1}}\|^{2}-\frac{A}{4\omega_{k_{1}+1}}\|{\boldsymbol{\theta}}_{*,\leq k_{1}}\|^{2}_{{\boldsymbol{\Omega}}}-\frac{A}{4}\|{\boldsymbol{\theta}}_{*,>k_{1}}\|^{2}-CAk_{1}^{-2(\mu\wedge\rho)}$
	$\displaystyle\geq C\,k_{1}^{-2(\mu\wedge\rho)}\,.$