Scaling laws for learning with real and surrogate data

Ayush Jain, Andrea Montanari, Eren Sasoglu
Granica Computing Inc. (granica.ai); Department of Statistics and Department of Mathematics, Stanford University
Abstract

Collecting large quantities of high-quality data can be prohibitively expensive or impractical, and a bottleneck in machine learning. One may instead augment a small set of $n$ data points from the target distribution with data from more accessible sources, e.g. data collected under different circumstances or synthesized by generative models. We refer to such data as ‘surrogate data.’ We introduce a weighted empirical risk minimization (ERM) approach for integrating surrogate data into training. We analyze this method mathematically under several classical statistical models, and validate our findings empirically on datasets from different domains. Our main findings are: (i) Integrating surrogate data can significantly reduce the test error on the original distribution. Surprisingly, this can happen even when the surrogate data is unrelated to the original data. We trace this behavior back to the classical Stein’s paradox. (ii) In order to reap the benefit of surrogate data, it is crucial to use optimally weighted ERM. (iii) The test error of models trained on mixtures of real and surrogate data is approximately described by a scaling law. This scaling law can be used to predict the optimal weighting scheme, and to choose the amount of surrogate data to add.

1 Introduction and overview

1.1 Motivation and formulation

Consider a standard learning setting where we are given $n$ i.i.d. points $\boldsymbol{z}_i$ from a target distribution $\mathcal{D}$. Given a family of rich parametric models governed by the parameter $\boldsymbol{\theta}$, the goal is to find the parameter $\boldsymbol{\theta}$ that minimizes the expected test loss $R_{\text{test}}(\boldsymbol{\theta})$ between the model predictions for a chosen $\boldsymbol{\theta}$ and the data generated from the distribution $\mathcal{D}$. In many application domains, the available data $\boldsymbol{Z}=(\boldsymbol{z}_i)_{i\le n}$ from the target distribution, referred to as either real or original data, may be difficult or expensive to acquire. One may then attempt to supplement these data with a different, cheaper source. Examples of such cheaper sources are (i) publicly available datasets; (ii) datasets owned by the same research group or company but acquired in different circumstances, e.g. in a different location; (iii) synthetic data produced by a generative model.

We will denote the data points obtained from this source by $\boldsymbol{z}^s_i$, and assume we have $m$ of them. To formalize, assume the ‘surrogate’ data $\boldsymbol{Z}^s=(\boldsymbol{z}_i^s)_{i\le m}$ is a collection of i.i.d. samples from some distribution $\mathcal{D}^s$. In general, we will not assume the distribution $\mathcal{D}^s$ of surrogate data to be close to the original data distribution $\mathcal{D}$; however, we assume that these distributions are over the same domain. A number of questions arise: (i) How should we use the surrogate data in training? (ii) How many surrogate samples should we add to the original data? (iii) Can we predict the improvement in test error achieved by adding surrogate samples to the training?

A natural approach would be to add the surrogate data to the original data in the usual training procedure, and indeed many authors have explored this approach (see Section 1.3). Namely, one attempts to minimize the overall empirical risk $\widehat{R}_{n+m}^{\text{naive}}(\boldsymbol{\theta})=\sum_{i=1}^{n}\ell(\boldsymbol{\theta};\boldsymbol{z}_i)+\sum_{i=1}^{m}\ell(\boldsymbol{\theta};\boldsymbol{z}^s_i)$, where $\ell(\boldsymbol{\theta};\boldsymbol{z})$ is a training loss function.

However, a moment of reflection reveals that this approach has serious shortcomings. Consider a simple mean estimation problem, whereby $\boldsymbol{z}_i\sim\mathsf{N}(\boldsymbol{\theta}_*,\boldsymbol{I}_d)$, $\boldsymbol{z}^s_i\sim\mathsf{N}(\boldsymbol{\theta}^s_*,\boldsymbol{I}_d)$, $\ell(\boldsymbol{\theta};\boldsymbol{z})=\|\boldsymbol{\theta}-\boldsymbol{z}\|^2$, and $R_{\text{test}}(\boldsymbol{\theta})=\|\boldsymbol{\theta}-\boldsymbol{\theta}_*\|^2$.
A straightforward calculation yields that the test error of the empirical risk minimizer $\hat{\boldsymbol{\theta}}_{n+m}^{\text{naive}}:=\arg\min_{\boldsymbol{\theta}}\widehat{R}_{n+m}^{\text{naive}}(\boldsymbol{\theta})$ is, in expectation over the training data,

$$R_{\text{test}}(\hat{\boldsymbol{\theta}}_{n+m}^{\text{naive}})=\left(\frac{m}{n+m}\right)^{2}\|\boldsymbol{\theta}^s_*-\boldsymbol{\theta}_*\|^2+\frac{d}{n+m}\,. \tag{1}$$

As $m$ increases, the variance (the second term) decreases, but the bias due to the difference $\|\boldsymbol{\theta}^s_*-\boldsymbol{\theta}_*\|$ increases, and the error approaches $\|\boldsymbol{\theta}^s_*-\boldsymbol{\theta}_*\|^2$, i.e. the model will be only as good as one trained only on surrogate data.
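The bias-variance trade-off described above can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names, the sample sizes, and the Monte Carlo setup are illustrative assumptions. The variance term is taken to be $d/(n+m)$, the trace of the covariance of the pooled mean, which reduces to $1/(n+m)$ when $d=1$.

```python
import random


def naive_pooled_error(n, m, theta, theta_s, trials=4000, seed=0):
    """Monte Carlo estimate of E||theta_hat - theta||^2 for the unweighted pooled mean."""
    rng = random.Random(seed)
    d = len(theta)
    total = 0.0
    for _ in range(trials):
        sq_err = 0.0
        for j in range(d):
            # Pool n original and m surrogate unit-variance Gaussian coordinates.
            s = sum(rng.gauss(theta[j], 1.0) for _ in range(n))
            s += sum(rng.gauss(theta_s[j], 1.0) for _ in range(m))
            sq_err += (s / (n + m) - theta[j]) ** 2
        total += sq_err
    return total / trials


def naive_pooled_error_formula(n, m, theta, theta_s):
    """Squared bias plus variance of the pooled mean (variance term d / (n + m))."""
    d = len(theta)
    gap_sq = sum((a - b) ** 2 for a, b in zip(theta_s, theta))
    return (m / (n + m)) ** 2 * gap_sq + d / (n + m)
```

Evaluating `naive_pooled_error_formula` for growing `m` at fixed `n` shows the error moving towards the squared distance between the two means, which is the failure mode of the unweighted approach.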

In order to overcome these limitations, we propose a weighted ERM approach, and will show that the weight plays a crucial role. Namely, we consider the following regularized empirical risk:

$$\widehat{R}_{n,m}(\boldsymbol{\theta};\alpha):=\frac{1-\alpha}{n}\sum_{i=1}^{n}\ell(\boldsymbol{\theta};\boldsymbol{z}_i)+\frac{\alpha}{m}\sum_{i=1}^{m}\ell(\boldsymbol{\theta};\boldsymbol{z}^s_i)+\Omega(\boldsymbol{\theta})\,, \tag{2}$$

where $\alpha\in[0,1]$ is the weight of the surrogate dataset and $\Omega:\mathbb{R}^d\to\mathbb{R}_{\ge 0}$ is a regularizer, e.g. a ridge penalty $\Omega(\boldsymbol{\theta})=\lambda\|\boldsymbol{\theta}\|_2^2$. We denote by

$$\hat{\boldsymbol{\theta}}_{n,m}(\alpha):=\arg\min_{\boldsymbol{\theta}}\widehat{R}_{n,m}(\boldsymbol{\theta};\alpha) \tag{3}$$

the corresponding empirical risk minimizer. The resulting test error is $R_{\text{test}}(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))$.
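To make the weighted objective concrete, here is a minimal sketch for the special case of the squared loss $\ell(\boldsymbol{\theta};\boldsymbol{z})=\|\boldsymbol{\theta}-\boldsymbol{z}\|^2$ with a ridge penalty, where the minimizer has a closed form. The function names and the list-of-lists data layout are assumptions made for illustration; other losses would require a numerical optimizer.

```python
def weighted_risk(theta, Z, Zs, alpha, lam=0.0):
    """Weighted regularized empirical risk with squared loss and ridge penalty."""
    def avg_sq_loss(data):
        return sum(
            sum((t - z) ** 2 for t, z in zip(theta, row)) for row in data
        ) / len(data)

    ridge = lam * sum(t * t for t in theta)
    return (1 - alpha) * avg_sq_loss(Z) + alpha * avg_sq_loss(Zs) + ridge


def column_means(data):
    """Coordinate-wise mean of a list of equal-length rows."""
    return [sum(col) / len(data) for col in zip(*data)]


def weighted_erm(Z, Zs, alpha, lam=0.0):
    """Closed-form minimizer: ((1 - alpha) * mean(Z) + alpha * mean(Zs)) / (1 + lam)."""
    zbar, zsbar = column_means(Z), column_means(Zs)
    return [((1 - alpha) * a + alpha * b) / (1 + lam) for a, b in zip(zbar, zsbar)]
```

Setting the gradient of `weighted_risk` to zero gives $(1+\lambda)\boldsymbol{\theta}=(1-\alpha)\bar{\boldsymbol{z}}+\alpha\bar{\boldsymbol{z}}^s$, which is what `weighted_erm` returns. Note that each dataset is normalized by its own size, so $\alpha$ alone controls the relative influence of the surrogate data.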

For supervised learning tasks, a sample $\boldsymbol{z}$ is represented as $\boldsymbol{z}=(y,\boldsymbol{x})$, where $\boldsymbol{x}\in\mathbb{R}^d$ is a covariate vector and $y\in\mathbb{R}$ is a response variable, and $\boldsymbol{\theta}$ parametrizes a family of models $f(\boldsymbol{x};\boldsymbol{\theta})$ that predict the response $y$ given the covariate vector $\boldsymbol{x}$. We consider losses of the form $\ell(\boldsymbol{\theta};\boldsymbol{z})=L(y,f(\boldsymbol{x};\boldsymbol{\theta}))$ and $R_{\text{test}}(\boldsymbol{\theta}):=\mathbb{E}_{\boldsymbol{z}\sim\mathcal{D}}L_{\text{test}}(y,f(\boldsymbol{x};\boldsymbol{\theta}))$ for some functions $L$ and $L_{\text{test}}$. We allow the test loss $L_{\text{test}}$ to be different from the training loss $L$, but we will omit the subscript ‘test’ whenever it is clear from the context.

Figure 1: IMDB and Rotten Tomatoes data and neural networks. Test error when trained on mixtures of original and surrogate data. Black curves: prediction from Eq. (4).

Figure 1 provides a preview of our results for a sentiment analysis task (technical details are provided in Section 4 and Appendix A.2). Each frame corresponds to a different combination of $n$ and $m$, and we report the test error of our approach as a function of the weight parameter $\alpha$ (red circles). Solid lines report the prediction of a scaling law that will be one of the main results presented below.

We observe that the weighted ERM approach systematically achieves better test error than either training only on original data ($\alpha\to 0$) or on surrogate data ($\alpha\to 1$). Further, the error for optimal $\alpha$ is always monotone decreasing in both $m$ and $n$, and the approach outperforms the naive unweighted approach. Also, while scaling laws typically do not capture the dependence on hyperparameters, the scaling law presented below predicts the dependence on $\alpha$ reasonably well. This is particularly useful, because it can be used to tune $\alpha$ optimally and to predict the amount of surrogate data needed.

1.2 Summary of results

We study the method outlined above both mathematically and via numerical experiments. Our mathematical results are developed in four different settings: (i) The Gaussian sequence model (Section 3.1); (ii) A non-parametric function estimation setting (Section 3.2); (iii) Low-dimensional empirical risk minimization (Section 3.3); (iv) High-dimensional ridge regression (Section 3.4).

We carry out experiments with the following data sources. (1) Simulated data from linear or Gaussian mixture models: this allows us to explicitly control the distribution shift between the original and surrogate datasets, as well as check our theoretical results in a controlled setting. (2) Real natural language processing (NLP) data for sentiment analysis, with the role of original dataset played by IMDB reviews and the role of surrogate datasets played respectively by Rotten Tomatoes reviews and Goodreads book reviews. (3) Progression-free survival analysis using Lasso on the TCGA PanCancer dataset, with female patient data and male patient data as original and surrogate data, respectively. (4) Real image classification data, with the CIFAR-10 and CIFAR-100 datasets respectively playing the role of original and surrogate data. Our results support the following conclusions:

Surrogate data improve test error. Including surrogate data in training generally improves the test error on the original data, even if the surrogate data distribution is far from the original one. In agreement with the interpretation of surrogate data as a regularizer (see also Section 2), the improvement is generally positive, although its size depends on the data distributions.

Tuning of $\alpha$. The above conclusion holds under the condition that $\alpha$ can be tuned (nearly) optimally. For each of the theoretical settings already mentioned, we characterize this optimal value. We verify that a nearly optimal $\alpha$ can be effectively selected by minimizing the error on a validation split of the original data. An attractive alternative is to use the scaling law we discuss next.

Scaling law. We propose a scaling law that captures the behavior of the test error as a function of $n$, $m$, and $\alpha$:

$$R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))-R_*\approx\alpha^{2}R^{\mathsf{ex}}_{\mathsf{su}}(\infty)+\big[\alpha^{2}\big(R^{\mathsf{ex}}_{\mathsf{su}}(m)-R^{\mathsf{ex}}_{\mathsf{su}}(\infty)\big)^{1/\beta}+(1-\alpha)^{2}R^{\mathsf{ex}}_{\mathsf{or}}(n)^{1/\beta}\big]^{\beta}\,. \tag{4}$$

Here $R_*$ is the minimal (Bayes) error, $R^{\mathsf{ex}}_{\mathsf{su}}(m):=R(\hat{\boldsymbol{\theta}}_{0,m}(1))-R_*$ is the excess test error when training on the surrogate data (and testing on original), $R^{\mathsf{ex}}_{\mathsf{or}}(n):=R(\hat{\boldsymbol{\theta}}_{n,0}(0))-R_*$ is the excess test error when training on original data (and testing on original), and $\beta$ is a scaling exponent as described in Section 4. (We assume here that $\lim_{n\to\infty}R(\hat{\boldsymbol{\theta}}_{n,0}(0))=R_*$, i.e. that we achieve Bayes risk with infinitely many original samples; see Section 5.) The above scaling law admits natural generalizations, see Section 5.

Practical uses of the scaling law. Given data $\{\boldsymbol{z}_i\}_{i\le n}$ and a source of surrogate data, we would like to predict how much the test error can be decreased by adding any number $m$ of surrogate samples to the mix. The scaling law (4) suggests a simple approach: (1) Learn models on purely original data to extract the behavior of the test loss $R(\hat{\boldsymbol{\theta}}_{n,0}(0))$. (2) Learn models on purely surrogate data to extract the behavior of $R(\hat{\boldsymbol{\theta}}_{0,m}(1))$. (A relatively small sample is sufficient for this step.) (3) Use the minimum over $\alpha$ of Eq. (4) to predict the test error at any given pair $n,m$.
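Step (3) of this procedure can be sketched as follows. This is an illustrative sketch only: it assumes that the excess errors $R^{\mathsf{ex}}_{\mathsf{su}}(m)$, $R^{\mathsf{ex}}_{\mathsf{su}}(\infty)$, $R^{\mathsf{ex}}_{\mathsf{or}}(n)$ and the exponent $\beta$ have already been estimated in steps (1) and (2), and it uses a plain grid search over $\alpha$. The function names and the grid size are hypothetical choices, not the authors' implementation.

```python
def scaling_law_excess(alpha, r_su_m, r_su_inf, r_or_n, beta):
    """Right-hand side of Eq. (4): predicted excess test error at weight alpha."""
    # Clip at zero in case noisy estimates give r_su_m < r_su_inf.
    finite_m = max(r_su_m - r_su_inf, 0.0)
    inner = (
        alpha ** 2 * finite_m ** (1 / beta)
        + (1 - alpha) ** 2 * r_or_n ** (1 / beta)
    )
    return alpha ** 2 * r_su_inf + inner ** beta


def best_alpha(r_su_m, r_su_inf, r_or_n, beta, grid_size=10001):
    """Grid search over alpha in [0, 1]; returns (alpha, predicted excess error)."""
    grid = (k / (grid_size - 1) for k in range(grid_size))
    return min(
        ((a, scaling_law_excess(a, r_su_m, r_su_inf, r_or_n, beta)) for a in grid),
        key=lambda pair: pair[1],
    )
```

Calling `best_alpha` for several candidate values of `r_su_m` (one per candidate $m$) gives a predicted error curve from which the amount of surrogate data can be chosen. At $\alpha=0$ and $\alpha=1$ the prediction reduces to the original-only and surrogate-only excess errors, respectively.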

1.3 Related work

The use of surrogate data to enhance training has attracted increasing research effort, in part because of recent progress in generative modeling.

This line of work has largely focused on the techniques to generate synthetic data that are well suited for training. A broad variety of methods have been demonstrated to be useful to generate data for computer vision tasks, ranging from object classification to semantic segmentation [RSM+16, JRBM+17, AAMM+18, TPA+18, CLCG19, HSY+22, MPT+22, YCFB+22]. We refer to [SLW20] for a review. More recently, synthetic data have been used for training in natural language processing [HNK+22, MHZH22].

Scaling laws have been broadly successful in guiding the development of large machine learning models [HNA+17, RRBS19, HKK+20, KMH+20, TDR+21, HKHM21, HBM+22, ANZ22, MRB+23]. We expect them to be similarly useful in integrating heterogeneous data into training. The change of scaling laws when training on synthetic data was the object of a recent empirical study [FCK+23]. On the other hand, no systematic attempt was made at integrating real and synthetic data.

Our results are also relevant to problems in transfer learning. However, that literature focuses on learning shared data representations rather than training a model for the target distribution [MPRP16, TJJ20].

2 Regularization, Gaussian mean estimation, Stein paradox

The role of the parameter $\alpha$ can be understood by considering the limit $m\to\infty$:

$$\widehat{R}_{n,\infty}(\boldsymbol{\theta};\alpha)=\frac{1-\alpha}{n}\sum_{i=1}^{n}\ell(\boldsymbol{\theta};\boldsymbol{z}_i)+\alpha\,R^{s}(\boldsymbol{\theta})+\Omega(\boldsymbol{\theta})\,,$$

where $R^{s}(\boldsymbol{\theta})=\mathbb{E}_{\boldsymbol{z}^s\sim\mathcal{D}^s}\ell(\boldsymbol{\theta};\boldsymbol{z}^s)$ is the population risk for surrogate data. This suggests thinking of the surrogate data as an additional (highly non-trivial) regularizer, with parameter $\alpha$. This leads to a simple yet important insight: adding surrogate data to the original data is beneficial if $\alpha$ is chosen optimally, and large $m$ reduces statistical fluctuations in this regularizer. This contrasts with the unweighted approach, whose test error in general deteriorates for large $m$.

As a toy example, reconsider the mean estimation problem mentioned in the introduction: $\boldsymbol{z}_i\sim\mathsf{N}(\boldsymbol{\theta}_*,\boldsymbol{I}_d)$ and $\boldsymbol{z}^s_i\sim\mathsf{N}(\boldsymbol{\theta}^s_*,\boldsymbol{I}_d)$, $\ell(\boldsymbol{\theta};\boldsymbol{z})=\|\boldsymbol{\theta}-\boldsymbol{z}\|^2$ and $R_{\text{test}}(\boldsymbol{\theta})=\|\boldsymbol{\theta}-\boldsymbol{\theta}_*\|^2$. We have $\hat{\boldsymbol{\theta}}_{n,m}(\alpha)=(1-\alpha)\sum_{i\le n}\boldsymbol{z}_i/n+\alpha\sum_{i\le m}\boldsymbol{z}^s_i/m$. In other words, the weighted ERM shrinks the mean of the original data towards the mean of the surrogate data. For a given $\alpha$, the resulting test error is

$$R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))=\alpha^{2}R^{\mathsf{ex}}_{\mathsf{su}}(\infty)+\left(\frac{\alpha^{2}}{m}+\frac{(1-\alpha)^{2}}{n}\right)d\,,\qquad R^{\mathsf{ex}}_{\mathsf{su}}(\infty)=\|\boldsymbol{\theta}_*-\boldsymbol{\theta}_*^{s}\|^2\,, \tag{5}$$

and for the optimum value $\alpha_*=\arg\min_{\alpha}R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))$, this yields

$$R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha_*))=\left(\frac{R^{\mathsf{ex}}_{\mathsf{su}}(\infty)+d/m}{R^{\mathsf{ex}}_{\mathsf{su}}(\infty)+d/m+d/n}\right)\cdot\frac{d}{n}\,. \tag{6}$$
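For completeness, the step from Eq. (5) to Eq. (6) can be spelled out as follows. The shorthand $A$ and $B$ is introduced here only for this derivation and does not appear elsewhere in the text.

```latex
% Shorthand (introduced here): A := R^{ex}_{su}(\infty) + d/m,  B := d/n.
\begin{align*}
R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha)) &= \alpha^{2} A + (1-\alpha)^{2} B, \\
\frac{\mathrm{d}}{\mathrm{d}\alpha} R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))
  &= 2\alpha A - 2(1-\alpha) B = 0
  \;\Longrightarrow\;
  \alpha_* = \frac{B}{A+B}
           = \frac{d/n}{R^{\mathsf{ex}}_{\mathsf{su}}(\infty) + d/m + d/n}, \\
R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha_*))
  &= \frac{A B^{2} + A^{2} B}{(A+B)^{2}}
   = \frac{A}{A+B}\, B .
\end{align*}
```

Since $A>0$ and $B>0$, the minimizer $\alpha_*$ lies strictly between 0 and 1, and it moves towards 0 as the surrogate distribution gets farther from the original one.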

Note that $d/n$ is the error of training only on original data and the prefactor is always strictly smaller than one. Hence, weighted ERM always achieves better error than training only on original data, regardless of the distance between original and surrogate data, although the improvement is larger for small $R^{\mathsf{ex}}_{\mathsf{su}}(\infty)$. This might seem paradoxical at first. As mentioned above, we are shrinking towards an arbitrary point given by the empirical mean of the surrogate data: how can this help?

In fact, this is a disguised version of the celebrated Stein paradox [EM77, Ste81]: in estimating a Gaussian mean, a procedure that shrinks the empirical mean towards an arbitrary point by a carefully chosen amount outperforms the naive empirical mean. In our toy example, the naive empirical mean corresponds to estimation purely based on the original data, and we shrink it towards the mean of the surrogate data. Of course, the improvement over the empirical mean is only possible if $\alpha$ is chosen optimally. Equation (6) assumes $\alpha=\alpha_*$ is chosen by an oracle that knows the value of $R^{\mathsf{ex}}_{\mathsf{su}}(\infty)$. Stein’s analysis implies that in the Gaussian mean problem, $\alpha$ can be chosen empirically as long as the dimension of $\boldsymbol{\theta}$ is $d\ge 3$. In the settings we are interested in, $\alpha$ can be chosen via cross-validation.
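A data-driven choice of $\alpha$ in the toy mean estimation problem can be sketched as below. This is a simplified illustration rather than the procedure used in the experiments: it uses a single hold-out split of the original data instead of full cross-validation, and the function name, the data layout, and the grid of candidate weights are assumptions.

```python
def select_alpha_holdout(train, val, surrogate, grid):
    """Pick the alpha in grid minimizing average squared loss on held-out original data."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]

    train_mean, surr_mean = mean(train), mean(surrogate)

    def val_loss(alpha):
        # Weighted ERM estimate: shrink the original mean towards the surrogate mean.
        theta = [(1 - alpha) * a + alpha * b for a, b in zip(train_mean, surr_mean)]
        return sum(
            sum((t - z) ** 2 for t, z in zip(theta, row)) for row in val
        ) / len(val)

    return min(grid, key=val_loss)
```

For example, `select_alpha_holdout(train, val, surrogate, [k / 10 for k in range(11)])` scans eleven equally spaced weights. Only original data is held out, since the test error is measured on the original distribution.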

3 Theoretical results

3.1 Gaussian sequence model

The sequence model captures the behavior of many models in non-parametric statistics while being simpler to analyze [Tsy09, GN21]. It is also known to approximate the behavior of overparametrized linear regression [CM22]. The unknown target is $\boldsymbol{\theta}_*\in\mathbb{R}^d$ (with potentially $d=\infty$), and we observe

$$\boldsymbol{y}_i=\boldsymbol{\theta}_*+\sigma\,\boldsymbol{g}_i,\;\; i\leq n\,,\qquad \boldsymbol{y}^s_i=\boldsymbol{\theta}^s_*+\sigma_s\,\boldsymbol{g}^s_i,\;\; i\leq m\,, \tag{7}$$

where $\boldsymbol{\theta}^s_*$ is also unknown, and $\boldsymbol{g}_i,\boldsymbol{g}^s_i\sim\mathsf{N}(0,\boldsymbol{I}_d)$ are i.i.d. We study the penalized estimator

$$\hat{\boldsymbol{\theta}}_{n,m}(\alpha):=\arg\min_{\boldsymbol{\theta}}\Big\{\frac{1-\alpha}{n}\sum_{i=1}^{n}\|\boldsymbol{y}_i-\boldsymbol{\theta}\|_2^2+\frac{\alpha}{m}\sum_{i=1}^{m}\|\boldsymbol{y}^s_i-\boldsymbol{\theta}\|_2^2+\lambda\|\boldsymbol{\theta}\|_{\boldsymbol{\Omega}}^2\Big\}\,, \tag{8}$$

where $\|\boldsymbol{\theta}\|_{\boldsymbol{\Omega}}^2=\langle\boldsymbol{\theta},\boldsymbol{\Omega}\boldsymbol{\theta}\rangle$ and $\boldsymbol{\Omega}\succeq\boldsymbol{0}$ is a regularization weight matrix. We will be concerned with the expected risk

$$R_{n,m}(\alpha,\lambda)=\mathbb{E}\Big\{\big\|\hat{\boldsymbol{\theta}}_{n,m}(\alpha)-\boldsymbol{\theta}_*\big\|^2\Big\}\,. \tag{9}$$
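The objective in (8) is a strictly convex quadratic, so its minimizer has a closed form: setting the gradient to zero gives $(\boldsymbol{I}+\lambda\boldsymbol{\Omega})\boldsymbol{\theta}=(1-\alpha)\bar{\boldsymbol{y}}+\alpha\bar{\boldsymbol{y}}^s$, where $\bar{\boldsymbol{y}}$, $\bar{\boldsymbol{y}}^s$ denote the two empirical means. The sketch below implements this formula and the resulting bias-variance expression for the risk (9). It assumes finite $d$ and works in the eigenbasis of $\boldsymbol{\Omega}$, so that $\boldsymbol{\Omega}={\rm diag}(\omega_1,\dots,\omega_d)$; the function names and the power-law values in the usage example are hypothetical.

```python
import numpy as np


def weighted_sequence_estimator(y, y_s, alpha, lam, omega):
    """Minimizer of objective (8) when Omega = diag(omega).

    y: (n, d) original observations; y_s: (m, d) surrogate observations.
    First-order condition:
        (I + lam * Omega) theta = (1 - alpha) * mean(y) + alpha * mean(y_s).
    """
    pooled = (1.0 - alpha) * y.mean(axis=0) + alpha * y_s.mean(axis=0)
    return pooled / (1.0 + lam * omega)


def objective(theta, y, y_s, alpha, lam, omega):
    """Value of the penalized objective in (8) for Omega = diag(omega)."""
    fit = (1.0 - alpha) * np.mean(np.sum((y - theta) ** 2, axis=1))
    fit_s = alpha * np.mean(np.sum((y_s - theta) ** 2, axis=1))
    return fit + fit_s + lam * np.sum(omega * theta ** 2)


def exact_risk(theta_star, theta_star_s, alpha, lam, omega, sigma, sigma_s, n, m):
    """Expected risk (9): squared bias plus variance, summed over coordinates."""
    shrink = 1.0 / (1.0 + lam * omega)
    mean = shrink * ((1.0 - alpha) * theta_star + alpha * theta_star_s)
    var = shrink ** 2 * ((1.0 - alpha) ** 2 * sigma ** 2 / n
                         + alpha ** 2 * sigma_s ** 2 / m)
    return np.sum((mean - theta_star) ** 2 + var)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n, m = 200, 50, 500
    k = np.arange(1, d + 1)
    omega = k ** 1.0                       # hypothetical choice: omega_k = k
    theta_star = k ** -1.5                 # hypothetical decaying target
    theta_star_s = 1.05 * theta_star       # hypothetical surrogate target
    y = theta_star + rng.normal(size=(n, d))
    y_s = theta_star_s + rng.normal(size=(m, d))
    theta_hat = weighted_sequence_estimator(y, y_s, alpha=0.5, lam=1e-3, omega=omega)
    print("one-sample squared error:", np.sum((theta_hat - theta_star) ** 2))
    print("expected risk (9):       ",
          exact_risk(theta_star, theta_star_s, 0.5, 1e-3, omega, 1.0, 1.0, n, m))
```

Because the risk is available in closed form here, one can scan $\alpha$ and $\lambda$ directly rather than relying on Monte Carlo averages.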

The proof of the next result is presented in Appendix C.

Theorem 1.

Let $\omega_1\leq\omega_2\leq\cdots$ be the ordered eigenvalues of $\boldsymbol{\Omega}$, and denote by $\boldsymbol{v}_i$ the corresponding eigenvectors. Further denote by $\boldsymbol{\theta}_{*,>k}$, $\boldsymbol{\theta}^s_{*,>k}$ the projections of $\boldsymbol{\theta}_*$, $\boldsymbol{\theta}^s_*$ onto ${\rm span}(\boldsymbol{v}_i: i>k)$, and similarly for $\boldsymbol{\theta}_{*,\leq k}$, $\boldsymbol{\theta}^s_{*,\leq k}$.
Assume that $\omega_k\asymp k^{\mu}$, $\mu>1/2$, $\|\boldsymbol{\theta}_{*,>k}\|^2\leq C_{\theta}k^{-2\rho}$, $\rho\neq\mu$, and let $\Delta_k$ be such that (for all $k$): $\Delta_k:=\omega_k^{-1}\big|\langle\boldsymbol{\theta}_{*,\leq k}-\boldsymbol{\theta}^s_{*,\leq k},\boldsymbol{\theta}_{*,\leq k}\rangle_{\boldsymbol{\Omega}}\big|\leq C_0k^{-2(\mu\wedge\rho)}$. Then the following hold:

  (a) There exists an explicit $\lambda_*(\alpha)$ such that, letting $\beta:=2(\mu\wedge\rho)/(1+2(\mu\wedge\rho))$,

    $$R_{n,m}\big(\alpha,\lambda_*(\alpha)\big)\leq\alpha^2R^{\mathrm{ex}}_{\mathrm{su}}(\infty)+C\cdot\left[(1-\alpha)^2\frac{\sigma^2}{n}+\alpha^2\frac{\sigma_s^2}{m}\right]^{\beta}\,. \tag{10}$$

  (b) If $\mu>2\rho-1/2$, there exists $C'>0$ and there exist $\boldsymbol{\theta}_*,\boldsymbol{\theta}^s_*$ satisfying the assumptions in point (a), such that

    $$\min_{\lambda}R_{n,m}\big(\alpha,\lambda\big)\geq\alpha^2R^{\mathrm{ex}}_{\mathrm{su}}(\infty)+C'\cdot\left[(1-\alpha)^2\frac{\sigma^2}{n}+\alpha^2\frac{\sigma_s^2}{m}\right]^{\beta}\,. \tag{11}$$

Note that since the theorem also implies $R^{\mathrm{ex}}_{\mathrm{su}}(m)-R^{\mathrm{ex}}_{\mathrm{su}}(\infty)\asymp(\sigma_s^2/m)^{\beta}$ and $R^{\mathrm{ex}}_{\mathrm{or}}(n)\asymp(\sigma^2/n)^{\beta}$, this result confirms the scaling law (4).
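To see how the upper bound (10) trades off the surrogate bias term against the pooled noise term, one can evaluate its right-hand side on a grid of weights. The sketch below is illustrative only: the constant $C$ is not specified by the theorem, so it appears as a hypothetical parameter `const` (set to 1 by default), and the grid search is a simple stand-in for any one-dimensional minimizer.

```python
import numpy as np


def beta_sequence(mu, rho):
    """Exponent beta = 2*min(mu, rho) / (1 + 2*min(mu, rho)) from Theorem 1(a)."""
    s = min(mu, rho)
    return 2.0 * s / (1.0 + 2.0 * s)


def bound_rhs(alpha, r_su_inf, sigma, sigma_s, n, m, beta, const=1.0):
    """Right-hand side of (10); `const` stands in for the unspecified constant C."""
    noise = (1.0 - alpha) ** 2 * sigma ** 2 / n + alpha ** 2 * sigma_s ** 2 / m
    return alpha ** 2 * r_su_inf + const * noise ** beta


def best_alpha_on_grid(r_su_inf, sigma, sigma_s, n, m, beta, const=1.0, num=1001):
    """Grid search for the alpha in [0, 1] minimizing bound_rhs."""
    grid = np.linspace(0.0, 1.0, num)
    values = bound_rhs(grid, r_su_inf, sigma, sigma_s, n, m, beta, const)
    k = int(np.argmin(values))
    return grid[k], values[k]


if __name__ == "__main__":
    beta = beta_sequence(mu=1.0, rho=2.0)
    for r_su_inf in (0.0, 1e-3, 1e-1):
        a, v = best_alpha_on_grid(r_su_inf, 1.0, 1.0, n=100, m=10_000, beta=beta)
        print(f"R_su(inf)={r_su_inf:g}  alpha={a:.3f}  bound={v:.5f}")
```

In this sketch the minimizing weight moves towards zero as $R^{\mathrm{ex}}_{\mathrm{su}}(\infty)$ grows, and towards one when the surrogate data are unbiased and much more plentiful than the original data.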

3.2 Non-parametric regression in Sobolev classes

In this section we consider the classic non-parametric regression model. We assume that $n=Q^d$ for some integer $Q\geq 2$, and the original data $(\boldsymbol{x}_i,y_i)_{i\leq n}$ are defined through

$$y_i=f_*(\boldsymbol{x}_i)+\varepsilon_i\,,\qquad\varepsilon_i\sim\mathsf{N}(0,\sigma^2)\,, \tag{12}$$

where $\varepsilon_i$ are independent of $\boldsymbol{x}_i$ and of each other, and $\{\boldsymbol{x}_i\}_{i\leq n}$ are equally spaced grid points in the $d$-dimensional unit cube, i.e. $\mathcal{X}_n=\{\boldsymbol{q}/Q:\;\boldsymbol{q}\in[Q]^d\}$. Surrogate data have a similar distribution, with $m=Q_s^d$ equally spaced points $\boldsymbol{x}^s_i$ in the unit cube, and $y^s_i=f_{*,s}(\boldsymbol{x}^s_i)+\varepsilon^s_i$, where $\varepsilon^s_i\sim\mathsf{N}(0,\sigma_s^2)$. We assume that $f_*$ has small Sobolev norm, that is,

$$\|f_*\|^2_{r,2}:=\int_{[0,1]^d}\big(|f_*(t)|^2+\|f_*^{(r)}(t)\|^2\big)\,{\rm d}t\leq 1\,.$$

Recall that $\|f\|^2_{r,2}$ is a special reproducing kernel Hilbert space (RKHS) norm: we expect some of the considerations below to generalize to other RKHS norms.

Following our general methodology, we use the estimator

$$\hat{f}_{n,m,\alpha}=\arg\min_{f}\bigg\{\frac{1-\alpha}{n}\sum_{i=1}^{n}\big(y_i-f(\boldsymbol{x}_i)\big)^2+\frac{\alpha}{m}\sum_{i=1}^{m}\big(y^s_i-f(\boldsymbol{x}^s_i)\big)^2+\lambda\|f\|_{p,2}^2\bigg\}\,. \tag{13}$$

We are interested in $R(f)=\mathbb{E}\{(f(\boldsymbol{x})-f_*(\boldsymbol{x}))^2\}$, which is the excess squared loss for a test point $\boldsymbol{x}\sim\mathsf{Unif}([0,1]^d)$.

In order to avoid technical burden we will carry out the analysis for a continuous model, the so-called white noise model, where we observe the function $f$ at all points $\boldsymbol{x}\in[0,1]^d$, perturbed by $d$-dimensional white noise:

$${\rm d}Y=f_*(\boldsymbol{x})\,{\rm d}\boldsymbol{x}+\frac{\sigma}{\sqrt{n}}\,{\rm d}B(\boldsymbol{x})\,, \tag{14}$$

and similarly for $Y^s$. We use an estimator that naturally generalizes (13) to the continuous case. Our results for the white noise model are as follows.

Theorem 2.

Let $\beta=(2p\wedge 4r)/(d+(2p\wedge 4r))$. If $r>d/4$ and $\lambda=(\delta K_{n,m}\sigma^2)^{2r/(d+(2p\wedge 4r))}$, then for every $\delta\in(0,1)$ there exists a constant $C=C(d,\delta)$ such that

$$R(\hat{f}_{n,m,\alpha})\leq(1+\delta)\,\alpha^2R^{\mathrm{ex}}_{\mathrm{su}}(\infty)+C\left\{(1-\alpha)^2\cdot\frac{\sigma^2}{n}+\alpha^2\cdot\frac{\sigma_s^2}{m}\right\}^{\beta} \tag{15}$$

with high probability, where $K_{n,m}:=(1-\alpha)^2/n+\alpha^2/m$.
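For concreteness, the quantities entering Theorem 2 can be computed as follows. This is a direct transcription of the formulas for $\beta$, $K_{n,m}$ and $\lambda$ stated above; the function names and the example values are hypothetical.

```python
def sobolev_beta(p, r, d):
    """Exponent beta = (2p ^ 4r) / (d + (2p ^ 4r)), where ^ denotes the minimum."""
    s = min(2 * p, 4 * r)
    return s / (d + s)


def k_nm(alpha, n, m):
    """K_{n,m} = (1 - alpha)^2 / n + alpha^2 / m."""
    return (1.0 - alpha) ** 2 / n + alpha ** 2 / m


def lambda_theorem2(alpha, n, m, sigma, delta, p, r, d):
    """Regularization level lambda = (delta * K_{n,m} * sigma^2)^(2r / (d + (2p ^ 4r)))."""
    s = min(2 * p, 4 * r)
    return (delta * k_nm(alpha, n, m) * sigma ** 2) ** (2.0 * r / (d + s))


if __name__ == "__main__":
    # Hypothetical example: penalty order p = 2, target smoothness r = 1, dimension d = 1.
    print(sobolev_beta(p=2, r=1, d=1))
    print(lambda_theorem2(alpha=0.3, n=1000, m=10_000, sigma=1.0, delta=0.5, p=2, r=1, d=1))
```

Note that $\beta$ approaches the parametric value 1 when $2p\wedge 4r$ is large compared to $d$.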

Remark 3.1.

The white noise model (14) is known to be equivalent to the original model (12) (with deterministic equispaced designs) in the sense of Le Cam, for $r>d/2$ [BL96, Rei08]. While suggestive, this equivalence does not allow us to formally deduce results for the data (12), because it does not apply to the specific estimators of interest here.

With the given choice of $\lambda$, $r$, the derivation of (15) also implies $R^{\mathrm{ex}}_{\mathrm{su}}(m)-R^{\mathrm{ex}}_{\mathrm{su}}(\infty)\geq C'(\sigma_s^2/m)^{\beta}$, $R^{\mathrm{ex}}_{\mathrm{or}}(n)\geq C'(\sigma^2/n)^{\beta}$ (for the least favorable $f$ [Tsy09]). Hence (15) is consistent with the scaling law (4).

3.3 Low-dimensional asymptotics

We study the estimator of Eqs. (2), (3) under the classical asymptotics $n,m\to\infty$ at $d$ fixed. Since this type of analysis is more standard, we defer it to Appendix B. The main result of this analysis is that the scaling law (4) holds in this setting, with the classical parametric exponent $\beta=1$, for $\alpha\in[0,\alpha_{\max}]$ for a suitable $\alpha_{\max}\in(0,1)$. Importantly, the interval $[0,\alpha_{\max}]$ includes the optimal choice of the weight $\alpha$.

3.4 High-dimensional linear regression

In this section, we study ridge regression in the high-dimensional regime in which the number of samples is proportional to the number of parameters. Denoting the original data by $(\boldsymbol{y},\boldsymbol{X})$ (with $\boldsymbol{y}\in\mathbb{R}^n$ the vector of responses and $\boldsymbol{X}\in\mathbb{R}^{n\times d}$ the matrix of covariates), and the surrogate data by $(\boldsymbol{y}^s,\boldsymbol{X}^s)$ (with $\boldsymbol{y}^s\in\mathbb{R}^m$ and $\boldsymbol{X}^s\in\mathbb{R}^{m\times d}$), we minimize the regularized empirical risk

$$\widehat{R}_{n,m}(\boldsymbol{\theta};\alpha)=\frac{1-\alpha}{2n}\|\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\|_2^2+\frac{\alpha}{2m}\|\boldsymbol{y}^s-\boldsymbol{X}^s\boldsymbol{\theta}\|_2^2+\frac{\lambda}{2}\,\|\boldsymbol{\theta}\|_2^2\,. \tag{16}$$

We assume a simple distribution, whereby the rows of $\boldsymbol{X}$, $\boldsymbol{X}^s$ (denoted by $\boldsymbol{x}_i$, $\boldsymbol{x}^s_i$) are standard normal vectors and

$$\boldsymbol{y}=\boldsymbol{X}\boldsymbol{\theta}_*+\boldsymbol{\varepsilon}\,,\qquad\boldsymbol{y}^s=\boldsymbol{X}^s\boldsymbol{\theta}^s_*+\boldsymbol{\varepsilon}^s\,, \tag{17}$$

for $\boldsymbol{\varepsilon}\sim\mathsf{N}(\boldsymbol{0},\sigma^2\boldsymbol{I}_n)$, $\boldsymbol{\varepsilon}^s\sim\mathsf{N}(\boldsymbol{0},\sigma_s^2\boldsymbol{I}_m)$. Note that the two data distributions differ in the true coefficient vectors $\boldsymbol{\theta}_*$ versus $\boldsymbol{\theta}^s_*$ as well as in the noise variance. We will denote by $\hat{\boldsymbol{\theta}}_{n,m}(\alpha)$ the ridge estimator, $\hat{\boldsymbol{\theta}}_{n,m}(\alpha)=\arg\min_{\boldsymbol{\theta}\in\mathbb{R}^d}\widehat{R}_{n,m}(\boldsymbol{\theta};\alpha)$.
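Since (16) is a quadratic in $\boldsymbol{\theta}$, the ridge estimator solves the linear system $\big[\tfrac{1-\alpha}{n}\boldsymbol{X}^{\top}\boldsymbol{X}+\tfrac{\alpha}{m}\boldsymbol{X}^{s\top}\boldsymbol{X}^s+\lambda\boldsymbol{I}_d\big]\boldsymbol{\theta}=\tfrac{1-\alpha}{n}\boldsymbol{X}^{\top}\boldsymbol{y}+\tfrac{\alpha}{m}\boldsymbol{X}^{s\top}\boldsymbol{y}^s$. The following sketch simulates model (17) and computes this solution with NumPy. It assumes $\lambda>0$ so that the system is always invertible; the function names, dimensions and parameter values are hypothetical choices for illustration.

```python
import numpy as np


def weighted_ridge(X, y, X_s, y_s, alpha, lam):
    """Minimizer of the regularized empirical risk (16), assuming lam > 0.

    First-order condition:
        [(1-alpha)/n X^T X + alpha/m X_s^T X_s + lam I] theta
            = (1-alpha)/n X^T y + alpha/m X_s^T y_s.
    """
    n, d = X.shape
    m = X_s.shape[0]
    A = (1.0 - alpha) / n * (X.T @ X) + alpha / m * (X_s.T @ X_s) + lam * np.eye(d)
    b = (1.0 - alpha) / n * (X.T @ y) + alpha / m * (X_s.T @ y_s)
    return np.linalg.solve(A, b)


def simulate(n, m, d, theta, theta_s, sigma, sigma_s, rng):
    """Draw (X, y) and (X_s, y_s) from model (17) with standard normal rows."""
    X = rng.normal(size=(n, d))
    X_s = rng.normal(size=(m, d))
    y = X @ theta + sigma * rng.normal(size=n)
    y_s = X_s @ theta_s + sigma_s * rng.normal(size=m)
    return X, y, X_s, y_s


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, n, m = 100, 150, 600                       # hypothetical sizes
    theta = rng.normal(size=d) / np.sqrt(d)       # hypothetical target, norm about 1
    theta_s = theta + 0.3 * rng.normal(size=d) / np.sqrt(d)
    X, y, X_s, y_s = simulate(n, m, d, theta, theta_s, 0.5, 0.5, rng)
    for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
        theta_hat = weighted_ridge(X, y, X_s, y_s, alpha, lam=0.01)
        # Squared distance between the estimate and the original-data coefficients.
        print(alpha, np.sum((theta_hat - theta) ** 2))
```

Scanning $\alpha$ in this way on a single draw gives only a noisy picture; averaging over several draws, or selecting $\alpha$ by cross-validation as discussed earlier, is the more careful option.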

The excess test error (for square loss) is given by $R(\hat{\boldsymbol{\theta}}):=\mathbb{E}\big\{\big(\langle\boldsymbol{x},\boldsymbol{\theta}_*\rangle-\langle\boldsymbol{x},\hat{\boldsymbol{\theta}}\rangle\big)^2\big\}=\|\hat{\boldsymbol{\theta}}-\boldsymbol{\theta}_*\|^2$. The next result characterizes this error in the proportional asymptotics.

Theorem 3.

Consider the ridge regression estimator $\hat{\boldsymbol{\theta}}_{n,m}(\alpha)$. Let $r:=\|\boldsymbol{\theta}_*\|_2$, $r_s:=\|\boldsymbol{\theta}^s_*\|_2$ and $\gamma:=\cos^{-1}\big(\langle\boldsymbol{\theta}_*,\boldsymbol{\theta}^s_*\rangle/(\|\boldsymbol{\theta}_*\|_2\|\boldsymbol{\theta}^s_*\|_2)\big)$.
Assume $n,m,d\to\infty$ such that $n/d\to\delta$, $m/d\to\delta_s$, with $\delta+\delta_s>1$. (Footnote 2: The same proof, with some additional technical work, yields a characterization for $\delta+\delta_s\leq 1$ as well. We omit it here for brevity.) For $\mathscr{R}(\cdot)$ defined in Appendix E.1, let

$$\xi^*(\alpha),\, \xi^*_\perp(\alpha),\, \omega^*(\alpha) = \operatorname*{argmin}_{\xi,\, \xi_\perp \ge 0,\, \omega \ge 0} \mathscr{R}(\xi, \xi_\perp, \omega, \alpha, r, r_s, \delta, \delta_s, \lambda, \gamma, \sigma, \sigma_s),$$

be the unique minimizer. Then for any $\varepsilon, \varepsilon_0 > 0$, there exists $c > 0$ such that, for all $n$,

$$\mathbb{P}\Big(\sup_{\alpha \in [\varepsilon_0,\, 1-\varepsilon_0]} \big|R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha)) - \mathscr{R}_{\text{test}}(\alpha)\big| \le \varepsilon\Big) \ge 1 - 2\,e^{-cn}\,,$$

where $\mathscr{R}_{\text{test}}(\alpha) := (\xi^*(\alpha) - r)^2 + (\xi^*_\perp(\alpha))^2 + (\omega^*(\alpha))^2$. Further, we can take $\varepsilon_0 = 0$ if $\delta, \delta_s > 1$.

Remark 3.2 (Optimizing $\alpha$ over the validation set).

Note that the concentration of $R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))$ around the theoretical prediction $\mathscr{R}_{\text{test}}(\alpha)$ in Theorem 3 is uniform over $\alpha \in [\varepsilon_0, 1-\varepsilon_0]$. This means that we can find the optimal $\alpha$ by computing $\hat{\boldsymbol{\theta}}_{n,m}(\alpha)$ over a grid of $\alpha$ values, estimating $R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))$ on the validation set, and choosing the $\alpha$ with the smallest estimated risk. The uniform guarantee ensures that this procedure achieves risk $\min_{\alpha \in [0,1]} \mathscr{R}_{\text{test}}(\alpha) + o_P(1)$.
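
The selection procedure of Remark 3.2 can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes the weighted ridge objective $\frac{1-\alpha}{n}\|\boldsymbol{y} - \boldsymbol{X}\boldsymbol{\theta}\|^2 + \frac{\alpha}{m}\|\boldsymbol{y}^s - \boldsymbol{X}^s\boldsymbol{\theta}\|^2 + \lambda\|\boldsymbol{\theta}\|^2$ (one natural reading of weighted ERM for this model), and the function names, sample sizes, noise levels, and grid are hypothetical choices.

```python
import numpy as np

def weighted_ridge(X, y, Xs, ys, alpha, lam):
    """Closed-form minimizer of the (assumed) weighted ridge objective
    (1-alpha)/n * ||y - X t||^2 + alpha/m * ||ys - Xs t||^2 + lam * ||t||^2."""
    n, d = X.shape
    m = Xs.shape[0]
    H = (1 - alpha) / n * X.T @ X + alpha / m * Xs.T @ Xs + lam * np.eye(d)
    b = (1 - alpha) / n * X.T @ y + alpha / m * Xs.T @ ys
    return np.linalg.solve(H, b)

def select_alpha(X, y, Xs, ys, Xval, yval, alphas, lam):
    """Fit one estimator per alpha and score it on the validation set."""
    risks = np.array([
        np.mean((yval - Xval @ weighted_ridge(X, y, Xs, ys, a, lam)) ** 2)
        for a in alphas
    ])
    return alphas[int(np.argmin(risks))], risks

# Toy data: every size and parameter below is an illustrative assumption.
rng = np.random.default_rng(0)
d, n, m, n_val, lam = 20, 40, 400, 2000, 1e-3
theta = rng.standard_normal(d)
theta /= np.linalg.norm(theta)
theta_s = theta + 0.3 * rng.standard_normal(d)   # surrogate parameter, close to theta
theta_s /= np.linalg.norm(theta_s)
X, Xs, Xval = (rng.standard_normal((k, d)) for k in (n, m, n_val))
y = X @ theta + rng.standard_normal(n)
ys = Xs @ theta_s + rng.standard_normal(m)
yval = Xval @ theta + rng.standard_normal(n_val)  # validation data follow the original distribution

alphas = np.linspace(0.0, 1.0, 21)
best_alpha, val_risks = select_alpha(X, y, Xs, ys, Xval, yval, alphas, lam)
print(f"selected alpha = {best_alpha:.2f}, validation risk = {val_risks.min():.3f}")
```

Because the grid includes $\alpha = 0$, the selected weight can never do worse on the validation set than training on original data alone; whether that carries over to the test set depends on the validation set being large enough.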

Remark 3.3 (Relation to scaling laws).

An analysis of the equations for $(\xi^*, \xi^*_\perp, \omega^*)$ reveals that, for large $\delta, \delta_s$, the predicted excess risk behaves as $\mathscr{R}_{\text{test}}(\alpha) = \alpha^2 \mathscr{R}^*_{s,\infty} + \alpha^2 C_1/\delta_s + (1-\alpha)^2 C_2/\delta + o(1/\delta, 1/\delta_s)$ (for some constants $\mathscr{R}^*_{s,\infty}, C_1, C_2$). This matches the low-dimensional asymptotics and our scaling law (4) with $\beta = 1$. In practice, we find that, for moderate $\delta, \delta_s$, the behavior of $\mathscr{R}_{\text{test}}(\alpha)$ is better approximated by a different value of $\beta$ (see Appendix A).
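
To see how an expansion of this type determines the weight, one can minimize its leading-order part over $\alpha$. The short calculation below is a sketch that drops the remainder term; the shorthand $a$ (the total coefficient of $\alpha^2$) and $b$ (the coefficient of $(1-\alpha)^2$) is introduced here only for illustration, with $a, b > 0$.

$$f(\alpha) = a\,\alpha^2 + b\,(1-\alpha)^2, \qquad f'(\alpha) = 2a\alpha - 2b(1-\alpha) = 0 \;\Longrightarrow\; \alpha^\star = \frac{b}{a+b}, \qquad f(\alpha^\star) = \frac{ab}{a+b}.$$

Since $ab/(a+b) < \min(a, b)$, at this order the optimally weighted mixture improves on both $\alpha = 0$ (value $b$) and $\alpha = 1$ (value $a$), and the optimal weight lies strictly inside $(0,1)$.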

4 Empirical results

In this section, we present experiments validating that the scaling law (4) is a good approximation for both simulated and real-world data. For simulated data, we select two different distributions for the original and surrogate datasets. The test and validation sets are generated from the same distribution as the original dataset. For real-world data, we choose two different datasets as the original and surrogate datasets. We split the original dataset into train, test, and validation sets, while all examples in the surrogate dataset are allocated solely to the train split.

For each dataset and model discussed in this section, we carry out the same experiment: $(i)$ We use models trained on original data to fit the scaling curve $R(\hat{\boldsymbol{\theta}}_{n,0}(0)) = A_{\text{or}} + B_{\text{or}}\, n^{-\beta_{\text{or}}}$ and obtain $A_{\text{or}}$ and $\beta_{\text{or}}$. $(ii)$ We use models trained on purely surrogate data to fit the scaling curve $R(\hat{\boldsymbol{\theta}}_{0,m}(1)) = A_{\text{su}} + B_{\text{su}}\, m^{-\beta_{\text{su}}}$ and obtain $A_{\text{su}}$ and $\beta_{\text{su}}$. $(iii)$ Since we assume $R_* = R(\hat{\boldsymbol{\theta}}_{\infty,0}(0))$, we let $R_* = A_{\text{or}}$, form the excess risk estimates $R_{\text{or}}^{\text{ex}}(n) = R(\hat{\boldsymbol{\theta}}_{n,0}(0)) - A_{\text{or}}$, $R_{\text{su}}^{\text{ex}}(m) = R(\hat{\boldsymbol{\theta}}_{0,m}(1)) - A_{\text{or}}$ and $R_{\text{su}}^{\text{ex}}(\infty) = A_{\text{su}} - A_{\text{or}}$, and use $\beta = \beta_{\text{or}}$ (the fit exponent obtained from original data). $(iv)$ For each combination of $n, m$, we use our estimates of $R_{\text{su}}^{\text{ex}}(m)$, $R_{\text{or}}^{\text{ex}}(n)$ (as measured empirically on the test set), $\beta$, $R_{\text{su}}^{\text{ex}}(\infty)$, and $R_*$ to plot the predicted $R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))$ as a function of $\alpha$ using scaling law (4). $(v)$ We then train the model using $n$ original and $m$ surrogate examples with weights $(1-\alpha)$ and $\alpha$ for the two datasets, respectively. We average the results of 10 independent runs and compare them against those predicted by the scaling law. For ridge regression, we also compare with exact high-dimensional asymptotics from Theorem 3.
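
Steps $(i)$ to $(iii)$ amount to fitting a three-parameter curve and subtracting the fitted asymptote. The sketch below is one possible way to do this using NumPy only: since $A + B n^{-\beta}$ is linear in $(A, B)$ for fixed $\beta$, it scans a grid of $\beta$ values and solves an ordinary least-squares problem for each. The helper `fit_power_law`, the $\beta$ grid, and the numerical test errors are illustrative assumptions, not the fitting procedure or the data used in the paper.

```python
import numpy as np

def fit_power_law(sizes, risks, betas=None):
    """Fit risks ~ A + B * sizes**(-beta) by scanning beta and solving
    a linear least-squares problem in (A, B) for each candidate."""
    sizes = np.asarray(sizes, dtype=float)
    risks = np.asarray(risks, dtype=float)
    if betas is None:
        betas = np.linspace(0.05, 2.0, 391)   # grid step 0.005 (arbitrary choice)
    best = None
    for beta in betas:
        F = np.column_stack([np.ones_like(sizes), sizes ** (-beta)])
        coef, *_ = np.linalg.lstsq(F, risks, rcond=None)
        sse = float(np.sum((F @ coef - risks) ** 2))
        if best is None or sse < best[0]:
            best = (sse, coef[0], coef[1], beta)
    _, A, B, beta = best
    return A, B, beta

# Hypothetical test errors (synthetic and noise-free, for illustration only).
sizes = np.array([500, 1000, 2000, 4000, 8000, 16000], dtype=float)
R_or = 0.10 + 2.0 * sizes ** -0.5   # models trained on original data only
R_su = 0.18 + 3.0 * sizes ** -0.5   # models trained on surrogate data only

A_or, B_or, beta_or = fit_power_law(sizes, R_or)   # step (i)
A_su, B_su, beta_su = fit_power_law(sizes, R_su)   # step (ii)

# Step (iii): quantities that enter the scaling law.
R_star = A_or                 # limiting risk with infinitely many original samples
beta = beta_or                # exponent taken from the original-data fit
R_or_ex = R_or - A_or         # excess risk, original-only models
R_su_ex = R_su - A_or         # excess risk, surrogate-only models
R_su_ex_inf = A_su - A_or     # limiting excess risk of surrogate-only training
```

With noisy measurements the grid scan is less sharply peaked, so in practice one would fit to averages over several runs, as in step $(v)$.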

Let us emphasize that these plots probe the dependence on the hyperparameter $\alpha$. These are much more demanding tests than the usual ones in scaling laws. We generally observe that the scaling law captures well the behavior of the test error for data mixtures.

Binary classification with Gaussian mixture data

This is a simple simulated setting. The original dataset consists of independent and identically distributed examples $(y_i, \boldsymbol{x}_i) \in \mathbb{R} \times \mathbb{R}^d$, $d = 200$, where $y_i$ is uniform over $\{+1, -1\}$, and $\boldsymbol{x}_i\big|_{y_i} \sim \mathsf{N}(y_i \boldsymbol{\theta}_*, \boldsymbol{I}_d)$, where $\boldsymbol{\theta}_* \in \mathbb{R}^d$, $\|\boldsymbol{\theta}_*\| = 1$. Surrogate data have the same distribution, with a different unit vector $\boldsymbol{\theta}_{*,s}$. This data distribution is parametrized by $d$ and the angle $\gamma$ between the original and surrogate parameters, $\cos\gamma := \langle \boldsymbol{\theta}_*, \boldsymbol{\theta}_{*,s}\rangle$. We use $\gamma = \pi/10$ in our experiments. For each $(n, m, \alpha)$, we average the results over 10 independent runs.
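
For concreteness, data from this model can be generated as in the following sketch, which builds the surrogate parameter by rotating $\boldsymbol{\theta}_*$ by the angle $\gamma$ inside a random two-dimensional plane. The function names and the sample sizes $n = 1000$, $m = 5000$ are assumptions made for illustration.

```python
import numpy as np

def make_parameters(d, gamma, rng):
    """Return unit vectors (theta, theta_s) in R^d at angle gamma."""
    theta = rng.standard_normal(d)
    theta /= np.linalg.norm(theta)
    # One Gram-Schmidt step gives a unit vector orthogonal to theta.
    v = rng.standard_normal(d)
    v -= (v @ theta) * theta
    v /= np.linalg.norm(v)
    theta_s = np.cos(gamma) * theta + np.sin(gamma) * v
    return theta, theta_s

def sample_mixture(num, theta, rng):
    """Labels y uniform on {+1, -1}; features x | y ~ N(y * theta, I_d)."""
    y = rng.choice([-1.0, 1.0], size=num)
    x = y[:, None] * theta[None, :] + rng.standard_normal((num, theta.size))
    return x, y

rng = np.random.default_rng(0)
d, gamma = 200, np.pi / 10                     # values quoted in the text
theta, theta_s = make_parameters(d, gamma, rng)
X, y = sample_mixture(1000, theta, rng)        # original data (n is illustrative)
Xs, ys = sample_mixture(5000, theta_s, rng)    # surrogate data (m is illustrative)
```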

We use two different models for classification: (1) Logistic regression; (2) A one-hidden-layer neural network with 32 hidden ReLU neurons. Results for both models are presented in Appendix A.1.

Figure 2: CIFAR10 and CIFAR100 data. Test error when trained on mixtures of original and surrogate data. Black curves: prediction from Eq. (4).

Sentiment analysis in movie reviews

As original data, we use the IMDB dataset (link) which has 25k reviews for training, each labeled as positive or negative. For validation and testing, we split the IMDB test dataset of 25k reviews into a validation set of 10k reviews and test set of 15k reviews.

We experiment with two different surrogate datasets: 1) The Rotten Tomatoes dataset of movie reviews (link): these are data with a different distribution but within the same domain. This dataset contains movie reviews and the corresponding sentiments. 2) Goodreads book reviews (link): these are data from a substantially different domain. This dataset has reviews and their ratings. We choose 10k reviews with a rating of 5 and 10k reviews with a rating of 1, and label them as positive and negative, respectively.

We convert reviews into feature vectors with $d = 884$ dimensions as explained in Appendix A.2. We use logistic regression and neural network models with the same set of parameters as in the Gaussian mixture experiments (except for the input dimension).

Results with neural nets and Rotten Tomatoes as the surrogate dataset are presented in Figure 1, and the remaining results are in Appendix A.2.

Image classification with CIFAR10 and CIFAR100

We use the 50,000 CIFAR10 training images as original data, with the 10 CIFAR10 classes defining the classification task, and test on the 10,000 CIFAR10 test images. We use the 50,000 CIFAR100 training images as surrogate data. We train a 9-layer ResNet model for classification. Appendix A.3 presents details on the data pre-processing and mapping of labels. Results are shown in Figure 2. Note that the CIFAR10 and CIFAR100 datasets are quite different from each other, as they have no overlap either in the images or in their label sets. Yet, the test error when training on their mixture is well predicted by the scaling law (4).

Lasso-based Cox regression on TCGA PanCancer dataset

We use the public domain TCGA pancancer dataset [GCH+20] (link), with gene expressions as covariates and progression-free survival (PFS) as response. After filtering and feature selection, we are left with 3580 female patients, which we use as original data, and 3640 male patients, which we use as surrogate data. We fit a CoxPHFitter model (link) with 500 selected genes and use “1-concordance score” as our loss function. The results are shown in Figure 3. The details of pre-processing and experiment parameters are in Appendix A.4. (Footnote 3: We observe that training at $\alpha = 1$ yields a somewhat singular behavior: we use $\alpha = 0.95$ as a proxy for $\alpha = 1$, see appendices.)

Figure 3: Lasso-based Cox regression on TCGA PanCancer dataset. Test error when trained on mixtures of original and surrogate data. Black curves: prediction from Eq. (4).

High-dimensional ridge regression

We simulate the data distribution in Section 3.4, i.e., $y_i = \langle \boldsymbol{\theta}_*, \boldsymbol{x}_i\rangle + \varepsilon_i$, $i \le n$; $y^s_i = \langle \boldsymbol{\theta}_{*,s}, \boldsymbol{x}^s_i\rangle + \varepsilon^s_i$, $i \le m$; with $\boldsymbol{x}_i, \boldsymbol{x}_i^s \sim \mathsf{N}(\boldsymbol{0}, \boldsymbol{I}_d)$, $\varepsilon_i \sim \mathsf{N}(0, \sigma^2)$, $\varepsilon^s_i \sim \mathsf{N}(0, \sigma_s^2)$, and fit a simple linear model using ridge regression. The results are shown in Figure 4. In our experiments, we use $d = 500$, $\sigma^2 = \sigma_s^2 = 1$, $\|\boldsymbol{\theta}_*\| = \|\boldsymbol{\theta}_{*,s}\| = 1$ and regularization parameter $\lambda = 2^{-10}$. Under these settings, the model is parametrized by the angle $\gamma$ between $\boldsymbol{\theta}_*$ and $\boldsymbol{\theta}_{*,s}$, where $\cos\gamma := \langle \boldsymbol{\theta}_*, \boldsymbol{\theta}_{*,s}\rangle$. We used $\gamma = \pi/6$ and $\pi/2$ in our experiments. (Footnote 4: For ridge regression simulations, we directly plot the excess test risks, as the parameter $\boldsymbol{\theta}_*$ for original data is known. For any $\hat{\boldsymbol{\theta}}$ the excess test risk in this model is simply $\|\boldsymbol{\theta}_* - \hat{\boldsymbol{\theta}}\|^2$.)

The theoretical predictions of Theorem 3 for these curves in the high-dimensional asymptotics $n, m, d \to \infty$, with $n/d \to \delta$, $m/d \to \delta_s$, are reported as blue lines, and match remarkably well with the empirical data. The simple scaling law (4) nevertheless provides a good approximation of these (more complicated) theoretical formulas.

Note in particular that in the top row of Figure 4, we have $\langle \boldsymbol{\theta}_*, \boldsymbol{\theta}_{*,s}\rangle = 0$, i.e. the surrogate data are as far as possible from the original ones. Nevertheless, the induced regularization effect leads to smaller test error on the original distribution.
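
The regularization effect can be made explicit in an idealized limit. As a sketch, assume the weighted objective has the form $\frac{1-\alpha}{n}\sum_{i\le n}(y_i - \langle\boldsymbol{\theta}, \boldsymbol{x}_i\rangle)^2 + \frac{\alpha}{m}\sum_{i\le m}(y^s_i - \langle\boldsymbol{\theta}, \boldsymbol{x}^s_i\rangle)^2 + \lambda\|\boldsymbol{\theta}\|^2$ (one natural reading of weighted ERM for this model, stated here as an assumption) and let $m \to \infty$ with $n, d$ fixed. By the law of large numbers, the surrogate term then converges to $\alpha$ times

$$\mathbb{E}\big\{(y^s - \langle\boldsymbol{\theta}, \boldsymbol{x}^s\rangle)^2\big\} = \mathbb{E}\big\{\langle\boldsymbol{\theta}_{*,s} - \boldsymbol{\theta}, \boldsymbol{x}^s\rangle^2\big\} + \sigma_s^2 = \|\boldsymbol{\theta} - \boldsymbol{\theta}_{*,s}\|^2 + \sigma_s^2,$$

using $\boldsymbol{x}^s \sim \mathsf{N}(\boldsymbol{0}, \boldsymbol{I}_d)$ and the independence of the zero-mean noise $\varepsilon^s$. In this limit the surrogate data act as a ridge-type penalty $\alpha\|\boldsymbol{\theta} - \boldsymbol{\theta}_{*,s}\|^2$ centered at $\boldsymbol{\theta}_{*,s}$. Even when $\boldsymbol{\theta}_{*,s}$ is orthogonal to $\boldsymbol{\theta}_*$, this penalty shrinks the estimate in the directions orthogonal to both vectors, where the target carries no signal, and the resulting variance reduction can outweigh the bias it introduces.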

Figure 4: Ridge regression on simulated data. Here $d = 500$, $n = 1000$, $\sigma^2 = \sigma_s^2 = 1$, $\|\boldsymbol{\theta}_*\| = \|\boldsymbol{\theta}_{*,s}\| = 1$, regularization parameter $\lambda = 2^{-10}$, and $m$ varies by column. Top row: $\gamma = \pi/2$; bottom row: $\gamma = \pi/6$.

We observe that the proposed scaling law (4) predicts the behavior of the experiments well, across the datasets above and for most combinations of original and surrogate examples we have tested.

Finally, we emphasize that the scaling law is only an empirical approximation of reality. This is clearly illustrated by the example of ridge regression: in this case, we use Theorem 3 to predict precisely the discrepancy between the exact asymptotics and the scaling law, see Appendix A.5.

5 Discussion

We conclude by discussing two possible generalizations of the scaling law (4), and its applicability. First, throughout this paper we assumed that $R^{\text{ex}}_{\text{or}}(\infty) = 0$, namely that we can achieve the Bayes error by training on infinitely many original samples. In practice this will not hold because of the limited model complexity. Following standard scaling laws [KMH+20, HBM+22], this effect can be accounted for by an additional term $C \cdot N^{-\omega}$, where $N$ is the model size (number of parameters). Second, the scaling law (4) implies as special cases that $R^{\text{ex}}_{\text{or}}(n) \approx A_{\text{or}}\, n^{-\beta}$, $R^{\text{ex}}_{\text{su}}(m) \approx R^{\text{ex}}_{\text{su}}(\infty) + A_{\text{su}}\, m^{-\beta}$. In particular, the exponent $\beta$ is the same when training on real or surrogate data. In practice, we often observe two somewhat different exponents $\beta_{\text{or}} \neq \beta_{\text{su}}$. In these cases, we set $\beta = \beta_{\text{or}}$, and this appears to work reasonably well. However, we can imagine cases in which the difference between $\beta_{\text{or}}$ and $\beta_{\text{su}}$ is significant enough that (4) will stop being accurate.

References

  • [AAMM+18] Hassan Abu Alhaija, Siva Karthik Mustikovela, Lars Mescheder, Andreas Geiger, and Carsten Rother, Augmented reality meets computer vision: Efficient data generation for urban driving scenes, International Journal of Computer Vision 126 (2018), 961–972.
  • [ANZ22] Ibrahim M Alabdulmohsin, Behnam Neyshabur, and Xiaohua Zhai, Revisiting neural scaling laws in language and vision, Advances in Neural Information Processing Systems 35 (2022), 22300–22312.
  • [Bir06] Steven Bird, NLTK: the natural language toolkit, Proceedings of the COLING/ACL 2006 Interactive Presentation Sessions, 2006, pp. 69–72.
  • [BL96] Lawrence D Brown and Mark G Low, Asymptotic equivalence of nonparametric regression and white noise, The Annals of Statistics 24 (1996), no. 6, 2384–2398.
  • [CLCG19] Yuhua Chen, Wen Li, Xiaoran Chen, and Luc Van Gool, Learning semantic segmentation from synthetic data: A geometrically guided input-output adaptation approach, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 1841–1850.
  • [CM22] Chen Cheng and Andrea Montanari, Dimension free ridge regression, arXiv:2210.08571 (2022).
  • [EM77] Bradley Efron and Carl Morris, Stein’s paradox in statistics, Scientific American 236 (1977), no. 5, 119–127.
  • [FCK+23] Lijie Fan, Kaifeng Chen, Dilip Krishnan, Dina Katabi, Phillip Isola, and Yonglong Tian, Scaling laws of synthetic images for model training… for now, arXiv preprint arXiv:2312.04567 (2023).
  • [GCH+20] Mary J Goldman, Brian Craft, Mim Hastie, Kristupas Repečka, Fran McDade, Akhil Kamath, Ayan Banerjee, Yunhai Luo, Dave Rogers, Angela N Brooks, et al., Visualizing and interpreting cancer genomics data via the xena platform, Nature biotechnology 38 (2020), no. 6, 675–678.
  • [GN21] Evarist Giné and Richard Nickl, Mathematical foundations of infinite-dimensional statistical models, Cambridge University Press, 2021.
  • [Gor85] Yehoram Gordon, Some inequalities for gaussian processes and applications, Israel Journal of Mathematics 50 (1985), no. 4, 265–289.
  • [HBM+22] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al., Training compute-optimal large language models, arXiv preprint arXiv:2203.15556 (2022).
  • [HKHM21] Danny Hernandez, Jared Kaplan, Tom Henighan, and Sam McCandlish, Scaling laws for transfer, arXiv preprint arXiv:2102.01293 (2021).
  • [HKK+20] Tom Henighan, Jared Kaplan, Mor Katz, Mark Chen, Christopher Hesse, Jacob Jackson, Heewoo Jun, Tom B Brown, Prafulla Dhariwal, Scott Gray, et al., Scaling laws for autoregressive generative modeling, arXiv preprint arXiv:2010.14701 (2020).
  • [HNA+17] Joel Hestness, Sharan Narang, Newsha Ardalani, Gregory Diamos, Heewoo Jun, Hassan Kianinejad, Md Mostofa Ali Patwary, Yang Yang, and Yanqi Zhou, Deep learning scaling is predictable, empirically, arXiv preprint arXiv:1712.00409 (2017).
  • [HNK+22] Xuanli He, Islam Nassar, Jamie Kiros, Gholamreza Haffari, and Mohammad Norouzi, Generate, annotate, and learn: NLP with synthetic text, Transactions of the Association for Computational Linguistics 10 (2022), 826–842.
  • [HSY+22] Ruifei He, Shuyang Sun, Xin Yu, Chuhui Xue, Wenqing Zhang, Philip Torr, Song Bai, and Xiaojuan Qi, Is synthetic data from generative models ready for image recognition?, arXiv preprint arXiv:2210.07574 (2022).
  • [JRBM+17] Matthew Johnson-Roberson, Charles Barto, Rounak Mehta, Sharath Nittur Sridhar, Karl Rosaen, and Ram Vasudevan, Driving in the matrix: Can virtual worlds replace human-generated annotations for real world tasks?, 2017 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2017, pp. 746–753.
  • [KMH+20] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei, Scaling laws for neural language models, arXiv preprint arXiv:2001.08361 (2020).
  • [MHZH22] Yu Meng, Jiaxin Huang, Yu Zhang, and Jiawei Han, Generating training data with language models: Towards zero-shot language understanding, Advances in Neural Information Processing Systems 35 (2022), 462–477.
  • [MM21] Léo Miolane and Andrea Montanari, The distribution of the lasso: Uniform control over sparse balls and adaptive parameter tuning, The Annals of Statistics 49 (2021), no. 4, 2313–2335.
  • [MPRP16] Andreas Maurer, Massimiliano Pontil, and Bernardino Romera-Paredes, The benefit of multitask representation learning, Journal of Machine Learning Research 17 (2016), no. 81, 1–32.
  • [MPT+22] Arthur Moreau, Nathan Piasco, Dzmitry Tsishkou, Bogdan Stanciulescu, and Arnaud de La Fortelle, Lens: Localization enhanced by nerf synthesis, Conference on Robot Learning, PMLR, 2022, pp. 1347–1356.
  • [MRB+23] Niklas Muennighoff, Alexander M Rush, Boaz Barak, Teven Le Scao, Aleksandra Piktus, Nouamane Tazi, Sampo Pyysalo, Thomas Wolf, and Colin Raffel, Scaling data-constrained language models, arXiv preprint arXiv:2305.16264 (2023).
  • [Rei08] Markus Reiß, Asymptotic equivalence for nonparametric regression with multivariate and random design, The Annals of Statistics (2008), 1957–1982.
  • [RRBS19] Jonathan S Rosenfeld, Amir Rosenfeld, Yonatan Belinkov, and Nir Shavit, A constructive prediction of the generalization error across scales, International Conference on Learning Representations, 2019.
  • [RSM+16] German Ros, Laura Sellart, Joanna Materzynska, David Vazquez, and Antonio M Lopez, The synthia dataset: A large collection of synthetic images for semantic segmentation of urban scenes, Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 3234–3243.
  • [SLW20] Viktor Seib, Benjamin Lange, and Stefan Wirtz, Mixing real and synthetic data to enhance neural network training–a review of current approaches, arXiv preprint arXiv:2007.08781 (2020).
  • [Ste81] Charles M Stein, Estimation of the mean of a multivariate normal distribution, The annals of Statistics (1981), 1135–1151.
  • [TAH18] Christos Thrampoulidis, Ehsan Abbasi, and Babak Hassibi, Precise error analysis of regularized M-estimators in high dimensions, IEEE Transactions on Information Theory 64 (2018), no. 8, 5592–5628.
  • [TDR+21] Yi Tay, Mostafa Dehghani, Jinfeng Rao, William Fedus, Samira Abnar, Hyung Won Chung, Sharan Narang, Dani Yogatama, Ashish Vaswani, and Donald Metzler, Scale efficiently: Insights from pretraining and finetuning transformers, International Conference on Learning Representations, 2021.
  • [TJJ20] Nilesh Tripuraneni, Michael Jordan, and Chi Jin, On the theory of transfer learning: The importance of task diversity, Advances in neural information processing systems 33 (2020), 7852–7862.
  • [TOH15] Christos Thrampoulidis, Samet Oymak, and Babak Hassibi, Regularized linear regression: A precise analysis of the estimation error, Proceedings of Machine Learning Research 40 (2015), 1683–1709.
  • [TPA+18] Jonathan Tremblay, Aayush Prakash, David Acuna, Mark Brophy, Varun Jampani, Cem Anil, Thang To, Eric Cameracci, Shaad Boochoon, and Stan Birchfield, Training deep networks with synthetic data: Bridging the reality gap by domain randomization, Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 969–977.
  • [Tsy09] Alexandre B. Tsybakov, Introduction to nonparametric estimation, Springer, 2009.
  • [vdV00] Aad W van der Vaart, Asymptotic statistics, Cambridge University Press, 2000.
  • [Ver18] Roman Vershynin, High-dimensional probability: An introduction with applications in data science, vol. 47, Cambridge university press, 2018.
  • [YCFB+22] Lin Yen-Chen, Pete Florence, Jonathan T Barron, Tsung-Yi Lin, Alberto Rodriguez, and Phillip Isola, Nerf-supervision: Learning dense object descriptors from neural radiance fields, 2022 International Conference on Robotics and Automation (ICRA), IEEE, 2022, pp. 6496–6503.

Appendix A Details of empirical results

A.1 Binary classification with Gaussian mixture data


Figure 5: Gaussian mixture data and logistic regression. Test error when trained on original (left plot) and surrogate (right plot) data only (red dots). Best scaling-law fits are shown in blue. These give the estimates $\beta=0.72$, $R_{*}=0.157$, and $R_{\mathsf{su}}^{\mathsf{ex}}(\infty)=0.013$.

Figure 6: Gaussian mixture data and logistic regression. Test error when trained on mixtures of original ($n$ varying by row) and surrogate ($m$ varying by column) data. Black curves: scaling formula (4).

We provide details for the models used in the simulations of Section A.1.

Logistic regression: We use the scikit-learn implementation with the lbfgs solver, fitting the intercept, with the maximum number of iterations set to 10,000. For each run of each $(n,m,\alpha)$ combination, we sweep the $\ell_{2}$ penalty (parameter C in scikit-learn) over $2^{i}$, $i=-8,\dots,8$, and $10^{i}$, $i=-6,-5,-4,-3,3,4,5,6$, and report the test result only for the value that achieves the best validation error. The results of the individual scaling law estimates and the comparison of joint training results with the scaling law predictions are shown in Figures 5 and 6.
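The penalty sweep described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper `select_by_validation` and the stand-in `toy_error` are hypothetical names introduced here, and a real run would replace `toy_error` with a function that fits the logistic regression for a given C and returns its validation error. Note that in scikit-learn, C is the inverse of the regularization strength.

```python
# Candidate values for scikit-learn's C, as listed in the text.
powers_of_two = [2.0 ** i for i in range(-8, 9)]                   # 2^-8, ..., 2^8
powers_of_ten = [10.0 ** i for i in (-6, -5, -4, -3, 3, 4, 5, 6)]  # 10^-6..10^-3, 10^3..10^6
C_GRID = sorted(set(powers_of_two + powers_of_ten))


def select_by_validation(candidates, validation_error):
    """Return the candidate with the smallest validation error.

    `validation_error` is a hypothetical callable: it should fit the model
    with the given C and return its error on the validation split.
    """
    return min(candidates, key=validation_error)


# Toy stand-in for "fit and evaluate" (assumption, for illustration only).
# A real run would train, for example,
# LogisticRegression(C=c, solver="lbfgs", fit_intercept=True, max_iter=10_000)
# and score it on held-out validation data.
def toy_error(c):
    return abs(c - 1.0)


best_C = select_by_validation(C_GRID, toy_error)
```

Only the test error of the model trained with `best_C` would then be reported, matching the selection rule in the text.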


Figure 7: Gaussian mixture data and neural network. Test error scaling of the original data (left), and surrogate data (right). Best curve fits give the estimates $\beta=0.79$, $R_{*}=0.160$, and $R_{\mathsf{su}}^{\mathsf{ex}}(\infty)=0.010$.

Figure 8: Gaussian mixture data and neural network. Test error when trained on mixtures of original ($n$ varying by row) and surrogate ($m$ varying by column) data. Black curves: scaling law (4).

Neural network: The network has one hidden layer with 32 ReLU neurons, and an output neuron using sigmoid. For training, we use the binary cross entropy loss, a constant learning rate of 0.05, and batch size 64. We train the network for 1,000 epochs. Similar to the procedure in logistic regression, we use $\ell_{2}$ regularization (weight decay) and use the validation set to choose the best regularization parameter from the set $\{0, 10^{-5}, 10^{-4}, 10^{-3}, 2\cdot 10^{-3}, 4\cdot 10^{-3}, 10^{-2}, 2\cdot 10^{-2}, 4\cdot 10^{-2}, 10^{-1}, 2\cdot 10^{-1}, 4\cdot 10^{-1}\}$. The results of the individual scaling law estimates and the comparison of joint training results with the scaling law predictions are shown in Figures 7 and 8.
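As a rough illustration of the architecture and training settings just described, here is a small NumPy sketch. Several choices are assumptions not stated in the text: plain mini-batch SGD as the optimizer, the Gaussian initialization, weight decay applied to weights but not biases, and the toy data. The demo also trains for 50 epochs rather than the 1,000 used in the experiments.

```python
import numpy as np

# Weight-decay candidates listed in the text.
WEIGHT_DECAY_GRID = [0, 1e-5, 1e-4, 1e-3, 2e-3, 4e-3, 1e-2, 2e-2, 4e-2, 1e-1, 2e-1, 4e-1]


def init_params(d, hidden=32, seed=0):
    # Initialization scheme is an assumption made for this sketch.
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 1.0 / np.sqrt(hidden), size=(hidden, 1)),
        "b2": np.zeros(1),
    }


def forward(params, X):
    H = np.maximum(X @ params["W1"] + params["b1"], 0.0)  # 32 ReLU hidden units
    logits = (H @ params["W2"] + params["b2"]).ravel()
    P = 1.0 / (1.0 + np.exp(-logits))                      # sigmoid output neuron
    return H, P


def bce(P, y, eps=1e-12):
    P = np.clip(P, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(P) + (1.0 - y) * np.log(1.0 - P)))


def sgd_step(params, X, y, lr=0.05, weight_decay=0.0):
    H, P = forward(params, X)
    dlogits = (P - y)[:, None] / X.shape[0]  # gradient of mean BCE w.r.t. logits
    grads = {"W2": H.T @ dlogits, "b2": dlogits.sum(axis=0)}
    dH = dlogits @ params["W2"].T
    dH[H <= 0.0] = 0.0                       # ReLU gradient
    grads["W1"] = X.T @ dH
    grads["b1"] = dH.sum(axis=0)
    for k in params:
        # L2 penalty on weights only (assumption).
        decay = weight_decay * params[k] if k.startswith("W") else 0.0
        params[k] = params[k] - lr * (grads[k] + decay)
    return params


def train(X, y, epochs, batch_size=64, lr=0.05, weight_decay=0.0, seed=0):
    rng = np.random.default_rng(seed)
    params = init_params(X.shape[1], seed=seed)
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            params = sgd_step(params, X[idx], y[idx], lr, weight_decay)
    return params


# Toy two-class Gaussian data (hypothetical, for illustration only).
rng = np.random.default_rng(1)
y_toy = rng.integers(0, 2, size=256).astype(float)
X_toy = rng.normal(size=(256, 5)) + (2.0 * y_toy[:, None] - 1.0)

loss_before = bce(forward(init_params(5), X_toy)[1], y_toy)
trained = train(X_toy, y_toy, epochs=50)
loss_after = bce(forward(trained, X_toy)[1], y_toy)
```

In the experiments, one such model would be trained per value in `WEIGHT_DECAY_GRID`, keeping the one with the best validation error.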

A.2 Sentiment analysis in movie reviews


Figure 9: IMDB and Rotten Tomatoes data and logistic regression. Test error when trained on original (left plot) and surrogate (right plot) data only (red dots), together with scaling law fits. Best fit parameters are $\beta=0.27$, $R_{*}=0.101$, and $R_{\mathsf{su}}^{\mathsf{ex}}(\infty)=0.148$.

Figure 10: IMDB and Rotten Tomatoes data and logistic regression. Test error when trained on mixtures of original and surrogate data. Black curves: prediction from Eq. (4).

Figure 11: IMDB and Goodreads book reviews (as surrogate dataset) and logistic regression. Test error when trained on original (left plot) and surrogate (right plot) data only (red dots), together with scaling law fits. Best fit parameters are $\beta=0.27$, $R_{*}=0.101$, and $R_{\mathsf{su}}^{\mathsf{ex}}(\infty)=0.101$.

Figure 12: IMDB and Goodreads book reviews and logistic regression. Test error when trained on mixtures of original and surrogate data. Black curves: prediction from Eq. (4).

Figure 13: IMDB and Rotten Tomatoes data and neural networks. Scaling law fits for models trained on original (left plot) and surrogate (right plot) data only (red dots), as in Fig. 9. Best fit parameters are $\beta=0.37$, $R_{*}=0.145$, and $R_{\mathsf{su}}^{\mathsf{ex}}(\infty)=0.095$.

To convert the movie reviews and book reviews to vectors, we combine two embeddings. First, we apply the nltk tagger [Bir06] to the reviews in the training data to find the 500 most frequent adjectives. We then apply a tf-idf vectorizer (scikit-learn’s implementation) with these 500 adjectives as its vocabulary, which gives a 500-dimensional vector for each review. Second, we apply the “Paraphrase-MiniLM-L6-v2” sentence transformer, which is based on BERT with 6 Transformer encoder layers and returns a 384-dimensional vector representation of each review. For each review we concatenate the outputs of the tf-idf vectorizer and the sentence transformer to get an 884-dimensional representation that we use as our input vector.
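The concatenation step above amounts to stacking the two feature blocks side by side. The sketch below only illustrates the shapes involved: the function name `combine_features` is hypothetical, and random arrays stand in for the actual tf-idf and sentence-transformer outputs.

```python
import numpy as np

N_ADJECTIVES = 500  # tf-idf vocabulary size (from the text)
EMBED_DIM = 384     # sentence-embedding dimension (from the text)


def combine_features(tfidf_block, embedding_block):
    """Concatenate per-review tf-idf and sentence-embedding features."""
    # A scikit-learn tf-idf output is sparse; it would need .toarray() first.
    tfidf_block = np.asarray(tfidf_block, dtype=float)
    embedding_block = np.asarray(embedding_block, dtype=float)
    if tfidf_block.shape[1] != N_ADJECTIVES:
        raise ValueError("expected 500 tf-idf features per review")
    if embedding_block.shape[1] != EMBED_DIM:
        raise ValueError("expected 384 embedding features per review")
    return np.hstack([tfidf_block, embedding_block])


# Placeholder inputs for three reviews (random values, illustration only).
rng = np.random.default_rng(0)
fake_tfidf = rng.random((3, N_ADJECTIVES))
fake_embeddings = rng.normal(size=(3, EMBED_DIM))
combined = combine_features(fake_tfidf, fake_embeddings)
```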

We use logistic regression and neural networks with the same set of parameters as in the Gaussian mixture experiments (except for the input dimension). We plot the average loss over 10 independent runs.

Results omitted from the main text are presented in Figures 9–13.

A.3 Image classification with CIFAR10 and CIFAR100


Figure 14: CIFAR10 and CIFAR100 data: Test error scaling of original data (left) and surrogate data (right). Best fit parameters are $\beta=0.404$, $R_{*}=0.0013$, and $R_{\mathsf{su}}^{\mathsf{ex}}(\infty)=0.199$.

We largely use the model and the training procedure described at https://jovian.ml/aakashns/05b-cifar10-resnet. We normalize the images by mean and standard deviation. We train a 9-layer ResNet model for classification, using Adam for optimization, weight decay, and gradient clipping, trained over 16 epochs with a one-cycle learning rate scheduling policy, minimizing cross entropy loss. For each combination of $m$, $n$, and $\alpha$, we report the average test error over 10 runs. Since there is no overlap between the label sets of CIFAR10 and CIFAR100, the latter dataset needs to be relabeled. We do this by training a separate 9-layer ResNet model on 10,000 randomly chosen CIFAR10 images from the training set of 50,000 examples (without creating a separate split for them), and use its predictions on CIFAR100 images as labels.

Scaling curves are presented in Figures 14 and 2.

A.4 Lasso on TCGA PanCancer dataset

We used the public-domain TCGA PanCancer dataset. After filtering out samples with incomplete values, we are left with 9220 patients, each with 20,531 gene expression values; the outcome is PFS (progression-free survival). Of these, we used a group of 2000 patients, split into train and test sets of 1000 each, to select the 500 genes with the largest absolute Cox PH score. We also used the mean and standard deviation of the gene expression values of these 2000 patients to normalize the gene expression columns for the remaining 7220 patients. Among the remaining 7220 patients, 3580 were female. We treated the female patients’ data as original data and split it into train (50%), test (25%), and validation (25%) splits. The data of the remaining 3640 patients was used as the surrogate dataset. We fit a CoxPHFitter model with the 500 selected genes and use “1-concordance score” as our loss function. We used the validation split to choose the best value of the $\ell_{1}$ penalty parameter from $2^{i}$, $i=2,0,-2,-4,-6,-8,-10,-12,-14,-16$. We observed a discontinuity at $\alpha=1$.
To avoid this discontinuity, we approximated $R(\hat{\boldsymbol{\theta}}_{n,m}(1))$ by $R(\hat{\boldsymbol{\theta}}_{n,m}(1-\epsilon))$ if $n>0$ and by $R(\hat{\boldsymbol{\theta}}_{m/2,m}(1-\epsilon))$ if $n=0$, where we choose $\epsilon=0.05$. We plot the average loss over 10 independent runs. The results are presented in Figures 15 and 3.
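The substitution rule used at $\alpha=1$ can be written as a small helper. This is a sketch under stated assumptions: the function name `surrogate_only_proxy` is hypothetical, and rounding $m/2$ down when $m$ is odd is a choice made here that the text does not specify.

```python
EPSILON = 0.05  # value of epsilon chosen in the text


def surrogate_only_proxy(n, m, eps=EPSILON):
    """Return the (n, m, alpha) configuration evaluated in place of alpha = 1.

    If n > 0, keep n and m and use alpha = 1 - eps.
    If n = 0, use m/2 original samples instead (integer division is an
    assumption of this sketch), again with alpha = 1 - eps.
    """
    if n > 0:
        return n, m, 1.0 - eps
    return m // 2, m, 1.0 - eps


with_originals = surrogate_only_proxy(100, 400)
without_originals = surrogate_only_proxy(0, 400)
```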


Figure 15: Lasso-based Cox regression on TCGA PanCancer dataset with female patients data as original data and male patients data as surrogate data. Scaling law fits for models trained on original (left plot) and surrogate (right plot) data only (red dots), as in Fig. 9. Best fit parameters are $\beta=1.55$, $R_{*}=0.29$, and $R_{\mathsf{su}}^{\mathsf{ex}}(\infty)=0.29$.

A.5 High-dimensional ridge regression

We present additional ridge regression experiments here in Figs. 16–27. We plot the average loss over 10 independent runs. In these experiments, as in the main paper, we set $d=500$, $\sigma^{2}=\sigma_{s}^{2}=1$, $\|\boldsymbol{\theta}_{*}\|=1$, $\|\boldsymbol{\theta}_{*,s}\|=1$, except for the last four Figs. 24–27, where we use $\|\boldsymbol{\theta}_{*,s}\|=1/2$. We used angle $\gamma=\pi/6$ and $\pi/2$ in our experiments.

We consider two methods: (1) fix $\lambda$ to a very small value, $2^{-10}$, and (2) for each random draw of datasets, select the $\lambda$ that achieves the best validation performance. For the latter method, we try $\lambda=2^{i}$, where $i=-10,-8,-6,\ldots,8,10$. For ridge regression simulations, we directly plot the excess test risks, as the parameter $\theta$ for original data is known and for any $\hat{\theta}$ the excess test risk in this model is $\|\theta-\hat{\theta}\|^{2}$.
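To make the setup concrete, the following NumPy sketch simulates one draw of a mixed original/surrogate ridge problem and evaluates the excess risk $\|\theta-\hat{\theta}\|^{2}$. It rests on assumptions that are not spelled out in this appendix: isotropic Gaussian covariates, and a weighted objective of the form $(1-\alpha)$ times the average original squared loss plus $\alpha$ times the average surrogate squared loss plus $\lambda\|\theta\|^{2}$. All function names are hypothetical, and the demo uses $d=50$ instead of $d=500$ to keep it fast.

```python
import numpy as np

LAMBDA_GRID = [2.0 ** i for i in range(-10, 11, 2)]  # 2^-10, 2^-8, ..., 2^10


def make_parameters(d, gamma, norm_s=1.0):
    """Unit-norm theta and a surrogate theta_s of norm `norm_s` at angle gamma."""
    theta = np.zeros(d)
    theta[0] = 1.0
    theta_s = np.zeros(d)
    theta_s[0] = norm_s * np.cos(gamma)
    theta_s[1] = norm_s * np.sin(gamma)
    return theta, theta_s


def sample(theta, n, sigma, rng):
    # Isotropic Gaussian covariates (assumption of this sketch).
    X = rng.normal(size=(n, theta.size))
    y = X @ theta + sigma * rng.normal(size=n)
    return X, y


def weighted_ridge(X, y, Xs, ys, alpha, lam):
    """Closed-form minimizer of the assumed weighted ridge objective."""
    n, d = X.shape
    m = Xs.shape[0]
    A = (1 - alpha) / n * (X.T @ X) + alpha / m * (Xs.T @ Xs) + lam * np.eye(d)
    b = (1 - alpha) / n * (X.T @ y) + alpha / m * (Xs.T @ ys)
    return np.linalg.solve(A, b)


def excess_risk(theta, theta_hat):
    return float(np.sum((theta - theta_hat) ** 2))


def select_lambda(X, y, Xs, ys, X_val, y_val, alpha):
    """Method (2): pick lambda from the grid by validation error on original data."""
    def val_mse(lam):
        theta_hat = weighted_ridge(X, y, Xs, ys, alpha, lam)
        return float(np.mean((y_val - X_val @ theta_hat) ** 2))
    return min(LAMBDA_GRID, key=val_mse)


rng = np.random.default_rng(0)
theta, theta_s = make_parameters(d=50, gamma=np.pi / 6)
X, y = sample(theta, 200, 1.0, rng)
Xs, ys = sample(theta_s, 400, 1.0, rng)
X_val, y_val = sample(theta, 200, 1.0, rng)

# Method (1): fixed small lambda, for two mixing weights.
risks = {a: excess_risk(theta, weighted_ridge(X, y, Xs, ys, a, 2.0 ** -10))
         for a in (0.0, 0.5)}
# Method (2): lambda chosen by validation.
best_lam = select_lambda(X, y, Xs, ys, X_val, y_val, alpha=0.5)
```

Averaging `excess_risk` over independent draws, as done in the experiments, would give one point of the curves in the figures that follow.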


Figure 16: Ridge regression with $\gamma=\pi/2$, and regularization parameter $\lambda=2^{-10}$: Test error scaling of the original data (left), and surrogate data (right). Best curve fits give the estimates $\beta=1.57$ and $R_{\mathsf{su}}^{\mathsf{ex}}(\infty)=2.0$.

Figure 17: Ridge regression with $\gamma=\pi/2$, and regularization parameter $\lambda=2^{-10}$.

Figure 18: Ridge regression with angle $\pi/6$ between $\theta$ and $\theta_{s}$, and regularization parameter $\lambda=2^{-10}$: Test error scaling of the original data (left), and surrogate data (right). Best curve fits give the estimates $\beta=1.57$ and $R_{\mathsf{su}}^{\mathsf{ex}}(\infty)=0.29$.

Figure 19: Ridge regression with angle $\pi/6$ between $\theta$ and $\theta_{s}$, and regularization parameter $\lambda=2^{-10}$.

Figure 20: Ridge regression with angle $\pi/2$ between $\theta$ and $\theta_{s}$, and the best regularization parameter: Test error scaling of the original data (left), and surrogate data (right). Best curve fits give the estimates $\beta=0.94$ and $R_{\mathsf{su}}^{\mathsf{ex}}(\infty)=1.0$.

Figure 21: Ridge regression with $\gamma=\pi/2$, and the best regularization parameter.

Figure 22: Ridge regression with angle $\pi/6$ between $\theta$ and $\theta_{s}$, and the best regularization parameter: Test error scaling of the original data (left), and surrogate data (right). Best curve fits give the estimates $\beta=0.94$ and $R_{\mathsf{su}}^{\mathsf{ex}}(\infty)=0.24$.

Figure 23: Ridge regression with angle $\pi/6$ between $\theta$ and $\theta_{s}$, and the best regularization parameter.

Figure 24: Ridge regression with angle $\pi/2$ between $\theta$ and $\theta_{s}$, $\|\theta\|=1$, $\|\theta_{s}\|=1/2$, and the best regularization parameter: Test error scaling of the original data (left), and surrogate data (right). Best curve fits give the estimates $\beta=0.94$ and $R_{\mathsf{su}}^{\mathsf{ex}}(\infty)=1.00$.

Figure 25: Ridge regression with angle $\pi/2$ between $\theta$ and $\theta_{s}$, $\|\theta\|=1$, $\|\theta_{s}\|=1/2$, and the best regularization parameter.

Figure 26: Ridge regression with $\gamma=\pi/2$, $\|\theta\|=1$, $\|\theta_{s}\|=1/2$, and regularization parameter $\lambda=2^{-10}$: Test error scaling of the original data (left), and surrogate data (right). Best curve fits give the estimates $\beta=1.57$ and $R_{\mathsf{su}}^{\mathsf{ex}}(\infty)=1.27$.

Figure 27: Ridge regression with $\gamma=\pi/2$, $\|\theta\|=1$, $\|\theta_{s}\|=1/2$, and regularization parameter $\lambda=2^{-10}$.

Appendix B Low-dimensional asymptotics

B.1 Formal statements

In this appendix, we present our results on the estimator of Eqs. (2), (3) under the classical asymptotics $n,m\to\infty$ at $d$ fixed. For simplicity, we assume no regularizer is used in this regime.

Beyond the classical regularity assumptions of low-dimensional asymptotics, in this section we make the following assumption, which guarantees that the original and surrogate distributions are ‘not arbitrarily far.’ Recall that $R^{s}(\boldsymbol{\theta})$ denotes the population error on surrogate data.

Assumption 1 (Distribution shift for low-$d$ asymptotics).

There exists a constant $K_{*}$ such that, for all $\boldsymbol{\theta}\in\mathbb{R}^{d}$,

$$\big|R^{s}(\boldsymbol{\theta})-R(\boldsymbol{\theta})\big|\le K_{*}\big(1+R(\boldsymbol{\theta})\big)\,. \tag{18}$$

The regularity conditions are similar to the ones in [vdV00]. Here and in the following, $\mathsf{B}(\boldsymbol{\theta}_{*},r)$ is the ball of radius $r$ centered at $\boldsymbol{\theta}_{*}$.

Assumption 2 (‘Classical’ regularity).

  (a) The original population risk $R(\boldsymbol{\theta})$ is uniquely minimized at a point $\boldsymbol{\theta}_{*}$.

  (b) $\boldsymbol{\theta}\mapsto\ell(\boldsymbol{\theta};\boldsymbol{z})$ is non-negative lower semicontinuous. Further, define the following limit in $[0,\infty]$ for $\boldsymbol{u}\in\mathbb{S}^{d-1}$:

    $$\ell_{\infty}(\boldsymbol{u};\boldsymbol{z}):=\liminf_{\substack{\boldsymbol{\theta}\to\infty\\ \boldsymbol{\theta}/\|\boldsymbol{\theta}\|_{2}\to\boldsymbol{u}}}\ell(\boldsymbol{\theta};\boldsymbol{z})\,. \tag{19}$$

    Then we assume $\inf_{\boldsymbol{u}\in\mathbb{S}^{d-1}}\mathbb{E}\,\ell_{\infty}(\boldsymbol{u};\boldsymbol{z})\ge R(\boldsymbol{\theta}_{*})+c$ for some $c>0$.

  (c) $\boldsymbol{\theta}\mapsto\ell(\boldsymbol{\theta};\boldsymbol{z})$ is differentiable at $\boldsymbol{\theta}_{*}$ almost surely, both under $\boldsymbol{z}\sim\mathbb{P}$ and under $\boldsymbol{z}\sim\mathbb{P}^{s}$. Further, there exists $r>0$ such that, letting $\mathsf{B}:=\mathsf{B}(\boldsymbol{\theta}_{*},r)$, the following holds for a constant $C$:

    $$\mathbb{E}\sup_{\boldsymbol{\theta}_{1}\neq\boldsymbol{\theta}_{2}\in\mathsf{B}}\Big\{\frac{|\ell(\boldsymbol{\theta}_{1};\boldsymbol{z})-\ell(\boldsymbol{\theta}_{2};\boldsymbol{z})|^{2}}{\|\boldsymbol{\theta}_{1}-\boldsymbol{\theta}_{2}\|_{2}^{2}}\Big\}\le C<\infty\,. \tag{20}$$

  (d) The functions $\boldsymbol{\theta}\mapsto R(\boldsymbol{\theta})$, $\boldsymbol{\theta}\mapsto R^{s}(\boldsymbol{\theta})$ are twice differentiable in a neighborhood of $\boldsymbol{\theta}_{*}$, with Lipschitz continuous Hessian. Further, $\nabla^{2}R(\boldsymbol{\theta}_{*})\succ\boldsymbol{0}$ (strictly positive definite).

Proposition B.1.

Under Assumption 1 and Assumption 2, define the following d×d𝑑𝑑d\times ditalic_d × italic_d matrices

𝑯𝑯\displaystyle{\boldsymbol{H}}bold_italic_H :=2R(𝜽)=𝔼[2(𝜽;𝒛)],assignabsentsuperscript2𝑅subscript𝜽𝔼delimited-[]superscript2subscript𝜽𝒛\displaystyle:=\nabla^{2}R({\boldsymbol{\theta}}_{*})={\mathbb{E}}[\nabla^{2}% \ell({\boldsymbol{\theta}}_{*};{\boldsymbol{z}})]\,,:= ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_R ( bold_italic_θ start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ) = blackboard_E [ ∇ start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT roman_ℓ ( bold_italic_θ start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ; bold_italic_z ) ] , (21)
𝑲𝑲\displaystyle{\boldsymbol{K}}bold_italic_K :=Cov((𝜽;𝒛);(𝜽;𝒛)),assignabsentCovsubscript𝜽𝒛subscript𝜽𝒛\displaystyle:={\rm Cov}\big{(}\nabla\ell({\boldsymbol{\theta}}_{*};{% \boldsymbol{z}});\nabla\ell({\boldsymbol{\theta}}_{*};{\boldsymbol{z}})\big{)}% \,,\;\;\;\;\;:= roman_Cov ( ∇ roman_ℓ ( bold_italic_θ start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ; bold_italic_z ) ; ∇ roman_ℓ ( bold_italic_θ start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ; bold_italic_z ) ) , (22)
𝑲ssubscript𝑲𝑠\displaystyle{\boldsymbol{K}}_{s}bold_italic_K start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT :=Covs((𝜽;𝒛s);(𝜽;𝒛s)),assignabsentsubscriptCov𝑠subscript𝜽superscript𝒛𝑠subscript𝜽superscript𝒛𝑠\displaystyle:={\rm Cov}_{s}\big{(}\nabla\ell({\boldsymbol{\theta}}_{*};{% \boldsymbol{z}}^{s});\nabla\ell({\boldsymbol{\theta}}_{*};{\boldsymbol{z}}^{s}% )\big{)}\,,:= roman_Cov start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT ( ∇ roman_ℓ ( bold_italic_θ start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ; bold_italic_z start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT ) ; ∇ roman_ℓ ( bold_italic_θ start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ; bold_italic_z start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT ) ) , (23)

where $\mathrm{Cov}$ and $\mathrm{Cov}_s$ denote the covariances with respect to the original data (i.e., with respect to $\boldsymbol{z}\sim\mathbb{P}$) and with respect to the surrogate data (i.e., with respect to $\boldsymbol{z}^s\sim\mathbb{P}_s$), respectively. Further define the $d$-dimensional vector

\begin{align}
\boldsymbol{g}^s := \nabla R^s(\boldsymbol{\theta}_*) - \nabla R(\boldsymbol{\theta}_*)\,. \tag{24}
\end{align}

Then there exists $\alpha_{\max}\in(0,1]$ (depending only on the constants in the assumptions) such that, for all $\alpha\in[0,\alpha_{\max}]$, the excess risk of the estimator $\hat{\boldsymbol{\theta}}_{n,m}(\alpha)$ satisfies (for $D:=\|\boldsymbol{g}^s\|$ bounded by a constant)

\begin{align}
R\big(\hat{\boldsymbol{\theta}}_{n,m}(\alpha)\big) - R\big(\boldsymbol{\theta}_*\big) &= \alpha^2\langle\boldsymbol{g}^s,\boldsymbol{H}^{-1}\boldsymbol{g}^s\rangle + \frac{(1-\alpha)^2}{n}\cdot\mathrm{Tr}\big(\boldsymbol{H}^{-1}\boldsymbol{K}\big) \tag{25}\\
&\quad + \frac{\alpha^2}{m}\cdot\mathrm{Tr}\big(\boldsymbol{H}^{-1}\boldsymbol{K}_s\big) + O\Big(\Big(\frac{1}{m\vee n} + D\alpha^2\Big)\Big(\frac{1}{(m\vee n)^{1/2}} + D\alpha\Big)\Big)\,. \notag
\end{align}

(Here the big $O$ hides dependence on the constants in Assumptions 1 and 2.)
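As an illustration of how the three leading terms of Eq. (25) trade off, the following minimal sketch evaluates them numerically while dropping the $O(\cdot)$ remainder. The function name `leading_order_excess_risk` and all numerical inputs are hypothetical, chosen only to make the example runnable; they are not taken from the paper.

```python
import numpy as np


def leading_order_excess_risk(alpha, n, m, H, K, K_s, g_s):
    """Leading-order terms of Eq. (25); the O(.) remainder is ignored."""
    # alpha^2 <g_s, H^{-1} g_s>: contribution of the distribution shift
    shift_term = alpha**2 * g_s @ np.linalg.solve(H, g_s)
    # (1 - alpha)^2 / n * Tr(H^{-1} K): variance from the n original samples
    var_original = (1 - alpha) ** 2 / n * np.trace(np.linalg.solve(H, K))
    # alpha^2 / m * Tr(H^{-1} K_s): variance from the m surrogate samples
    var_surrogate = alpha**2 / m * np.trace(np.linalg.solve(H, K_s))
    return shift_term + var_original + var_surrogate


# Hypothetical toy inputs (d = 3), for illustration only.
H = np.diag([1.0, 2.0, 4.0])
K = np.eye(3)
K_s = 2.0 * np.eye(3)
g_s = np.array([0.1, 0.0, -0.2])

risk_original_only = leading_order_excess_risk(0.0, 100, 1000, H, K, K_s, g_s)
risk_mixed = leading_order_excess_risk(0.3, 100, 1000, H, K, K_s, g_s)
print(risk_original_only, risk_mixed)
```

With these toy values, setting $\alpha=0$ recovers $\mathrm{Tr}(\boldsymbol{H}^{-1}\boldsymbol{K})/n$, and a moderate $\alpha$ gives a smaller leading-order value because the shift term is small relative to the variance of the original-data term.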

Remark B.1.

For economy of notation, we stated Proposition B.1 for the case in which the excess risk is measured using the same loss as for training, i.e. $\ell_{\mbox{\tiny\rm test}}=\ell$. However, the same result, Eq. (25), applies with minor modifications to the case $\ell_{\mbox{\tiny\rm test}}\neq\ell$ (and thus with $R$ replaced by $R^{\mbox{\tiny\rm test}}$), provided $R^{\mbox{\tiny\rm test}}$ is also twice differentiable with Lipschitz Hessian and $\nabla R^{\mbox{\tiny\rm test}}(\boldsymbol{\theta}_*)=\boldsymbol{0}$. In this case, Eq. (25) has to be modified by replacing $\boldsymbol{H}^{-1}$ with $\boldsymbol{H}^{-1}\nabla^2 R^{\mbox{\tiny\rm test}}(\boldsymbol{\theta}_*)\boldsymbol{H}^{-1}$.
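The replacement described in this remark can be sketched as follows, again keeping only the leading-order terms. The name `leading_order_test_risk` and the argument `H_test` (standing for $\nabla^2 R^{\mbox{\tiny\rm test}}(\boldsymbol{\theta}_*)$) are assumptions introduced for illustration, as are the numerical values.

```python
import numpy as np


def leading_order_test_risk(alpha, n, m, H, H_test, K, K_s, g_s):
    """Leading-order terms of Eq. (25) with H^{-1} replaced by
    H^{-1} H_test H^{-1}, as described in Remark B.1."""
    H_inv = np.linalg.inv(H)
    S = H_inv @ H_test @ H_inv  # sandwich matrix replacing H^{-1}
    shift_term = alpha**2 * g_s @ S @ g_s
    var_original = (1 - alpha) ** 2 / n * np.trace(S @ K)
    var_surrogate = alpha**2 / m * np.trace(S @ K_s)
    return shift_term + var_original + var_surrogate


# Hypothetical toy inputs; choosing H_test = H must reproduce the
# unmodified formula, since H^{-1} H H^{-1} = H^{-1}.
H = np.diag([1.0, 2.0, 4.0])
K = np.eye(3)
K_s = 2.0 * np.eye(3)
g_s = np.array([0.1, 0.0, -0.2])

same_loss = leading_order_test_risk(0.3, 100, 1000, H, H, K, K_s, g_s)
other_loss = leading_order_test_risk(0.3, 100, 1000, H, np.eye(3), K, K_s, g_s)
print(same_loss, other_loss)
```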

Remark B.2.

The error terms in Eq. (25) are negligible under two conditions: $(i)$ $m$ and $n$ are large, which is the classical condition for low-dimensional asymptotics to hold; $(ii)$ $\|\boldsymbol{g}^s\|_2\,\alpha=\|\nabla R^s(\boldsymbol{\theta}_*)\|_2\,\alpha$ is small. In particular, the latter condition will hold in two cases. First, when $\|\nabla R^s(\boldsymbol{\theta}_*)\|_2$ is of order one (i.e. the distribution shift is large), but $\alpha$ is small (surrogate data are downweighted). Note that, when the distribution shift is large and the sample size $n$ is large enough, we expect small $\alpha$ to be optimal, and therefore Eq. (25) covers the ‘interesting’ regime.

Second, when $\|\nabla R^s(\boldsymbol{\theta}_*)\|_2$ is small (i.e. the shift is small) and $\alpha$ is of order one. If in addition we have $\nabla^2 R^s(\boldsymbol{\theta}_*)\approx\nabla^2 R(\boldsymbol{\theta}_*)$, it can be shown that the range of validity of Eq. (25) covers the whole interval $\alpha\in[0,1]$.

Remark B.3.

Note that the distribution shift is measured in Eq. (25) by the first term, $\langle\boldsymbol{g}^s,\boldsymbol{H}^{-1}\boldsymbol{g}^s\rangle$. The original and surrogate distributions can be very different in other metrics (e.g. in total variation or transportation distance), but as long as $\boldsymbol{g}^s$ is small (as measured in the norm defined by $\boldsymbol{H}^{-1}$), surrogate data will reduce test error.

Note that, within the setting of Proposition B.1, the excess error of training only on original data is $R_{\mathsf{or}}^{\mathsf{ex}}(n):=R(\hat{\boldsymbol{\theta}}_{n,0}(0))-R(\boldsymbol{\theta}_*)=\mathrm{Tr}\big(\boldsymbol{H}^{-1}\boldsymbol{K}\big)/n+o(1/n)$, while the excess error of training only on surrogate data is $R_{\mathsf{su}}^{\mathsf{ex}}(m):=R(\hat{\boldsymbol{\theta}}_{0,m}(1))-R(\boldsymbol{\theta}_*)=\langle\boldsymbol{g}^s,\boldsymbol{H}^{-1}\boldsymbol{g}^s\rangle+\mathrm{Tr}\big(\boldsymbol{H}^{-1}\boldsymbol{K}_s\big)/m+o(1/m)$. Hence Eq. (25) can be recast in the form of our general scaling law (4), namely:

\begin{align*}
R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))-R\big(\boldsymbol{\theta}_*\big) \approx \alpha^2 R_{\mathsf{su}}^{\mathsf{ex}}(\infty)+\Big[\alpha^2\big(R_{\mathsf{su}}^{\mathsf{ex}}(m)-R_{\mathsf{su}}^{\mathsf{ex}}(\infty)\big)+(1-\alpha)^2 R_{\mathsf{or}}^{\mathsf{ex}}(n)\Big]\,,
\end{align*}

which (as expected) corresponds to the parametric scaling exponent $\beta=1$.
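To make the recasting step explicit, the following short derivation matches the scaling-law form term by term against the leading-order part of Eq. (25). It is a sketch that takes $R_{\mathsf{su}}^{\mathsf{ex}}(\infty)$ to be the $m\to\infty$ limit of the surrogate-only excess error and ignores the remainder of Eq. (25).

```latex
\begin{align*}
R_{\mathsf{su}}^{\mathsf{ex}}(\infty)
  &= \langle \boldsymbol{g}^s, \boldsymbol{H}^{-1}\boldsymbol{g}^s\rangle\,,\\
R_{\mathsf{su}}^{\mathsf{ex}}(m) - R_{\mathsf{su}}^{\mathsf{ex}}(\infty)
  &= \frac{1}{m}\,\mathrm{Tr}\big(\boldsymbol{H}^{-1}\boldsymbol{K}_s\big) + o(1/m)\,,\\
R_{\mathsf{or}}^{\mathsf{ex}}(n)
  &= \frac{1}{n}\,\mathrm{Tr}\big(\boldsymbol{H}^{-1}\boldsymbol{K}\big) + o(1/n)\,,\\
\alpha^2 R_{\mathsf{su}}^{\mathsf{ex}}(\infty)
  + \alpha^2\big(R_{\mathsf{su}}^{\mathsf{ex}}(m) - R_{\mathsf{su}}^{\mathsf{ex}}(\infty)\big)
  + (1-\alpha)^2 R_{\mathsf{or}}^{\mathsf{ex}}(n)
  &= \alpha^2 \langle \boldsymbol{g}^s, \boldsymbol{H}^{-1}\boldsymbol{g}^s\rangle
  + \frac{\alpha^2}{m}\,\mathrm{Tr}\big(\boldsymbol{H}^{-1}\boldsymbol{K}_s\big)
  + \frac{(1-\alpha)^2}{n}\,\mathrm{Tr}\big(\boldsymbol{H}^{-1}\boldsymbol{K}\big)
  + o(1/m) + o(1/n)\,.
\end{align*}
```

The two sample-size-dependent terms decay as $1/m$ and $1/n$, which is what the exponent $\beta=1$ expresses.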

An immediate consequence of Proposition B.1 is that surrogate data do not hurt, and will help if their distribution is close enough to the original one (under the assumption of optimally chosen $\alpha$).

Corollary B.2.

Under the assumptions of Proposition B.1, let $\overline{R}_{\mathsf{or}}(n):=\mathrm{Tr}\big(\boldsymbol{H}^{-1}\boldsymbol{K}\big)/n$ and $\overline{R}_{\mathsf{su}}(m):=\langle\boldsymbol{g}^s,\boldsymbol{H}^{-1}\boldsymbol{g}^s\rangle+\mathrm{Tr}\big(\boldsymbol{H}^{-1}\boldsymbol{K}_s\big)/m$. For $\alpha^*_{n,m}=\overline{R}_{\mathsf{or}}(n)/(\overline{R}_{\mathsf{su}}(m)+\overline{R}_{\mathsf{or}}(n))$, we have

\begin{align*}
R\big(\hat{\boldsymbol{\theta}}_{n,m}(\alpha^*_{n,m})\big)-R_* = \big(\overline{R}_{\mathsf{or}}(n)^{-1}+\overline{R}_{\mathsf{su}}(m)^{-1}\big)^{-1}+\Delta_{n,m}\,,
\end{align*}

with $\Delta_{n,m}$ of the same order as the error in Prop. B.1.
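The weight $\alpha^*_{n,m}$ is the minimizer of the leading-order quadratic $\alpha^2\,\overline{R}_{\mathsf{su}}(m)+(1-\alpha)^2\,\overline{R}_{\mathsf{or}}(n)$, and the minimum value is the harmonic-type combination in the display. A minimal numerical check, with hypothetical values for the two risks and illustrative function names, could look like this:

```python
import numpy as np


def optimal_weight_and_risk(R_or, R_su):
    """Closed-form alpha* and combined risk from Corollary B.2 (leading order)."""
    alpha_star = R_or / (R_su + R_or)
    combined = 1.0 / (1.0 / R_or + 1.0 / R_su)
    return alpha_star, combined


def quadratic_risk(alpha, R_or, R_su):
    """Leading-order excess risk as a function of the weight alpha."""
    return alpha**2 * R_su + (1 - alpha) ** 2 * R_or


# Hypothetical values of the two leading-order risks.
R_or, R_su = 0.0175, 0.0235
alpha_star, combined = optimal_weight_and_risk(R_or, R_su)

# Brute-force comparison on a grid of weights in [0, 1].
grid = np.linspace(0.0, 1.0, 1001)
grid_min = quadratic_risk(grid, R_or, R_su).min()
print(alpha_star, combined, grid_min)
```

Since the combined value never exceeds either $\overline{R}_{\mathsf{or}}(n)$ or $\overline{R}_{\mathsf{su}}(m)$, this is the leading-order sense in which optimally weighted surrogate data do not hurt.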

B.2 Proofs

Lemma B.3.

Under the assumptions of Proposition B.1 (Assumption 1 and Assumption 2), there exists $\alpha_{\max}\in(0,1]$, depending only on the constants appearing there, such that the following holds:

  1. $(i)$ The function $\boldsymbol{\theta}\mapsto R(\boldsymbol{\theta};\alpha):=(1-\alpha)\,R(\boldsymbol{\theta})+\alpha\,R^s(\boldsymbol{\theta})$ has a unique minimizer $\boldsymbol{\theta}_*(\alpha)\in\mathbb{R}^d$. Further, $\boldsymbol{\theta}_*(\alpha)\in\mathsf{B}(\boldsymbol{\theta}_*,r)$, and $\boldsymbol{\theta}_*(\alpha)\to\boldsymbol{\theta}_*$ as $\alpha\downarrow 0$.

  2. $(ii)$ We have $\hat{\boldsymbol{\theta}}_{n,m}(\alpha)\to\boldsymbol{\theta}_*$ in probability as $n,m\to\infty$.

Proof.

Fix $r_0\in(0,r]$. By Assumption 2.$(a)$, $\inf_{\boldsymbol{\theta}\not\in\mathsf{B}(\boldsymbol{\theta}_*;r_0)}R(\boldsymbol{\theta})>R(\boldsymbol{\theta}_*)+\delta_0$ for some constant $\delta_0$. Hence, using Assumption 1, for any $\boldsymbol{\theta}\not\in\mathsf{B}(\boldsymbol{\theta}_*;r_0)$

\begin{align*}
R(\boldsymbol{\theta};\alpha) &\geq R(\boldsymbol{\theta})-K_*\alpha\big[1+R(\boldsymbol{\theta})\big]\\
&\geq (1-K_*\alpha)R(\boldsymbol{\theta})-K_*\alpha\\
&\geq (1-K_*\alpha)(R(\boldsymbol{\theta}_*)+\delta_0)-K_*\alpha\,.
\end{align*}

On the other hand, $R(\boldsymbol{\theta}_*;\alpha)\leq(1+K_*\alpha)R(\boldsymbol{\theta}_*)+K_*\alpha$, whence

\begin{align*}
R(\boldsymbol{\theta};\alpha)-R(\boldsymbol{\theta}_*;\alpha) \geq (1-K_*\alpha)\delta_0-2K_*\alpha R(\boldsymbol{\theta}_*)-2K_*\alpha\,,
\end{align*}

which is strictly positive for $\alpha<\alpha_{\max}(r_0):=\delta_0/\big(4K_*(1+R(\boldsymbol{\theta}_*))\big)$. Hence the minimum must be achieved in $\mathsf{B}(\boldsymbol{\theta}_*;r_0)$ (note that, since $R(\boldsymbol{\theta})$ and $R_s(\boldsymbol{\theta})$ are lower semicontinuous, the minimum is achieved).

By Assumption 2.$(d)$, for $r_0$ sufficiently small, $\boldsymbol{\theta}\mapsto R(\boldsymbol{\theta};\alpha)$ is strictly convex in $\mathsf{B}(\boldsymbol{\theta}_*;r_0)$, and therefore the minimizer is unique. This proves point $(i)$.

Point $(ii)$ follows from a modification of Theorem 5.14 in [vdV00]. Namely, for a diverging sequence $\{(n(k),m(k)):k\in\mathbb{N}\}$, we consider $\widehat{R}_{*,k}(\boldsymbol{u}):=\widehat{R}_{n(k),m(k)}(c(\boldsymbol{u})\boldsymbol{u};\alpha)$, where $c(\boldsymbol{u}):=(1+\|\boldsymbol{u}\|^2)^{-1/2}$. This function is lower semicontinuous on the compact set $\mathsf{B}(\boldsymbol{0};1)$ and converges almost surely to its expectation for every fixed $\boldsymbol{u}$ in this set, and hence the argument of Theorem 5.14 in [vdV00] applies here. ∎

Proof of Proposition B.1.

By a modification of Theorem 5.39 in [vdV00] (here $\boldsymbol{\theta}_*(\alpha)$ is defined as in Lemma B.3),

\begin{align}
\hat{\boldsymbol{\theta}}_{n,m}(\alpha) = &\;\boldsymbol{\theta}_*(\alpha)+\frac{1-\alpha}{n}\boldsymbol{H}(\alpha)^{-1}\sum_{i=1}^{n}\big[\nabla\ell(\boldsymbol{\theta}_*(\alpha);\boldsymbol{z}_i)-\mathbb{E}\nabla\ell(\boldsymbol{\theta}_*(\alpha);\boldsymbol{z})\big] \tag{26}\\
&\;\;+\frac{\alpha}{m}\boldsymbol{H}(\alpha)^{-1}\sum_{i=1}^{m}\big[\nabla\ell(\boldsymbol{\theta}_*(\alpha);\boldsymbol{z}^s_i)-\mathbb{E}_s\nabla\ell(\boldsymbol{\theta}_*(\alpha);\boldsymbol{z}^s)\big]+O_P(m^{-1}+n^{-1})\,, \tag{27}
\end{align}

where $\boldsymbol{H}(\alpha):=(1-\alpha)\nabla^2 R(\boldsymbol{\theta}_*(\alpha))+\alpha\nabla^2 R_s(\boldsymbol{\theta}_*(\alpha))$. Note that in the present setting the error is of order $m^{-1}+n^{-1}$ because we assume the Hessian to be Lipschitz continuous.

The population minimizer $\boldsymbol{\theta}_*(\alpha)$ solves

\begin{align*}
\boldsymbol{0} &= \nabla R(\boldsymbol{\theta}_*(\alpha);\alpha)\\
&= \nabla R(\boldsymbol{\theta}_*;\alpha)+\nabla^2 R(\boldsymbol{\theta}_*;\alpha)(\boldsymbol{\theta}_*(\alpha)-\boldsymbol{\theta}_*)+\int_0^1\big[\nabla^2 R(\boldsymbol{\theta}_t;\alpha)-\nabla^2 R(\boldsymbol{\theta}_*;\alpha)\big](\boldsymbol{\theta}_*(\alpha)-\boldsymbol{\theta}_*)\,{\rm d}t\,,
\end{align*}

where $\boldsymbol{\theta}_t=t\,\boldsymbol{\theta}_*(\alpha)+(1-t)\,\boldsymbol{\theta}_*$. Denoting by $L_2$ the Lipschitz constant of the Hessian (in operator norm), and recalling that $\nabla R(\boldsymbol{\theta}_*)=\boldsymbol{0}$, we have

\begin{align*}
\nabla^2 R(\boldsymbol{\theta}_*;\alpha)(\boldsymbol{\theta}_*(\alpha)-\boldsymbol{\theta}_*) &= -\alpha\nabla R_s(\boldsymbol{\theta}_*)+\boldsymbol{u}\,,\\
\|\boldsymbol{u}\|_2 &\leq L_2\|\boldsymbol{\theta}_*(\alpha)-\boldsymbol{\theta}_*\|^2\,.
\end{align*}

Recalling that, by Lemma B.3, $\boldsymbol{\theta}_*(\alpha)\to\boldsymbol{\theta}_*$ as $\alpha\to 0$, this implies

\begin{align}
\boldsymbol{\theta}_*(\alpha)-\boldsymbol{\theta}_* = -\boldsymbol{H}^{-1}\nabla R_s(\boldsymbol{\theta}_*)\,\alpha+O\Big(\big(\|\nabla R_s(\boldsymbol{\theta}_*)\|_2\vee\|\nabla R_s(\boldsymbol{\theta}_*)\|_2^2\big)\alpha^2\Big)\,. \tag{28}
\end{align}
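The first-order expansion in Eq. (28) can be checked on a toy example in which both population risks are quadratic, so that $\boldsymbol{\theta}_*(\alpha)$ is available in closed form. The sketch below assumes $R(\boldsymbol{\theta})=\tfrac12(\boldsymbol{\theta}-\boldsymbol{a})^{\sf T}\boldsymbol{A}(\boldsymbol{\theta}-\boldsymbol{a})$ and $R_s(\boldsymbol{\theta})=\tfrac12(\boldsymbol{\theta}-\boldsymbol{b})^{\sf T}\boldsymbol{B}(\boldsymbol{\theta}-\boldsymbol{b})$; the matrices, vectors, and function names are hypothetical and serve only as an illustration.

```python
import numpy as np

# Hypothetical quadratic risks: theta_* = a, H = A, grad R_s(theta_*) = B (a - b).
A = np.diag([1.0, 2.0])
B = np.array([[2.0, 0.5], [0.5, 1.0]])
a = np.array([0.0, 0.0])
b = np.array([1.0, -1.0])


def theta_star(alpha):
    """Exact minimizer of (1 - alpha) R + alpha R_s for the quadratic risks."""
    M = (1 - alpha) * A + alpha * B
    rhs = (1 - alpha) * A @ a + alpha * B @ b
    return np.linalg.solve(M, rhs)


def first_order(alpha):
    """First-order approximation theta_* - alpha H^{-1} grad R_s(theta_*)."""
    grad_Rs = B @ (a - b)
    return a - alpha * np.linalg.solve(A, grad_Rs)


def remainder(alpha):
    """Norm of the gap between the exact minimizer and its expansion."""
    return np.linalg.norm(theta_star(alpha) - first_order(alpha))


# Doubling alpha should roughly quadruple an O(alpha^2) remainder.
ratio = remainder(2e-3) / remainder(1e-3)
print(ratio)
```

A ratio close to 4 is consistent with a remainder of order $\alpha^2$, as stated in Eq. (28).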

Substituting in Eq. (26), we get

\begin{align}
\hat{\boldsymbol{\theta}}_{n,m}(\alpha)-\boldsymbol{\theta}_* = &-\boldsymbol{H}^{-1}\nabla R_s(\boldsymbol{\theta}_*)\,\alpha+\frac{1-\alpha}{n}\boldsymbol{H}(\alpha)^{-1}\sum_{i=1}^{n}\big[\nabla\ell(\boldsymbol{\theta}_*(\alpha);\boldsymbol{z}_i)-\mathbb{E}\nabla\ell(\boldsymbol{\theta}_*(\alpha);\boldsymbol{z})\big] \tag{29}\\
&\;\;+\frac{\alpha}{m}\boldsymbol{H}(\alpha)^{-1}\sum_{i=1}^{m}\big[\nabla\ell(\boldsymbol{\theta}_*(\alpha);\boldsymbol{z}^s_i)-\mathbb{E}_s\nabla\ell(\boldsymbol{\theta}_*(\alpha);\boldsymbol{z}^s)\big]+\boldsymbol{\Delta}\,, \tag{30}\\
\|\boldsymbol{\Delta}\| \leq &\; C\Big(\|\nabla R_s(\boldsymbol{\theta}_*)\|_2\vee\|\nabla R_s(\boldsymbol{\theta}_*)\|_2^2\Big)\alpha^2+\frac{C}{m\wedge n}\,. \tag{31}
\end{align}

The claim follows by substituting the above in

$$\mathbb{E}R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))-R(\boldsymbol{\theta})=\mathbb{E}\big\langle\hat{\boldsymbol{\theta}}_{n,m}(\alpha)-\boldsymbol{\theta}_{*},\boldsymbol{H}(\hat{\boldsymbol{\theta}}_{n,m}(\alpha)-\boldsymbol{\theta}_{*})\big\rangle+O\Big(\mathbb{E}\|\hat{\boldsymbol{\theta}}_{n,m}(\alpha)-\boldsymbol{\theta}_{*}\|^{3}\Big) \tag{32}$$

and using $\boldsymbol{H}(\alpha)=\boldsymbol{H}+O(\alpha)$. ∎
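As a sanity check on the first-order expansion in Eq. (28), the following minimal sketch uses two quadratic risks, for which the minimizer of the mixture $(1-\alpha)R+\alpha R_s$ is available in closed form. The quadratic form of $R$ and $R_s$, the names `A`, `B`, `a`, `b`, and the identification of $\boldsymbol{H}$ with the Hessian of $R$ at $\boldsymbol{\theta}_*$ are illustrative assumptions and are not part of the proof.

```python
import numpy as np

# Illustrative assumption: quadratic risks
#   R(theta)   = 0.5 * (theta - a)^T A (theta - a),   minimizer theta_* = a, Hessian H = A
#   R_s(theta) = 0.5 * (theta - b)^T B (theta - b),   grad R_s(theta_*) = B (a - b)

def mixture_minimizer(A, a, B, b, alpha):
    """Exact minimizer theta_*(alpha) of (1 - alpha) * R + alpha * R_s."""
    lhs = (1.0 - alpha) * A + alpha * B
    rhs = (1.0 - alpha) * (A @ a) + alpha * (B @ b)
    return np.linalg.solve(lhs, rhs)

def first_order_shift(A, a, B, b, alpha):
    """Leading term of Eq. (28): -H^{-1} grad R_s(theta_*) * alpha."""
    grad_Rs = B @ (a - b)
    return -alpha * np.linalg.solve(A, grad_Rs)

def expansion_error(A, a, B, b, alpha):
    """Norm of theta_*(alpha) - theta_* minus the first-order term."""
    exact_shift = mixture_minimizer(A, a, B, b, alpha) - a
    return np.linalg.norm(exact_shift - first_order_shift(A, a, B, b, alpha))

# Hypothetical well-conditioned example in dimension d = 4.
rng = np.random.default_rng(0)
d = 4
M1, M2 = rng.standard_normal((d, d)), rng.standard_normal((d, d))
A = M1 @ M1.T + d * np.eye(d)
B = M2 @ M2.T + d * np.eye(d)
a, b = rng.standard_normal(d), rng.standard_normal(d)

for alpha in (1e-1, 1e-2, 1e-3):
    print(alpha, expansion_error(A, a, B, b, alpha))
```

In this toy setting the printed error should shrink roughly by a factor of 100 each time $\alpha$ shrinks by a factor of 10, which is the $O(\alpha^2)$ behavior of the remainder in Eq. (28).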

Appendix C Gaussian sequence model: Proofs for Section 3.1

C.1 General ridge regression

We define $\hat{\boldsymbol{\Sigma}}=\boldsymbol{X}^{\sf T}\boldsymbol{X}/n$, $\hat{\boldsymbol{\Sigma}}_{s}=\boldsymbol{X}_{s}^{\sf T}\boldsymbol{X}_{s}/m$, and $\hat{\boldsymbol{\Sigma}}_{\alpha}=(1-\alpha)\hat{\boldsymbol{\Sigma}}+\alpha\hat{\boldsymbol{\Sigma}}_{s}$. We then have

\begin{align}
R_{n,m}(\alpha,\lambda) &= B_{n,m}(\alpha,\lambda)+\frac{(1-\alpha)^{2}\sigma^{2}}{n}\cdot V_{n,m}(\alpha,\lambda)+\frac{\alpha^{2}\sigma_{s}^{2}}{m}\cdot V^{s}_{n,m}(\alpha,\lambda)\,, \tag{33}\\
B_{n,m}(\alpha,\lambda) &:= \Big\|\boldsymbol{\Sigma}^{1/2}(\boldsymbol{\Omega}+\hat{\boldsymbol{\Sigma}}_{\alpha})^{-1}\big(\boldsymbol{\Omega}\boldsymbol{\theta}_{*}-\alpha\hat{\boldsymbol{\Sigma}}_{s}(\boldsymbol{\theta}_{*}^{s}-\boldsymbol{\theta}_{*})\big)\Big\|^{2}\,, \tag{34}\\
V_{n,m}(\alpha,\lambda) &:= \mathrm{Tr}\Big((\boldsymbol{\Omega}+\hat{\boldsymbol{\Sigma}}_{\alpha})^{-1}\hat{\boldsymbol{\Sigma}}(\boldsymbol{\Omega}+\hat{\boldsymbol{\Sigma}}_{\alpha})^{-1}\boldsymbol{\Sigma}\Big)\,, \tag{35}\\
V^{s}_{n,m}(\alpha,\lambda) &:= \mathrm{Tr}\Big((\boldsymbol{\Omega}+\hat{\boldsymbol{\Sigma}}_{\alpha})^{-1}\hat{\boldsymbol{\Sigma}}_{s}(\boldsymbol{\Omega}+\hat{\boldsymbol{\Sigma}}_{\alpha})^{-1}\boldsymbol{\Sigma}\Big)\,. \tag{36}
\end{align}
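The quantities in Eqs. (34)-(36) can be evaluated directly for finite-dimensional design matrices. The sketch below is a minimal NumPy illustration; the function name `ridge_terms`, the argument layout, and the choice to pass the penalty matrix $\boldsymbol{\Omega}$ as a dense array are assumptions made for the example.

```python
import numpy as np

def ridge_terms(X, Xs, Sigma, Omega, theta, theta_s, alpha):
    """Evaluate B, V and V^s of Eqs. (34)-(36) for finite-dimensional inputs.

    X, Xs    : real (n x d) and surrogate (m x d) design matrices
    Sigma    : (d x d) population covariance of the real covariates
    Omega    : (d x d) ridge penalty matrix
    theta    : true parameter theta_* (length d)
    theta_s  : surrogate parameter theta_*^s (length d)
    alpha    : weight of the surrogate data, in [0, 1]
    """
    n, m = X.shape[0], Xs.shape[0]
    S_hat = X.T @ X / n                       # \hat{Sigma}
    Ss_hat = Xs.T @ Xs / m                    # \hat{Sigma}_s
    S_alpha = (1.0 - alpha) * S_hat + alpha * Ss_hat
    G = np.linalg.inv(Omega + S_alpha)        # (Omega + \hat{Sigma}_alpha)^{-1}

    # Bias: || Sigma^{1/2} v ||^2 = v^T Sigma v
    v = G @ (Omega @ theta - alpha * (Ss_hat @ (theta_s - theta)))
    B = float(v @ Sigma @ v)

    # Variance traces for the real and the surrogate noise
    V = float(np.trace(G @ S_hat @ G @ Sigma))
    Vs = float(np.trace(G @ Ss_hat @ G @ Sigma))
    return B, V, Vs
```

When both empirical covariances and $\boldsymbol{\Sigma}$ equal the identity and $\boldsymbol{\Omega}$ is diagonal, these expressions reduce to coordinate-wise sums, which gives a convenient consistency check for an implementation.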

C.2 Proof of Theorem 1

Without loss of generality, we can assume $\boldsymbol{\Omega}={\rm diag}((\omega_{k})_{k\geq 1})$ with $\omega_{k}$ non-decreasing. A simple calculation gives the following general expression for the test error:

\begin{align}
R_{n,m}(\alpha,\lambda) &= B_{n,m}(\alpha,\lambda)+s_{n,m}(\alpha)\cdot V_{n,m}(\alpha,\lambda)\,, \tag{37}\\
B_{n,m}(\alpha,\lambda) &:= \sum_{k=1}^{\infty}\Big(\frac{1}{1+\lambda\omega_{k}}\Big)^{2}\big[(\alpha+\lambda\omega_{k})\theta_{*,k}-\alpha\theta_{*,k}^{s}\big]^{2}\,, \tag{38}\\
V_{n,m}(\alpha,\lambda) &:= \sum_{k=1}^{\infty}\Big(\frac{1}{1+\lambda\omega_{k}}\Big)^{2}\,, \tag{39}\\
s_{n,m}(\alpha) &:= (1-\alpha)^{2}\frac{\sigma^{2}}{n}+\alpha^{2}\frac{\sigma^{2}_{s}}{m}\,. \tag{40}
\end{align}
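For concreteness, Eqs. (37)-(40) can be evaluated numerically once the infinite sums are truncated. The sketch below assumes a truncation to the first $K$ coordinates (the length of the input arrays); the function name `sequence_risk` and the example decay profiles are hypothetical choices, not taken from the paper.

```python
import numpy as np

def sequence_risk(theta, theta_s, omega, alpha, lam, sigma2, sigma2_s, n, m):
    """Truncated evaluation of Eqs. (37)-(40) in the Gaussian sequence model.

    theta, theta_s : first K coordinates of theta_* and theta_*^s
    omega          : first K penalty weights omega_k (non-decreasing)
    Returns (R, B, V, s).
    """
    theta, theta_s, omega = map(np.asarray, (theta, theta_s, omega))
    shrink = 1.0 / (1.0 + lam * omega)                       # 1 / (1 + lambda * omega_k)
    B = np.sum(shrink**2 * ((alpha + lam * omega) * theta - alpha * theta_s) ** 2)  # Eq. (38)
    V = np.sum(shrink**2)                                    # Eq. (39)
    s = (1.0 - alpha) ** 2 * sigma2 / n + alpha**2 * sigma2_s / m                   # Eq. (40)
    return B + s * V, B, V, s                                # Eq. (37)

# Hypothetical example: polynomially decaying coefficients, omega_k = k^mu.
K = 2000
k = np.arange(1, K + 1, dtype=float)
theta = k ** -1.5
theta_s = 0.9 * theta
omega = k ** 1.0
for alpha in (0.0, 0.3, 0.6):
    R, B, V, s = sequence_risk(theta, theta_s, omega, alpha, lam=1e-2,
                               sigma2=1.0, sigma2_s=1.0, n=200, m=2000)
    print(f"alpha={alpha:.1f}  R={R:.4f}  B={B:.4f}  s*V={s * V:.4f}")
```

Scanning over `alpha` and `lam` in this way shows how the bias term and the noise term $s_{n,m}(\alpha)V_{n,m}$ trade off against each other in the truncated model.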

We define (with $k_{1}=0$ if the condition is never verified)

$$k_{1}:=\max\big\{k:\,\lambda\omega_{k}\leq 1\big\}\,. \tag{41}$$

Note that

\begin{align}
0<k\leq k_{1} &\;\;\Rightarrow\;\; 0<\lambda\omega_{k}\leq 1\,, \tag{42}\\
k_{1}<k &\;\;\Rightarrow\;\; 1<\lambda\omega_{k}\,. \tag{43}
\end{align}
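The cutoff $k_1$ of Eq. (41) is straightforward to compute when the weights are stored as a sorted array. The following is a small sketch under the assumptions that $\lambda\ge 0$, that `omega` holds $\omega_1,\omega_2,\dots$ in non-decreasing order, and that indices are reported in the 1-based convention of the text; the function name `cutoff_index` is hypothetical.

```python
import numpy as np

def cutoff_index(omega, lam):
    """Return k_1 = max{k : lam * omega_k <= 1}, with k_1 = 0 if no k qualifies.

    Because omega is non-decreasing and lam >= 0, the qualifying indices form
    a prefix 1, ..., k_1, so k_1 equals the number of entries with
    lam * omega_k <= 1.
    """
    scaled = lam * np.asarray(omega, dtype=float)
    return int(np.searchsorted(scaled, 1.0, side="right"))

# Hypothetical example with omega_k = k.
omega = np.arange(1, 11, dtype=float)
print(cutoff_index(omega, 0.25))   # lam * omega = 0.25, 0.5, 0.75, 1.0, 1.25, ...
print(cutoff_index(omega, 2.0))    # no index satisfies the condition
```

The implications (42) and (43) then correspond to the two slices `omega[:k1]` and `omega[k1:]` of the array.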

We now estimate the various sums by splitting them according to the value of $k$:

$$\begin{aligned}
B_{n,m} &\leq \sum_{k=1}^{k_{1}}\big[(\alpha+\lambda\omega_{k})\theta_{*,k}-\alpha\theta_{*,k}^{s}\big]^{2}+\sum_{k=k_{1}+1}^{\infty}\frac{1}{(\lambda\omega_{k})^{2}}\big[(\alpha+\lambda\omega_{k})\theta_{*,k}-\alpha\theta_{*,k}^{s}\big]^{2}\\
&\leq \sum_{k=1}^{k_{1}}\big[\alpha^{2}(\theta_{*,k}-\theta_{*,k}^{s})^{2}+2\alpha(\theta_{*,k}-\theta_{*,k}^{s})\lambda\omega_{k}\theta_{*,k}+(\lambda\omega_{k})^{2}\theta^{2}_{*,k}\big]\\
&\phantom{AA}+\sum_{k=k_{1}+1}^{\infty}\Big[\frac{\alpha^{2}}{(\lambda\omega_{k})^{2}}(\theta_{*,k}-\theta_{*,k}^{s})^{2}+\frac{2\alpha}{\lambda\omega_{k}}(\theta_{*,k}-\theta_{*,k}^{s})\theta_{*,k}+\theta^{2}_{*,k}\Big]\\
&\leq \alpha^{2}\|\boldsymbol{\theta}_{*,\leq k_{1}}-\boldsymbol{\theta}^{s}_{*,\leq k_{1}}\|^{2}+\frac{2\alpha}{\omega_{k_{1}}}\big|\langle\boldsymbol{\theta}_{*,\leq k_{1}}-\boldsymbol{\theta}^{s}_{*,\leq k_{1}},\boldsymbol{\theta}_{*,\leq k_{1}}\rangle_{\boldsymbol{\Omega}}\big|+\frac{1}{\omega_{k_{1}}^{2}}\|\boldsymbol{\theta}_{*,\leq k_{1}}\|_{\boldsymbol{\Omega}^{2}}^{2}\\
&\phantom{AA}+\alpha^{2}\omega_{k_{1}+1}^{2}\|\boldsymbol{\theta}_{*,>k_{1}}-\boldsymbol{\theta}^{s}_{*,>k_{1}}\|_{\boldsymbol{\Omega}^{-2}}^{2}+2\alpha\omega_{k_{1}+1}\big|\langle\boldsymbol{\theta}_{*,>k_{1}}-\boldsymbol{\theta}^{s}_{*,>k_{1}},\boldsymbol{\theta}_{*,>k_{1}}\rangle_{\boldsymbol{\Omega}^{-1}}\big|+\|\boldsymbol{\theta}_{*,>k_{1}}\|^{2}\,,
\end{aligned}$$

and

$$V_{n,m}\leq k_{1}+\sum_{k>k_{1}}\frac{\omega_{k_{1}+1}^{2}}{\omega_{k}^{2}}\leq(k_{1}+c_{\#})\,,$$

since under the assumption $\omega_{k}\asymp k^{\mu}$, $\mu>1/2$, we have $\sum_{k>k_{1}}(\omega_{k_{1}+1}/\omega_{k})^{2}\leq c_{\#}$.

Recalling the definitions in the theorem, and letting

$$\delta_{k}:=\max\Big(\omega_{k+1}\big|\langle\boldsymbol{\theta}_{*,>k}-\boldsymbol{\theta}^{s}_{*,>k},\boldsymbol{\theta}_{*,>k}\rangle_{\boldsymbol{\Omega}^{-1}}\big|\,;\;\omega_{k+1}^{2}\|\boldsymbol{\theta}_{*,>k}-\boldsymbol{\theta}^{s}_{*,>k}\|_{\boldsymbol{\Omega}^{-2}}^{2}\Big)\,,$$

we have

$$B_{n,m}\leq\alpha^{2}\|\boldsymbol{\theta}_{*}-\boldsymbol{\theta}^{s}_{*}\|^{2}+\|\boldsymbol{\theta}_{*,>k_{1}}\|^{2}+\frac{1}{\omega_{k_{1}}^{2}}\|\boldsymbol{\theta}_{*,\leq k_{1}}\|_{\boldsymbol{\Omega}^{2}}^{2}+3\delta_{k_{1}}+2\Delta_{k_{1}}\,,$$

whence

$$\begin{aligned}
R_{n,m}(\alpha,\lambda) &\leq \alpha^{2}\|\boldsymbol{\theta}_{*}-\boldsymbol{\theta}^{s}_{*}\|^{2}+\|\boldsymbol{\theta}_{*,>k_{1}}\|^{2}+\frac{1}{\omega_{k_{1}}^{2}}\|\boldsymbol{\theta}_{*,\leq k_{1}}\|_{\boldsymbol{\Omega}^{2}}^{2}+(k_{1}+c_{\#})\cdot s_{n,m}(\alpha)+3\delta_{k_{1}}+2\Delta_{k_{1}}\\
&= \alpha^{2}R^{\mathsf{ex}}_{\mathsf{su}}(\infty)+\|\boldsymbol{\theta}_{*,>k_{1}}\|^{2}+\frac{1}{\omega_{k_{1}}^{2}}\|\boldsymbol{\theta}_{*,\leq k_{1}}\|_{\boldsymbol{\Omega}^{2}}^{2}+(k_{1}+c_{\#})\cdot s_{n,m}(\alpha)+3\delta_{k_{1}}+2\Delta_{k_{1}}\,.
\end{aligned}$$

Next we specialize to the case $\|\boldsymbol{\theta}_{*,>k}\|^{2}\leq C_{\theta}k^{-2\rho}$, $\omega_{k}\asymp k^{\mu}$, with $\mu\neq\rho$. In this case we have $\omega_{k}^{-2}\|\boldsymbol{\theta}_{*,\leq k}\|_{\boldsymbol{\Omega}^{2}}^{2}\leq Ck^{-2(\mu\wedge\rho)}$, and therefore, by suitably adjusting the constant $C$,

$$R_{n,m}(\alpha,\lambda) \leq \alpha^{2} R^{\mathsf{ex}}_{\mathsf{su}}(\infty) + C k_{1}^{-2(\mu\wedge\rho)} + (k_{1}+c_{\#})\cdot s_{n,m}(\alpha) + 3\delta_{k_{1}} + 2\Delta_{k_{1}}\,.$$

We now bound $\delta_{k}$. By Cauchy-Schwarz and monotonicity of $\omega$,

$$\omega_{k+1} \big| \langle \boldsymbol{\theta}_{*,>k} - \boldsymbol{\theta}^{s}_{*,>k}, \boldsymbol{\theta}_{*,>k} \rangle_{\boldsymbol{\Omega}^{-1}} \big| \leq \|\boldsymbol{\theta}_{*,>k} - \boldsymbol{\theta}^{s}_{*,>k}\|_{2}\, \|\boldsymbol{\theta}_{*,>k}\|_{2} \leq 2 C_{\theta} k^{-2\rho}\,,$$

and further

$$\omega_{k+1}^{2} \|\boldsymbol{\theta}_{*,>k} - \boldsymbol{\theta}^{s}_{*,>k}\|_{\boldsymbol{\Omega}^{-2}}^{2} \leq 2 \|\boldsymbol{\theta}_{*,>k}\|^{2} + 2 \|\boldsymbol{\theta}^{s}_{*,>k}\|^{2} \leq 4 C_{\theta} k^{-2\rho}\,. \tag{44}$$

Therefore,

$$R_{n,m}(\alpha,\lambda) \leq \alpha^{2} R^{\mathsf{ex}}_{\mathsf{su}}(\infty) + C k_{1}^{-2(\mu\wedge\rho)} + (k_{1}+c_{\#})\cdot s_{n,m}(\alpha) + 2\Delta_{k_{1}}\,.$$

Proof of claim $(a)$. The stated assumption on $\Delta_{k}$ implies that (possibly after adjusting the constant $C$):

$$R_{n,m}(\alpha,\lambda) \leq \alpha^{2} R^{\mathsf{ex}}_{\mathsf{su}}(\infty) + C k_{1}^{-2(\mu\wedge\rho)} + (k_{1}+c_{\#})\cdot s_{n,m}(\alpha)\,.$$

We now select $\lambda_{*}(\alpha)$ so that $k_{1} \asymp s_{n,m}(\alpha)^{-1+\beta}$, where $\beta = 2(\mu\wedge\rho)/(1+2(\mu\wedge\rho))$ (this is possible for all $n,m$ large enough under the assumption on $\omega_{k}$). A straightforward calculation yields:

$$R_{n,m}(\alpha,\lambda_{*}(\alpha)) \leq \alpha^{2} R^{\mathsf{ex}}_{\mathsf{su}}(\infty) + C \cdot s_{n,m}(\alpha)^{\beta}\,,$$

which proves claim $(a)$.
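To see why this choice of $k_{1}$ balances the terms, it may help to spell out the "straightforward calculation". Write $\nu := \mu\wedge\rho$ and $s := s_{n,m}(\alpha)$ (shorthand introduced only for this remark), and assume $s \leq 1$. Since $1-\beta = 1/(1+2\nu)$, taking $k_{1} \asymp s^{-(1-\beta)}$ gives

$$\begin{aligned}
k_{1}^{-2\nu} &\asymp s^{2\nu(1-\beta)} = s^{2\nu/(1+2\nu)} = s^{\beta}\,,\\
k_{1}\cdot s &\asymp s^{1-(1-\beta)} = s^{\beta}\,,\\
c_{\#}\cdot s &\leq c_{\#}\, s^{\beta}\,,
\end{aligned}$$

so each term following $\alpha^{2} R^{\mathsf{ex}}_{\mathsf{su}}(\infty)$ in the previous bound is at most of order $s^{\beta}$.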

Proof of claim $(b)$. We choose $\omega_{k} = k^{\mu}$, $\theta_{*,k} = k^{-\rho^{\prime}-1/2}$, $\theta_{*,k}^{s} = \theta_{*,k} + a_{k} k^{-\rho-1/2}$, with $a_{k} \sim \mathsf{Unif}(\{-A,+A\})$. We will choose $A \leq 1$ to be a sufficiently small numerical constant. Note that, for $\mu > 2\rho + 1/2$,

$$\Delta_{k} = k^{-\mu} \left| \sum_{\ell=1}^{k} a_{\ell}\, \ell^{\mu-2\rho-1} \right| \leq C A k^{-\mu+\varepsilon} \left| \sum_{\ell=1}^{k} \ell^{2\mu-4\rho-2} \right|^{1/2} \leq C A k^{-2\rho-1/2+\varepsilon^{\prime}}\,,$$

where, for any $\varepsilon > 0$, the first inequality holds with probability at least $1/2$ for all $k > k_{0}(\varepsilon)$. We can therefore select the $a_{\ell}$ so that $\Delta_{k} \leq C^{\prime\prime} A k^{-2\rho-\varepsilon}$ for some $C^{\prime\prime} < \infty$.
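As an informal sanity check of the scaling in the last display, the following Python sketch simulates the random signs $a_{\ell}$ and compares $\Delta_{k}$ with the reference rate $A k^{-2\rho-1/2}$ at a single large $k$. The parameter values, the seed, and the helper name `delta_k` are illustrative assumptions, not taken from the paper; the sketch only illustrates that the ratio stays of order one and does not replace the probabilistic argument.

```python
import numpy as np

def delta_k(a, mu, rho):
    """Delta_k = k^{-mu} |sum_{l<=k} a_l l^{mu-2rho-1}| for k = 1..len(a)."""
    ell = np.arange(1, len(a) + 1, dtype=float)
    partial = np.cumsum(a * ell ** (mu - 2 * rho - 1))
    return ell ** (-mu) * np.abs(partial)

rng = np.random.default_rng(0)
mu, rho, A = 2.5, 0.5, 0.1        # hypothetical values with mu > 2*rho + 1/2
k_max, trials = 2000, 400

ratios = np.empty(trials)
for t in range(trials):
    a = A * rng.choice([-1.0, 1.0], size=k_max)   # a_l ~ Unif({-A, +A})
    # Compare Delta_k at k = k_max with the reference rate A * k^{-2 rho - 1/2}.
    ratios[t] = delta_k(a, mu, rho)[-1] / (A * k_max ** (-2 * rho - 0.5))

median_ratio = float(np.median(ratios))
print(f"median of Delta_k / (A k^(-2 rho - 1/2)) at k={k_max}: {median_ratio:.3f}")
```

For these values the sum has standard deviation close to $A k^{\mu-2\rho-1/2}/\sqrt{2}$, so a median ratio of order one is what the bound suggests.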

Following the calculation at point $(a)$, we decompose the bias term as

$$\begin{aligned}
B_{n,m} &= \sum_{k=1}^{\infty} \Big( \frac{1}{1+\lambda\omega_{k}} \Big)^{2} \big[ \alpha^{2} (\theta_{*,k} - \theta_{*,k}^{s})^{2} + (\lambda\omega_{k})^{2} \theta^{2}_{*,k} \big] + 2\alpha E_{n,m}\,,\\
E_{n,m} &:= \sum_{k=1}^{\infty} \Big( \frac{1}{1+\lambda\omega_{k}} \Big)^{2} (\theta_{*,k} - \theta_{*,k}^{s})\, \lambda\omega_{k}\, \theta_{*,k}\,.
\end{aligned}$$

Note that $|E_{n,m}| \leq \delta_{k_{1}} + \Delta_{k_{1}} \leq C A k_{1}^{-2(\mu\wedge\rho)}$. Therefore

$$\begin{aligned}
B_{n,m} - \alpha^{2} \|\boldsymbol{\theta}_{*} - \boldsymbol{\theta}_{*}^{s}\|^{2}
&\geq \sum_{k=1}^{\infty} \Big( \frac{\lambda\omega_{k}}{1+\lambda\omega_{k}} \Big)^{2} \theta^{2}_{*,k} - \alpha^{2} \sum_{k=1}^{\infty} \left[ 1 - \Big( \frac{1}{1+\lambda\omega_{k}} \Big)^{2} \right] (\theta_{*,k} - \theta_{*,k}^{s})^{2} - C A k_{1}^{-2(\mu\wedge\rho)}\\
&\geq \frac{1}{4\omega^{2}_{k_{1}+1}} \|\boldsymbol{\theta}_{*,\leq k_{1}}\|^{2}_{\boldsymbol{\Omega}^{2}} + \frac{1}{4} \|\boldsymbol{\theta}_{*,>k_{1}}\|^{2} - \frac{A}{4\omega_{k_{1}+1}} \|\boldsymbol{\theta}_{*,\leq k_{1}}\|^{2}_{\boldsymbol{\Omega}} - \frac{A}{4} \|\boldsymbol{\theta}_{*,>k_{1}}\|^{2} - C A k_{1}^{-2(\mu\wedge\rho)}\\
&\geq C\, k_{1}^{-2(\mu\wedge\rho)}\,.
\end{aligned}$$

By a similar calculation, we also obtain

$$V_{n,m} \geq C\, k_{1}\,,$$

and therefore

$$R_{n,m}(\alpha,\lambda) \geq \alpha^{2} R^{\mathsf{ex}}_{\mathsf{su}}(\infty) + C k_{1}^{-2(\mu\wedge\rho)} + C k_{1}\cdot s_{n,m}(\alpha)\,.$$

The proof is completed by minimizing over $k_{1}$.
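The minimization over $k_{1}$ can be illustrated numerically. The sketch below minimizes $C k^{-2\nu} + k\, s$ over integers $k$ for several values of $s$ and fits the exponent of the minimum as a function of $s$; here $\nu$ stands for $\mu\wedge\rho$ and $s$ for $s_{n,m}(\alpha)$, and the values $\nu = 1$, $C = 1$, as well as the function names, are assumptions made for illustration only. The fitted slope should be close to $\beta = 2\nu/(1+2\nu)$.

```python
import numpy as np

def min_over_k(s, nu, C=1.0, k_max=5000):
    """Minimize C * k**(-2*nu) + k * s over integers 1 <= k <= k_max."""
    k = np.arange(1, k_max + 1, dtype=float)
    return float(np.min(C * k ** (-2.0 * nu) + k * s))

def fitted_exponent(nu, s_values):
    """Least-squares slope of log(min bound) against log(s)."""
    mins = np.array([min_over_k(s, nu) for s in s_values])
    slope, _ = np.polyfit(np.log(s_values), np.log(mins), 1)
    return float(slope)

nu = 1.0                          # hypothetical value of mu ∧ rho
beta = 2 * nu / (1 + 2 * nu)      # exponent predicted by the scaling law
s_values = np.logspace(-6, -3, 10)
slope = fitted_exponent(nu, s_values)
print(f"fitted exponent {slope:.3f}, predicted beta {beta:.3f}")
```

The grid bound `k_max` is chosen large enough that the minimizer, which grows like $s^{-1/(1+2\nu)}$, lies well inside the grid for the values of $s$ considered.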

Appendix D Analysis of the nonparametric model: Proofs for Section 3.2

This appendix is devoted to proving Theorem 2. Recall that this is established within the white noise model of Eq. (14), which we copy here for the readers’ convenience

$${\rm d}Y = f_{*}(\boldsymbol{x})\,{\rm d}\boldsymbol{x} + \frac{\sigma}{\sqrt{n}}\,{\rm d}B(\boldsymbol{x})\,, \tag{45}$$

The adaptation of the estimator (13) to this continuous setting is given explicitly below

$$\hat{f}_{n,m,\alpha} = \arg\min_{f} \Big\{ (1-\alpha) \|Y - f\|_{2}^{2} + \alpha \|Y_{s} - f\|_{2}^{2} + \lambda \|f\|_{p,2}^{2} \Big\}\,. \tag{46}$$

The proof of Theorem 2 is based on a reduction to a suitable ‘sequence model’ via the Fourier transform, defined as

$$\theta(\boldsymbol{q}) := \int_{[0,1]^{d}} f(\boldsymbol{x})\, e^{-\iota\langle \boldsymbol{q}, \boldsymbol{x} \rangle}\, {\rm d}\boldsymbol{x}\,, \tag{47}$$

for $\boldsymbol{q} \in \mathcal{Q}_{d} := \{ 2\pi\boldsymbol{q} \;:\; \boldsymbol{q} \in \mathbb{Z}^{d} \}$, where $\iota = \sqrt{-1}$. The inverse Fourier transform is defined as

$$f(\boldsymbol{x}) = \frac{1}{(2\pi)^{d}} \sum_{\boldsymbol{q} \in \mathcal{Q}_{d}} \theta(\boldsymbol{q})\, e^{\iota\langle \boldsymbol{q}, \boldsymbol{x} \rangle}\,. \tag{48}$$

We let $\theta_{*}$, $\theta_{*,s}$, and $\hat{\theta}_{\lambda,p,n,m,\alpha}$ respectively denote the Fourier transforms of $f_{*}$, $f_{*,s}$, and $\hat{f}_{\lambda,p,n,m,\alpha}$.

The Fourier transforms of the observations are given by

$$\hat{Y}(\boldsymbol{q}) = \theta_{*}(\boldsymbol{q}) + \frac{\sigma}{\sqrt{n}}\, G(\boldsymbol{q})\,, \qquad \hat{Y}_{s}(\boldsymbol{q}) = \theta_{*,s}(\boldsymbol{q}) + \frac{\sigma_{s}}{\sqrt{m}}\, G_{s}(\boldsymbol{q})\,, \tag{49}$$

where $G(\boldsymbol{q})$ and $G_{s}(\boldsymbol{q})$ are i.i.d. standard Gaussian. It then follows that

$$\hat{\boldsymbol{\theta}}_{n,m}(\alpha) = \arg\min_{\boldsymbol{\theta}} \left\{ (1-\alpha) \|\hat{\boldsymbol{Y}} - \boldsymbol{\theta}\|_{2}^{2} + \alpha \|\hat{\boldsymbol{Y}}_{s} - \boldsymbol{\theta}\|_{2}^{2} + \lambda \|\boldsymbol{\theta}\|_{p,2}^{2} \right\}\,, \tag{50}$$

where we abuse the notation to define

$$\|\boldsymbol{\theta}\|_{p,2}^{2} := \sum_{\boldsymbol{q} \in \mathcal{Q}_{d}} c_{p,\boldsymbol{q}}\, |\theta(\boldsymbol{q})|^{2}\,, \tag{51}$$

with $c_{p,\boldsymbol{q}} := 1 + \|\boldsymbol{q}\|^{2r}$. Minimizing (50) we get

$$\hat{\theta}_{n,m}(\boldsymbol{q}; \alpha) = \frac{1}{1+\lambda c_{p,\boldsymbol{q}}} \big[ (1-\alpha)\, \hat{Y}(\boldsymbol{q}) + \alpha\, \hat{Y}_{s}(\boldsymbol{q}) \big]\,. \tag{52}$$
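The closed form (52) can be checked against the objective (50) on a finite truncation. The sketch below is an illustration under simplifying assumptions: coefficients are treated as real numbers, the frequencies are those of $d = 1$ truncated to $Q$ modes, and the values of $Q$, $r$, $\alpha$, $\lambda$ and all function names are hypothetical. It verifies that the gradient of the objective vanishes at the closed-form solution and that random perturbations increase the objective.

```python
import numpy as np

def objective(theta, y, y_s, c, alpha, lam):
    """Truncated, real-valued version of the objective in (50)."""
    return ((1 - alpha) * np.sum((y - theta) ** 2)
            + alpha * np.sum((y_s - theta) ** 2)
            + lam * np.sum(c * theta ** 2))

def closed_form(y, y_s, c, alpha, lam):
    """Coordinate-wise minimizer, following (52)."""
    return ((1 - alpha) * y + alpha * y_s) / (1 + lam * c)

rng = np.random.default_rng(1)
Q, r = 30, 2.0                       # hypothetical truncation level and smoothness
q = 2 * np.pi * np.arange(Q)         # frequencies 2*pi*k in dimension d = 1
c = 1 + np.abs(q) ** (2 * r)         # weights c_{p,q} = 1 + |q|^{2r}
alpha, lam = 0.3, 1e-3
y, y_s = rng.normal(size=Q), rng.normal(size=Q)

theta_hat = closed_form(y, y_s, c, alpha, lam)
# Gradient of the objective at theta_hat; it should vanish up to rounding.
grad = (-2 * (1 - alpha) * (y - theta_hat)
        - 2 * alpha * (y_s - theta_hat)
        + 2 * lam * c * theta_hat)
base = objective(theta_hat, y, y_s, c, alpha, lam)
worse = all(
    objective(theta_hat + 1e-3 * rng.normal(size=Q), y, y_s, c, alpha, lam) > base
    for _ in range(100)
)
print(np.max(np.abs(grad)), worse)
```

Because the objective is a strictly convex quadratic that separates across coordinates, a vanishing gradient is sufficient for global optimality; the perturbation test is only a redundant check.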

Taking the inverse Fourier transform and plugging it into the excess risk formula we get

$$\begin{aligned}
R(\hat{f}_{n,m,\alpha}) &= \sum_{\boldsymbol{q} \in \mathcal{Q}_{d}} \frac{1}{(1+\lambda c_{p,\boldsymbol{q}})^{2}} \big[ \alpha (\theta_{*,s} - \theta_{*})(\boldsymbol{q}) - \lambda c_{p,\boldsymbol{q}}\, \theta_{*}(\boldsymbol{q}) \big]^{2}\\
&\quad + V_{n,m} \sum_{\boldsymbol{q} \in \mathcal{Q}_{d}} \frac{1}{(1+\lambda c_{p,\boldsymbol{q}})^{2}}\,,
\end{aligned} \tag{53}$$

where

$$V_{n,m} := (1-\alpha)^{2} \frac{\sigma^{2}}{n} + \alpha^{2} \frac{\sigma_{s}^{2}}{m}\,. \tag{54}$$
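A Monte Carlo simulation gives a quick check of the bias-variance structure of this risk. In the sketch below, the squared bias is computed directly from the mean of the estimator (52), the variance term uses $V_{n,m}$ from (54), and their sum is compared with a simulated average of $\|\hat{\boldsymbol{\theta}} - \boldsymbol{\theta}_{*}\|^{2}$. Coefficients are taken real-valued, the model is truncated to $Q$ modes in $d = 1$, and the target and surrogate coefficient sequences and all numerical values are hypothetical choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
Q, r = 40, 1.0                         # hypothetical truncation and smoothness
q = 2 * np.pi * np.arange(Q)
c = 1 + np.abs(q) ** (2 * r)           # weights c_{p,q} = 1 + |q|^{2r}
alpha, lam = 0.4, 1e-3
n, m, sigma, sigma_s = 200, 2000, 1.0, 1.5

theta = 1.0 / (1 + np.arange(Q)) ** 1.5        # hypothetical target coefficients
theta_s = theta + 0.05 / (1 + np.arange(Q))    # hypothetical surrogate coefficients
shrink = 1.0 / (1 + lam * c)

# Closed-form risk: squared bias of the estimator (52) plus the variance term.
mean_est = shrink * ((1 - alpha) * theta + alpha * theta_s)
bias_sq = np.sum((mean_est - theta) ** 2)
V = (1 - alpha) ** 2 * sigma ** 2 / n + alpha ** 2 * sigma_s ** 2 / m
risk_formula = bias_sq + V * np.sum(shrink ** 2)

# Monte Carlo estimate of E || theta_hat - theta ||^2 under the model (49).
trials = 20000
Y = theta + sigma / np.sqrt(n) * rng.normal(size=(trials, Q))
Y_s = theta_s + sigma_s / np.sqrt(m) * rng.normal(size=(trials, Q))
theta_hat = shrink * ((1 - alpha) * Y + alpha * Y_s)
risk_mc = np.mean(np.sum((theta_hat - theta) ** 2, axis=1))
print(f"formula {risk_formula:.5f}, Monte Carlo {risk_mc:.5f}")
```

The two numbers should agree up to Monte Carlo error, which shrinks as `trials` grows.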

The convexity of $x \to x^{2}$ implies

$$(a+b)^{2} = \left( \gamma \frac{a}{\gamma} + (1-\gamma) \frac{b}{1-\gamma} \right)^{2} \leq \frac{a^{2}}{\gamma} + \frac{b^{2}}{1-\gamma} \tag{55}$$

for $\gamma \in (0,1)$ and therefore we can upper bound the first sum in (53) by taking $\gamma = 1/(1+\delta)$ for any $\delta > 0$, which yields

R(fn,m,α)(1+δ)α2𝜽,s𝜽22+1+δδ𝒒𝒬d(λcp,𝒒1+λcp,𝒒)2|θ(𝒒)|2+Vn,m𝒒𝒬d1(1+λcp,𝒒)2.𝑅subscript𝑓𝑛𝑚𝛼1𝛿superscript𝛼2superscriptsubscriptdelimited-∥∥subscript𝜽𝑠subscript𝜽221𝛿𝛿subscript𝒒subscript𝒬𝑑superscript𝜆subscript𝑐𝑝𝒒1𝜆subscript𝑐𝑝𝒒2superscriptsubscript𝜃𝒒2subscript𝑉𝑛𝑚subscript𝒒subscript𝒬𝑑1superscript1𝜆subscript𝑐𝑝𝒒2R(f_{n,m,\alpha})\leq(1+\delta)\alpha^{2}\|{\boldsymbol{\theta}}_{*,s}-{% \boldsymbol{\theta}}_{*}\|_{2}^{2}+\frac{1+\delta}{\delta}\sum_{{\boldsymbol{q% }}\in{\mathcal{Q}}_{d}}\left(\frac{\lambda c_{p,{\boldsymbol{q}}}}{1+\lambda c% _{p,{\boldsymbol{q}}}}\right)^{2}|\theta_{*}({\boldsymbol{q}})|^{2}+V_{n,m}% \sum_{{\boldsymbol{q}}\in{\mathcal{Q}}_{d}}\frac{1}{(1+\lambda c_{p,{% \boldsymbol{q}}})^{2}}\,.start_ROW start_CELL italic_R ( italic_f start_POSTSUBSCRIPT italic_n , italic_m , italic_α end_POSTSUBSCRIPT ) ≤ ( 1 + italic_δ ) italic_α start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ∥ bold_italic_θ start_POSTSUBSCRIPT ∗ , italic_s end_POSTSUBSCRIPT - bold_italic_θ start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ∥ start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + divide start_ARG 1 + italic_δ end_ARG start_ARG italic_δ end_ARG ∑ start_POSTSUBSCRIPT bold_italic_q ∈ caligraphic_Q start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( divide start_ARG italic_λ italic_c start_POSTSUBSCRIPT italic_p , bold_italic_q end_POSTSUBSCRIPT end_ARG start_ARG 1 + italic_λ italic_c start_POSTSUBSCRIPT italic_p , bold_italic_q end_POSTSUBSCRIPT end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT | italic_θ start_POSTSUBSCRIPT ∗ end_POSTSUBSCRIPT ( bold_italic_q ) | start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT + italic_V start_POSTSUBSCRIPT italic_n , italic_m end_POSTSUBSCRIPT ∑ start_POSTSUBSCRIPT bold_italic_q ∈ caligraphic_Q start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT end_POSTSUBSCRIPT divide start_ARG 1 end_ARG start_ARG ( 1 + italic_λ italic_c start_POSTSUBSCRIPT italic_p , bold_italic_q end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 
end_POSTSUPERSCRIPT end_ARG . end_CELL end_ROW (56)
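
The two prefactors in (56) can be read off directly from the choice of $\gamma$. As a sketch, we assume here that the inequality invoked just before (56) is the standard weighted bound on a squared sum; the generic symbols $a$, $b$ are placeholders for illustration and are not the paper's notation.

```latex
\begin{align*}
(a+b)^{2} &\leq \frac{a^{2}}{\gamma}+\frac{b^{2}}{1-\gamma}\,, \qquad \gamma\in(0,1)\,,\\
\gamma=\frac{1}{1+\delta}\quad &\Longrightarrow\quad \frac{1}{\gamma}=1+\delta\,,\qquad \frac{1}{1-\gamma}=\frac{1+\delta}{\delta}\,.
\end{align*}
```

Under this reading, a small $\delta$ keeps the coefficient of $\alpha^{2}\|\boldsymbol{\theta}_{*,s}-\boldsymbol{\theta}_{*}\|_{2}^{2}$ close to one, at the price of inflating the regularization bias sum by a factor of order $1/\delta$.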

D.1 Proof of Theorem 2

We now upper bound the first sum above. Define $q_{0}$ via $\lambda c_{r}(q_{0})=1$ (with an abuse of notation, $c_{r}(t)=1+t^{2r}$), whence $q_{0}\geq(\lambda/2)^{-1/2r}$ for all $\lambda<1$. Then

\begin{align*}
\sum_{\boldsymbol{q}\in\mathcal{Q}_{d}}\left(\frac{\lambda c_{p,\boldsymbol{q}}}{1+\lambda c_{r,\boldsymbol{q}}}\right)^{2}\cdot|\theta_{*}(\boldsymbol{q})|^{2}
&\leq\sum_{\boldsymbol{q}\in\mathcal{Q}_{d},\,\|\boldsymbol{q}\|_{2}\leq q_{0}}\lambda^{2}c_{r}(\boldsymbol{q})^{2}|\theta_{*}(\boldsymbol{q})|^{2}+\sum_{\boldsymbol{q}\in\mathcal{Q}_{d},\,\|\boldsymbol{q}\|_{2}>q_{0}}|\theta_{*}(\boldsymbol{q})|^{2}\\
&\leq\lambda^{2}\max_{\|\boldsymbol{q}\|_{2}\leq q_{0}}\frac{c_{r}(\boldsymbol{q})^{2}}{c_{s}(\boldsymbol{q})}\sum_{\boldsymbol{q}\in\mathcal{Q}_{d},\,\|\boldsymbol{q}\|_{2}\leq q_{0}}c_{s}(\boldsymbol{q})|\theta_{*}(\boldsymbol{q})|^{2}+\max_{\|\boldsymbol{q}\|_{2}>q_{0}}\frac{1}{c_{s}(\boldsymbol{q})}\sum_{\boldsymbol{q}\in\mathcal{Q}_{d},\,\|\boldsymbol{q}\|_{2}>q_{0}}c_{s}(\boldsymbol{q})|\theta_{*}(\boldsymbol{q})|^{2}\\
&\stackrel{(a)}{\leq}\lambda^{2}\max_{\|\boldsymbol{q}\|_{2}\leq q_{0}}\frac{c_{r}(\boldsymbol{q})^{2}}{c_{s}(\boldsymbol{q})}+\max_{\|\boldsymbol{q}\|_{2}>q_{0}}\frac{1}{c_{s}(\boldsymbol{q})}\\
&\leq\lambda^{2}\max\Big(1,\,\frac{c_{r}(q_{0})^{2}}{c_{s}(q_{0})}\Big)+\frac{1}{c_{s}(q_{0})}\\
&\leq C\max(\lambda^{2},\lambda^{p/r})+C\lambda^{p/r}\leq C\lambda^{2\wedge(p/r)}\,,
\end{align*}

where in $(a)$ we used the fact that $\|f_{*}\|_{2,p}^{2}=\sum_{\boldsymbol{q}}c_{s}(\boldsymbol{q})|\theta_{*}(\boldsymbol{q})|^{2}$. Letting $C_{i}(d)$ be constants depending on $d$, we have

\begin{align*}
\sum_{\boldsymbol{q}\in\mathcal{Q}_{d}}\frac{1}{(1+\lambda c_{r,\boldsymbol{q}})^{2}}
&\leq C_{1}(d)\int_{\mathbb{R}^{d}}\frac{1}{(1+\lambda c_{r,\boldsymbol{q}})^{2}}\,{\rm d}\boldsymbol{q}\\
&\leq C_{1}(d)\int_{\mathbb{R}^{d}}\frac{1}{(1+\lambda\|\boldsymbol{q}\|^{2r})^{2}}\,{\rm d}\boldsymbol{q}\\
&\leq C_{2}(d)\int_{0}^{\infty}\frac{t^{d-1}}{(1+\lambda t^{2r})^{2}}\,{\rm d}t\\
&\leq C_{2}(d)\int_{0}^{\lambda^{-1/2r}}t^{d-1}\,{\rm d}t+C_{2}(d)\lambda^{-2}\int_{\lambda^{-1/2r}}^{\infty}t^{d-1-4r}\,{\rm d}t\,.
\end{align*}

For convergence we require $r>d/4$, in which case

$$
\sum_{\boldsymbol{q}\in\mathcal{Q}_{d}}\frac{1}{(1+\lambda c_{r,\boldsymbol{q}})^{2}}\leq C_{4}(d)\lambda^{-d/2r}\,. \tag{57}
$$
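
The $\lambda^{-d/2r}$ scaling in (57) can be sanity-checked numerically on the radial integral $\int_{0}^{\infty}t^{d-1}(1+\lambda t^{2r})^{-2}\,{\rm d}t$. The sketch below is illustrative only: the function name `radial_integral`, the midpoint quadrature, and the change of variables $t=x/(1-x)$ are our choices, not part of the proof. For $d=2$, $r=1$ the integral equals $1/(2\lambda)$ exactly, so the rescaled values should stay close to $1/2$.

```python
def radial_integral(lam, d, r, n_steps=100_000):
    """Midpoint-rule estimate of int_0^inf t^(d-1) / (1 + lam * t^(2r))^2 dt.

    The half-line is mapped to (0, 1) via t = x / (1 - x).
    Convergence at infinity needs r > d/4, as in the text.
    """
    h = 1.0 / n_steps
    total = 0.0
    for k in range(n_steps):
        x = (k + 0.5) * h                 # midpoint of the k-th cell
        t = x / (1.0 - x)
        jacobian = 1.0 / (1.0 - x) ** 2   # dt/dx
        total += t ** (d - 1) / (1.0 + lam * t ** (2 * r)) ** 2 * jacobian
    return total * h


if __name__ == "__main__":
    d, r = 2, 1
    for lam in (1e-1, 1e-2, 1e-3):
        # Multiplying by lam^(d/2r) should remove the dependence on lam.
        rescaled = radial_integral(lam, d, r) * lam ** (d / (2 * r))
        print(f"lambda={lam:g}  integral * lambda^(d/2r) = {rescaled:.4f}")
```

The exact scaling follows from the substitution $t=\lambda^{-1/2r}u$, which factors $\lambda^{-d/2r}$ out of the integral; the code only confirms this for specific parameter values.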

Appendix E Analysis of high-dimensional regression: Proofs for Section 3.4

E.1 Auxiliary definition for Theorem 3

Our characterization is given in terms of a variational principle. For $\delta,\delta_{s}\in(0,\infty)$, define $\mathscr{R}:\mathbb{R}_{\geq 0}^{3}\to\mathbb{R}$ via

$$
\begin{aligned}
\mathscr{R}(\xi,\xi_{\perp},\omega,\alpha,r,r_{s},\delta,\delta_{s},\lambda,\gamma,\sigma,\sigma_{s}):={}&-\omega\sqrt{\rho^{2}+\rho_{s}^{2}}+\rho\sqrt{\delta(\tau^{2}+\sigma^{2})}+\rho_{s}\sqrt{\delta_{s}(\tau_{s}^{2}+\sigma_{s}^{2})}\\
&-\frac{\delta\rho^{2}}{2(1-\alpha)}-\frac{\delta_{s}\rho_{s}^{2}}{2\alpha}+\frac{\lambda}{2}\big(\xi^{2}+\xi_{\perp}^{2}+\omega^{2}\big)\,,
\end{aligned}
\tag{58}
$$

where $\tau,\tau_{s}$ are defined by

\begin{align}
\tau^{2}&:=(\xi-r)^{2}+\xi_{\perp}^{2}+\omega^{2}\,, \tag{59}\\
\tau_{s}^{2}&:=(\xi-r_{s}\cos\gamma)^{2}+(\xi_{\perp}-r_{s}\sin\gamma)^{2}+\omega^{2}\,, \tag{60}
\end{align}

and $\rho=\overline{\rho}/\sqrt{1+t^{2}}$, $\rho_{s}=\overline{\rho}\,t/\sqrt{1+t^{2}}$, with $\overline{\rho}$ solving the polynomial equation

$$
\overline{\rho}^{2}=\frac{\delta(\tau^{2}+\sigma^{2})}{\big(\delta/(1-\alpha)+\omega/\overline{\rho}\big)^{2}}+\frac{\delta_{s}(\tau_{s}^{2}+\sigma_{s}^{2})}{\big(\delta_{s}/\alpha+\omega/\overline{\rho}\big)^{2}}\,, \tag{61}
$$

and $t$ is given by

$$
t=\frac{\omega+\delta\overline{\rho}/(1-\alpha)}{\omega+\delta_{s}\overline{\rho}/\alpha}\cdot\sqrt{\frac{\delta_{s}(\tau_{s}^{2}+\sigma_{s}^{2})}{\delta(\tau^{2}+\sigma^{2})}}\,. \tag{62}
$$

Theorem 3 states that the asymptotics of the test error are determined by the minimizer of $\mathscr{R}$.
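
To make the definition concrete, the following sketch evaluates the objective of Eq. (58) at a given $(\xi,\xi_{\perp},\omega)$ by solving Eq. (61) for $\overline{\rho}$ with bisection and then applying Eq. (62). The function name `variational_objective`, the shorthand variables `A`, `B`, `a`, `b`, and the use of bisection are illustrative assumptions rather than the authors' implementation; the sketch also assumes $\alpha\in(0,1)$ and $\sigma,\sigma_{s}>0$.

```python
import math


def variational_objective(xi, xi_perp, omega, *, alpha, r, r_s, delta, delta_s,
                          lam, gamma, sigma, sigma_s):
    """Evaluate Eq. (58) at (xi, xi_perp, omega); returns (value, rho, rho_s)."""
    tau2 = (xi - r) ** 2 + xi_perp ** 2 + omega ** 2                   # Eq. (59)
    tau_s2 = ((xi - r_s * math.cos(gamma)) ** 2
              + (xi_perp - r_s * math.sin(gamma)) ** 2 + omega ** 2)   # Eq. (60)
    A = delta * (tau2 + sigma ** 2)          # shorthand, not paper notation
    B = delta_s * (tau_s2 + sigma_s ** 2)
    a = delta / (1.0 - alpha)
    b = delta_s / alpha

    # For rho_bar > 0, Eq. (61) is equivalent to gap(rho_bar) = 0.
    # gap is strictly decreasing in rho_bar, so a positive root is unique.
    def gap(rho_bar):
        return (A / (a * rho_bar + omega) ** 2
                + B / (b * rho_bar + omega) ** 2 - 1.0)

    ridge = 0.5 * lam * (xi ** 2 + xi_perp ** 2 + omega ** 2)
    if omega > 0 and A + B <= omega ** 2:
        # No positive root. This cannot occur when delta + delta_s > 1,
        # because A + B >= (delta + delta_s) * omega^2. Sketch-only fallback.
        return ridge, 0.0, 0.0

    lo, hi = 0.0, 1.0
    while gap(hi) > 0:       # expand until the root is bracketed
        hi *= 2.0
    for _ in range(200):     # plain bisection
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    rho_bar = 0.5 * (lo + hi)

    t = (omega + a * rho_bar) / (omega + b * rho_bar) * math.sqrt(B / A)  # Eq. (62)
    rho = rho_bar / math.sqrt(1.0 + t ** 2)
    rho_s = rho_bar * t / math.sqrt(1.0 + t ** 2)
    value = (-omega * math.sqrt(rho ** 2 + rho_s ** 2)
             + rho * math.sqrt(A) + rho_s * math.sqrt(B)
             - 0.5 * a * rho ** 2 - 0.5 * b * rho_s ** 2 + ridge)         # Eq. (58)
    return value, rho, rho_s
```

A call such as `variational_objective(0.8, 0.1, 0.5, alpha=0.3, r=1.0, r_s=1.0, delta=2.0, delta_s=3.0, lam=0.1, gamma=0.4, sigma=0.5, sigma_s=0.5)` returns the objective value together with $(\rho,\rho_{s})$; minimizing the first output over $(\xi,\xi_{\perp},\omega)$ with any generic optimizer would then approximate the minimizer referred to above.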

E.2 Proof of Theorem 3

The proof is based on Gordon's Gaussian comparison inequality [Gor85, Ver18] and follows a standard route; see e.g. [TOH15, TAH18, MM21]. We limit ourselves to outlining the main steps of the calculation. Throughout, we consider the case $\varepsilon_{0}>0$, $\delta+\delta_{s}>1$, because the other case ($\varepsilon_{0}=0$ and $\delta,\delta_{s}>1$) is analogous and less interesting.

We begin by rewriting the ridge cost function in terms of a Lagrangian:

\begin{align}
\widehat{R}_{n,m}(\boldsymbol{\theta};\alpha)={}&\max_{\boldsymbol{u}\in\mathbb{R}^{n}}\max_{\boldsymbol{u}^{s}\in\mathbb{R}^{m}}\widehat{L}_{n,m}(\boldsymbol{\theta},\boldsymbol{u},\boldsymbol{u}^{s};\alpha)\,, \tag{63}\\
\widehat{L}_{n,m}(\boldsymbol{\theta},\boldsymbol{u},\boldsymbol{u}^{s};\alpha):={}&\langle\boldsymbol{u},\boldsymbol{X}(\boldsymbol{\theta}-\boldsymbol{\theta}_{*})\rangle+\langle\boldsymbol{u}^{s},\boldsymbol{X}^{s}(\boldsymbol{\theta}-\boldsymbol{\theta}_{*,s})\rangle-\langle\boldsymbol{u},\boldsymbol{\varepsilon}\rangle-\langle\boldsymbol{u}^{s},\boldsymbol{\varepsilon}^{s}\rangle \tag{64}\\
&-\frac{n\|\boldsymbol{u}\|_{2}^{2}}{2(1-\alpha)}-\frac{m\|\boldsymbol{u}^{s}\|_{2}^{2}}{2\alpha}+\frac{\lambda}{2}\,\|\boldsymbol{\theta}\|_{2}^{2}\,. \notag
\end{align}
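
To see why (63) and (64) reproduce a weighted ridge objective, the maximization over $\boldsymbol{u}$ and $\boldsymbol{u}^{s}$ can be carried out in closed form. The sketch below assumes the linear models $\boldsymbol{y}=\boldsymbol{X}\boldsymbol{\theta}_{*}+\boldsymbol{\varepsilon}$ and $\boldsymbol{y}^{s}=\boldsymbol{X}^{s}\boldsymbol{\theta}_{*,s}+\boldsymbol{\varepsilon}^{s}$; the names $\boldsymbol{y}$, $\boldsymbol{y}^{s}$, $\boldsymbol{w}$ are introduced here for illustration, and the resulting normalization should be checked against the definition of $\widehat{R}_{n,m}$ in the main text.

```latex
\begin{align*}
\max_{\boldsymbol{u}\in\mathbb{R}^{n}}\Big\{\langle\boldsymbol{u},\boldsymbol{w}\rangle-\frac{n\|\boldsymbol{u}\|_{2}^{2}}{2(1-\alpha)}\Big\}
  &=\frac{1-\alpha}{2n}\,\|\boldsymbol{w}\|_{2}^{2}\,,
  \qquad\text{attained at }\boldsymbol{u}=\frac{1-\alpha}{n}\,\boldsymbol{w}\,,\\
\boldsymbol{w}=\boldsymbol{X}(\boldsymbol{\theta}-\boldsymbol{\theta}_{*})-\boldsymbol{\varepsilon}
  &=\boldsymbol{X}\boldsymbol{\theta}-\boldsymbol{y}\,,\\
\widehat{R}_{n,m}(\boldsymbol{\theta};\alpha)
  &=\frac{1-\alpha}{2n}\,\|\boldsymbol{y}-\boldsymbol{X}\boldsymbol{\theta}\|_{2}^{2}
   +\frac{\alpha}{2m}\,\|\boldsymbol{y}^{s}-\boldsymbol{X}^{s}\boldsymbol{\theta}\|_{2}^{2}
   +\frac{\lambda}{2}\,\|\boldsymbol{\theta}\|_{2}^{2}\,.
\end{align*}
```

The same computation with $n/(1-\alpha)$ replaced by $m/\alpha$ gives the surrogate term, so the weights $1-\alpha$ and $\alpha$ enter only through the quadratic penalties on $\boldsymbol{u}$ and $\boldsymbol{u}^{s}$.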

Let $\Delta(\boldsymbol{\theta},\boldsymbol{u},\boldsymbol{u}^{s}):=\|\boldsymbol{u}\|_{2}\|\boldsymbol{\theta}-\boldsymbol{\theta}_{*}\|_{2}\,G+\|\boldsymbol{u}^{s}\|_{2}\|\boldsymbol{\theta}-\boldsymbol{\theta}_{*,s}\|_{2}\,G_{s}$, where $G,G_{s}$ are independent standard normal random variables, independent of $\boldsymbol{X},\boldsymbol{X}^{s}$. By Gordon's inequality [Gor85], we can compare the Gaussian process $\widehat{L}_{n,m}(\boldsymbol{\theta},\boldsymbol{u},\boldsymbol{u}^{s};\alpha)+\Delta(\boldsymbol{\theta},\boldsymbol{u},\boldsymbol{u}^{s})$ to

\begin{align}
\widehat{L}^{G}_{n,m}(\boldsymbol{\theta},\boldsymbol{u},\boldsymbol{u}^{s};\alpha):={}&\|\boldsymbol{u}\|\langle\boldsymbol{g},\boldsymbol{\theta}-\boldsymbol{\theta}_{*}\rangle+\|\boldsymbol{\theta}-\boldsymbol{\theta}_{*}\|\langle\boldsymbol{h},\boldsymbol{u}\rangle+\|\boldsymbol{u}^{s}\|\langle\boldsymbol{g}^{s},\boldsymbol{\theta}-\boldsymbol{\theta}_{*,s}\rangle+\|\boldsymbol{\theta}-\boldsymbol{\theta}_{*,s}\|\langle\boldsymbol{h}^{s},\boldsymbol{u}^{s}\rangle \tag{65}\\
&-\langle\boldsymbol{u},\boldsymbol{\varepsilon}\rangle-\langle\boldsymbol{u}^{s},\boldsymbol{\varepsilon}^{s}\rangle-\frac{n\|\boldsymbol{u}\|_{2}^{2}}{2(1-\alpha)}-\frac{m\|\boldsymbol{u}^{s}\|_{2}^{2}}{2\alpha}+\frac{\lambda}{2}\,\|\boldsymbol{\theta}\|_{2}^{2}\,. \notag
\end{align}

Next we define the orthonormal vectors

$$
\boldsymbol{v}_{*}:=\frac{\boldsymbol{\theta}_{*}}{\|\boldsymbol{\theta}_{*}\|_{2}}\,,\qquad\boldsymbol{v}^{\perp}_{*}:=\frac{\boldsymbol{P}_{\boldsymbol{\theta}_{*}}^{\perp}\boldsymbol{\theta}_{*,s}}{\|\boldsymbol{P}_{\boldsymbol{\theta}_{*}}^{\perp}\boldsymbol{\theta}_{*,s}\|_{2}}\,, \tag{66}
$$

where $\boldsymbol{P}_{\boldsymbol{\theta}_{*}}^{\perp}=\boldsymbol{I}-\boldsymbol{P}_{\boldsymbol{\theta}_{*}}:=\boldsymbol{I}-\boldsymbol{v}_{*}\boldsymbol{v}_{*}^{\sf T}$ is the projector orthogonal to $\boldsymbol{\theta}_{*}$. We then decompose

$$
\boldsymbol{\theta}=\xi\,\boldsymbol{v}_{*}+\xi_{\perp}\,\boldsymbol{v}_{*}^{\perp}+\boldsymbol{\theta}^{\perp}\,, \tag{67}
$$

where $\langle\boldsymbol{v}_{*},\boldsymbol{\theta}^{\perp}\rangle=\langle\boldsymbol{v}^{\perp}_{*},\boldsymbol{\theta}^{\perp}\rangle=0$, and define $\omega:=\|\boldsymbol{\theta}^{\perp}\|_{2}$. Defining $\tau^{2}=\|\boldsymbol{\theta}-\boldsymbol{\theta}_{*}\|_{2}^{2}$ and $\tau^{2}_{s}=\|\boldsymbol{\theta}-\boldsymbol{\theta}_{*,s}\|_{2}^{2}$, Eqs. (59) and (60) follow.
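
The decomposition (67) and the resulting expressions (59) and (60) can be checked on random vectors. The sketch below is illustrative: the helper names are ours, and it assumes that $r=\|\boldsymbol{\theta}_{*}\|_{2}$, $r_{s}=\|\boldsymbol{\theta}_{*,s}\|_{2}$, and that $\gamma\in[0,\pi]$ is the angle between $\boldsymbol{\theta}_{*}$ and $\boldsymbol{\theta}_{*,s}$, as we presume is fixed in the statement of Theorem 3.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def decompose(theta, theta_star, theta_star_s):
    """Return (xi, xi_perp, omega) of Eq. (67).

    Requires theta_star_s not to be parallel to theta_star.
    """
    v = [x / norm(theta_star) for x in theta_star]                    # v_*
    overlap = dot(v, theta_star_s)
    residual = [y - overlap * x for x, y in zip(v, theta_star_s)]     # P_perp theta_{*,s}
    v_perp = [x / norm(residual) for x in residual]                   # v_*^perp
    xi, xi_perp = dot(theta, v), dot(theta, v_perp)
    theta_perp = [t - xi * x - xi_perp * y for t, x, y in zip(theta, v, v_perp)]
    return xi, xi_perp, norm(theta_perp)


def tau_squared_both_ways(theta, theta_star, theta_star_s):
    """Compute (tau^2, tau_s^2) directly and via Eqs. (59)-(60)."""
    xi, xi_perp, omega = decompose(theta, theta_star, theta_star_s)
    r, r_s = norm(theta_star), norm(theta_star_s)
    cos_g = dot(theta_star, theta_star_s) / (r * r_s)
    sin_g = math.sqrt(max(0.0, 1.0 - cos_g ** 2))   # gamma in [0, pi]
    direct = (sum((a - b) ** 2 for a, b in zip(theta, theta_star)),
              sum((a - b) ** 2 for a, b in zip(theta, theta_star_s)))
    formula = ((xi - r) ** 2 + xi_perp ** 2 + omega ** 2,
               (xi - r_s * cos_g) ** 2 + (xi_perp - r_s * sin_g) ** 2 + omega ** 2)
    return direct, formula


if __name__ == "__main__":
    rng = random.Random(0)
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(3)]
    direct, formula = tau_squared_both_ways(*vectors)
    print("direct :", direct)
    print("formula:", formula)
```

The two printed pairs should agree up to floating-point error, which reflects that $\boldsymbol{\theta}_{*}=r\,\boldsymbol{v}_{*}$ and $\boldsymbol{\theta}_{*,s}=r_{s}\cos\gamma\,\boldsymbol{v}_{*}+r_{s}\sin\gamma\,\boldsymbol{v}_{*}^{\perp}$ in the orthonormal frame of (66).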

With these notations, and letting $\hat{\sigma}^{2}=\|\tau\boldsymbol{h}+\boldsymbol{\varepsilon}\|^{2}_{2}/n-\tau^{2}$, $\hat{\sigma}_{s}^{2}=\|\tau_{s}\boldsymbol{h}^{s}+\boldsymbol{\varepsilon}^{s}\|^{2}_{2}/m-\tau_{s}^{2}$, we get
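The centering by $\tau^{2}$ in the definition of $\hat{\sigma}^{2}$ can be illustrated numerically. The following is a minimal sketch, assuming that $\boldsymbol{h}$ has i.i.d. standard normal entries and $\boldsymbol{\varepsilon}$ has i.i.d. $N(0,\sigma^{2})$ entries, independent of $\boldsymbol{h}$; the values of `n`, `tau` and `sigma` are arbitrary choices made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values (assumptions, not taken from the text).
n = 200_000
tau, sigma = 1.5, 0.7

# Assumed distributions: h ~ N(0, I_n), eps ~ N(0, sigma^2 I_n), independent.
h = rng.standard_normal(n)
eps = sigma * rng.standard_normal(n)

# sigma_hat^2 = ||tau * h + eps||_2^2 / n - tau^2, as defined above.
sigma_hat_sq = np.sum((tau * h + eps) ** 2) / n - tau**2

print(f"sigma_hat^2 = {sigma_hat_sq:.4f}, sigma^2 = {sigma**2:.4f}")
```

Under these assumptions $\|\tau\boldsymbol{h}+\boldsymbol{\varepsilon}\|_2^2/n$ has mean $\tau^{2}+\sigma^{2}$, so subtracting $\tau^{2}$ leaves a quantity that concentrates around $\sigma^{2}$ as $n$ grows.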

$$\widehat{\mathscr{L}}^{G}_{n,m}(\boldsymbol{\theta},\rho,\rho_{s};\alpha):=\max_{\boldsymbol{u},\boldsymbol{u}^{s}}\Big\{\widehat{L}^{G}_{n,m}(\boldsymbol{\theta},\boldsymbol{u},\boldsymbol{u}^{s};\alpha):\;\|\boldsymbol{u}\|=\frac{\rho}{\sqrt{d}},\,\|\boldsymbol{u}^{s}\|=\frac{\rho_{s}}{\sqrt{d}}\Big\}\,, \tag{68}$$
$$\begin{aligned}
\widehat{\mathscr{L}}^{G}_{n,m}(\boldsymbol{\theta},\rho,\rho_{s};\alpha)={}&\frac{\rho}{\sqrt{d}}\langle\boldsymbol{g},\boldsymbol{\theta}-\boldsymbol{\theta}_{*}\rangle+\frac{\rho_{s}}{\sqrt{d}}\langle\boldsymbol{g}^{s},\boldsymbol{\theta}-\boldsymbol{\theta}_{*,s}\rangle+\rho\sqrt{\delta(\tau^{2}+\hat{\sigma}^{2})}+\rho_{s}\sqrt{\delta_{s}(\tau_{s}^{2}+\hat{\sigma}_{s}^{2})}\\
&-\frac{\delta\rho^{2}}{2(1-\alpha)}-\frac{\delta_{s}\rho_{s}^{2}}{2\alpha}+\frac{\lambda}{2}\,\big(\xi^{2}+\xi_{\perp}^{2}+\omega^{2}\big)\,.
\end{aligned} \tag{69}$$

We finally decompose $\boldsymbol{g}=\boldsymbol{g}_{\parallel}+\boldsymbol{g}_{\perp}$, where $\boldsymbol{g}_{\parallel}\in{\sf span}(\boldsymbol{v}_{*},\boldsymbol{v}_{*}^{\perp})$ and $\boldsymbol{g}_{\perp}\perp{\sf span}(\boldsymbol{v}_{*},\boldsymbol{v}_{*}^{\perp})$, and similarly for $\boldsymbol{g}_{s}$, and define

$$\mathscr{L}^{G}_{n,m}(\xi,\xi_{\perp},\omega,\rho,\rho_{s};\alpha):=\min_{\boldsymbol{\theta}}\Big\{\widehat{\mathscr{L}}^{G}_{n,m}(\boldsymbol{\theta},\rho,\rho_{s};\alpha):\;\boldsymbol{\theta}=\xi\,\boldsymbol{v}_{*}+\xi_{\perp}\,\boldsymbol{v}_{*}^{\perp}+\boldsymbol{\theta}^{\perp}\,,\;\;\|\boldsymbol{\theta}^{\perp}\|=\omega\Big\}\,. \tag{70}$$

Defining $\iota$ via $\|\rho\,\boldsymbol{g}_{\perp}/\sqrt{n}+\rho_{s}\,\boldsymbol{g}_{s,\perp}/\sqrt{m}\|=(1+\iota)\sqrt{\rho^{2}+\rho_{s}^{2}}$, we obtain

$$\begin{aligned}
\mathscr{L}^{G}_{n,m}(\xi,\xi_{\perp},\omega,\rho,\rho_{s};\alpha)={}&-(1+\iota)\sqrt{\rho^{2}+\rho_{s}^{2}}\cdot\omega+\Delta+\rho\sqrt{\delta(\tau^{2}+\hat{\sigma}^{2})}+\rho_{s}\sqrt{\delta_{s}(\tau_{s}^{2}+\hat{\sigma}_{s}^{2})}\\
&-\frac{\delta\rho^{2}}{2(1-\alpha)}-\frac{\delta_{s}\rho_{s}^{2}}{2\alpha}+\frac{\lambda}{2}\,\big(\xi^{2}+\xi_{\perp}^{2}+\omega^{2}\big)\,,
\end{aligned} \tag{71}$$

where $\Delta$ is the contribution of the perpendicular components. Simple concentration estimates imply that for any $\varepsilon>0$ there exists $c(\varepsilon)>0$ such that

$$\mathbb{P}\big(|\hat{\sigma}-\sigma|\le\varepsilon\sqrt{\tau^{2}+\sigma^{2}},\;|\hat{\sigma}_{s}-\sigma_{s}|\le\varepsilon\sqrt{\tau_{s}^{2}+\sigma_{s}^{2}}\big)\ge 1-e^{-c(\varepsilon)n}\,, \tag{72}$$
$$\mathbb{P}\big(|\Delta|\le\sqrt{(\rho^{2}+\rho_{s}^{2})(\xi^{2}+\xi_{\perp}^{2})}\big)\ge 1-e^{-c(\varepsilon)n}\,, \tag{73}$$
$$\mathbb{P}\big(|\iota|\le\varepsilon\big)\ge 1-e^{-c(\varepsilon)n}\,. \tag{74}$$

We can then estimate $\mathscr{L}^{G}_{n,m}(\xi,\xi_{\perp},\omega,\rho,\rho_{s};\alpha)$ by

$$\begin{aligned}
\mathscr{L}^{G}(\xi,\xi_{\perp},\omega,\rho,\rho_{s};\alpha)={}&-\sqrt{\rho^{2}+\rho_{s}^{2}}\cdot\omega+\rho\sqrt{\delta(\tau^{2}+\sigma^{2})}+\rho_{s}\sqrt{\delta_{s}(\tau_{s}^{2}+\sigma_{s}^{2})}\\
&-\frac{\delta\rho^{2}}{2(1-\alpha)}-\frac{\delta_{s}\rho_{s}^{2}}{2\alpha}+\frac{\lambda}{2}\,\big(\xi^{2}+\xi_{\perp}^{2}+\omega^{2}\big)\,.
\end{aligned} \tag{75}$$

Differentiating with respect to $\rho$ and $\rho_{s}$ and setting the derivatives to $0$ yields $\rho=\overline{\rho}/\sqrt{1+t^{2}}$, $\rho_{s}=\overline{\rho}\,t/\sqrt{1+t^{2}}$, with $\overline{\rho},t$ given by Eqs. (61), (62). By computing second derivatives, one obtains that this is a local maximum. Since $\mathscr{L}^{G}(\xi,\xi_{\perp},\omega,\rho,\rho_{s};\alpha)\to-\infty$ as $\rho^{2}+\rho_{s}^{2}\to\infty$, the maximum over $\rho,\rho_{s}$ is achieved either at this point or on the boundary $\{\rho=0\}\cup\{\rho_{s}=0\}$. By checking the signs of the partial derivatives along this boundary, the only other possibility is $\rho=\rho_{s}=0$.

For economy of notation, write $F(\rho,\rho_{s}):=\mathscr{L}^{G}(\xi,\xi_{\perp},\omega,\rho,\rho_{s};\alpha)$. For any unit vector $\boldsymbol{v}=(v_{1},v_{2})\ge 0$, the directional derivative is

$$\begin{aligned}
\nabla_{\boldsymbol{v}}F(\boldsymbol{r})\big|_{\boldsymbol{r}=0}&=-\omega+v_{1}\sqrt{\delta(\tau^{2}+\sigma^{2})}+v_{2}\sqrt{\delta_{s}(\tau_{s}^{2}+\sigma_{s}^{2})}\\
&\ge\omega\big[-1+v_{1}\sqrt{\delta}+v_{2}\sqrt{\delta_{s}}\big]\,.
\end{aligned}$$

By maximizing over the direction, we see that $\boldsymbol{v}$ can be chosen so that $\nabla_{\boldsymbol{v}}F(\boldsymbol{0})\ge\omega\big[-1+\sqrt{\delta+\delta_{s}}\big]$. Hence $\rho=\rho_{s}=0$ cannot be the global maximum for $\delta+\delta_{s}>1$.
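The maximization over the direction is a one-line Cauchy–Schwarz computation, spelled out here as a sketch. The symbol $\boldsymbol{v}^{\rm opt}$ is notation introduced only for this illustration.

```latex
\max_{\boldsymbol{v}\ge 0,\;\|\boldsymbol{v}\|_2=1}
\big(v_1\sqrt{\delta}+v_2\sqrt{\delta_s}\big)
=\big\|(\sqrt{\delta},\sqrt{\delta_s})\big\|_2
=\sqrt{\delta+\delta_s}\,,
\qquad
\boldsymbol{v}^{\rm opt}
=\frac{(\sqrt{\delta},\sqrt{\delta_s})}{\sqrt{\delta+\delta_s}}\,.
```

With this choice the lower bound on the directional derivative equals $\omega\big(\sqrt{\delta+\delta_s}-1\big)$, which is strictly positive when $\delta+\delta_s>1$ and $\omega>0$, so $F$ increases when moving away from the origin along $\boldsymbol{v}^{\rm opt}$.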

Hence, we get

$$\mathscr{R}(\xi,\xi_{\perp},\omega)=\max_{\rho,\rho_{s}\ge 0}\mathscr{L}^{G}(\xi,\xi_{\perp},\omega,\rho,\rho_{s};\alpha)\,. \tag{76}$$
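The structure of this maximization can be checked numerically. The sketch below evaluates the $(\rho,\rho_{s})$-dependent part of Eq. (75) on a grid and locates its maximizer by brute force; it does not use the closed form of Eqs. (61), (62). All parameter values are hypothetical, chosen only so that $\delta+\delta_{s}>1$, $\tau^{2}\ge\omega^{2}$ and $\tau_{s}^{2}\ge\omega^{2}$; the term $\frac{\lambda}{2}(\xi^{2}+\xi_{\perp}^{2}+\omega^{2})$ is dropped because it does not depend on $(\rho,\rho_{s})$.

```python
import numpy as np

# Hypothetical parameter values (assumptions for illustration only).
delta, delta_s, alpha = 1.5, 2.0, 0.4
omega = 1.0
tau_sq, sigma_sq = 1.2, 0.5
tau_s_sq, sigma_s_sq = 2.0, 0.3

# Coefficients of the terms linear in rho and rho_s in Eq. (75).
a = np.sqrt(delta * (tau_sq + sigma_sq))
b = np.sqrt(delta_s * (tau_s_sq + sigma_s_sq))


def F(rho, rho_s):
    """(rho, rho_s)-dependent part of Eq. (75); the lambda term is omitted."""
    return (
        -np.sqrt(rho**2 + rho_s**2) * omega
        + rho * a
        + rho_s * b
        - delta * rho**2 / (2 * (1 - alpha))
        - delta_s * rho_s**2 / (2 * alpha)
    )


# Brute-force maximization over a grid on [0, 2]^2.
grid = np.linspace(0.0, 2.0, 401)
R, Rs = np.meshgrid(grid, grid, indexing="ij")
values = F(R, Rs)
i, j = np.unravel_index(np.argmax(values), values.shape)
rho_hat, rho_s_hat, F_max = grid[i], grid[j], values[i, j]

# Best directional derivative at the origin over unit vectors v >= 0.
slope_at_origin = -omega + np.sqrt(a**2 + b**2)

print(f"argmax = ({rho_hat:.3f}, {rho_s_hat:.3f}), F_max = {F_max:.4f}")
print(f"slope at origin = {slope_at_origin:.4f}")
```

For these values the grid maximizer lies in the interior of the quadrant and the maximum exceeds $F(0,0)=0$, in line with the argument that the origin is not the global maximum when $\delta+\delta_{s}>1$.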

We further note that, for fixed $\rho,\rho_{s}>0$, the function $(\xi,\xi_{\perp},\omega)\mapsto\mathscr{L}^{G}(\xi,\xi_{\perp},\omega,\rho,\rho_{s};\alpha)$ is jointly strictly convex for $\lambda>0$. Hence $(\xi,\xi_{\perp},\omega)\mapsto\mathscr{R}(\xi,\xi_{\perp},\omega)$ is also strictly convex for $\lambda>0$. Therefore, it has a unique minimizer, which we denote by $(\xi^{*},\xi_{\perp}^{*},\omega^{*})$. Proceeding as in [MM21], we obtain the following result.

Proposition E.1.

Under the assumptions of Proposition 3, for any $\varepsilon,\varepsilon_{0}>0$ there exists $c=c(\varepsilon,\varepsilon_{0})>0$ such that, if $\alpha\in[\varepsilon_{0},1-\varepsilon_{0}]$ (letting $\boldsymbol{P}^{\perp}:=\boldsymbol{I}-\boldsymbol{v}_{*}\boldsymbol{v}_{*}^{\mathsf{T}}-\boldsymbol{v}^{\perp}_{*}(\boldsymbol{v}_{*}^{\perp})^{\mathsf{T}}$)

$$\mathbb{P}\Big\{\big|\langle\boldsymbol{v}_{*},\hat{\boldsymbol{\theta}}_{n,m}\rangle-\xi^{*}\big|\le\varepsilon,\;\big|\langle\boldsymbol{v}^{\perp}_{*},\hat{\boldsymbol{\theta}}_{n,m}\rangle-\xi^{*}_{\perp}\big|\le\varepsilon,\;\big|\|\boldsymbol{P}^{\perp}\hat{\boldsymbol{\theta}}_{n,m}\|-\omega^{*}\big|\le\varepsilon\Big\}\ge 1-2\,e^{-cn}\,. \tag{77}$$

In particular, the last proposition implies (a weaker form of) Theorem 3 whereby the supremum is taken over a finite net. Namely, for $\eta>0$, we define

$$N(\varepsilon_{0},\eta):=[\varepsilon_{0},1-\varepsilon_{0}]\cap\eta\,\mathbb{Z}\,.$$

Recalling that, in the present case, $R(\hat{\boldsymbol{\theta}})=\|\hat{\boldsymbol{\theta}}-\boldsymbol{\theta}\|^{2}_{2}$, we therefore obtain (after adjusting the constant $c$):

$$\mathbb{P}\Big(\max_{\alpha\in N(\varepsilon_{0},\eta)}\big|R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))-\mathscr{R}^{*}(\alpha)\big|\le\varepsilon\Big)\ge 1-2\,e^{-cn}\,. \tag{78}$$

Finally, let $\boldsymbol{X}_{+}\in\mathbb{R}^{(m+n)\times d}$ be the matrix obtained by stacking $\boldsymbol{X}$ and $\boldsymbol{X}_{s}$. Given constants $C_{1},C_{2},C_{3}$, define the good event

$$\mathcal{G}:=\Big\{C_{1}n\le\lambda_{\min}(\boldsymbol{X}_{+}^{\mathsf{T}}\boldsymbol{X}_{+})\le\lambda_{\max}(\boldsymbol{X}_{+}^{\mathsf{T}}\boldsymbol{X}_{+})\le C_{2}n;\;\|\boldsymbol{X}^{\mathsf{T}}\boldsymbol{y}\|\le C_{3}n\,,\;\|\boldsymbol{X}_{s}^{\mathsf{T}}\boldsymbol{y}_{s}\|\le C_{3}n\Big\}\,. \tag{79}$$

By a standard bound on eigenvalues of Wishart matrices [Ver18], for $\delta+\delta_{s}>1$, we can choose $C_{1},C_{2},C_{3}$ such that

$$\mathbb{P}(\mathcal{G})\ge 1-2e^{-cn}\,. \tag{80}$$
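The eigenvalue part of the good event can be illustrated by simulation. The following is a minimal sketch, assuming that the covariates have i.i.d. standard normal entries and that $\delta=n/d$, $\delta_{s}=m/d$ (so that $\delta+\delta_{s}>1$ means $n+m>d$); the dimensions and the reference values `lower` and `upper` are choices made for this illustration, not the constants $C_{1},C_{2}$ of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions with (n + m) / d = 3 > 1 (assumption).
n, m, d = 300, 300, 200

# Assumed design: i.i.d. standard normal entries for both data sources.
X = rng.standard_normal((n, d))
X_s = rng.standard_normal((m, d))
X_plus = np.vstack([X, X_s])  # stacked (n + m) x d matrix

# Eigenvalues of the d x d Gram matrix, in ascending order.
eigs = np.linalg.eigvalsh(X_plus.T @ X_plus)
lam_min, lam_max = eigs[0], eigs[-1]

# Classical reference values for Gaussian matrices:
# (sqrt(n + m) -/+ sqrt(d))^2 for the extreme eigenvalues.
lower = (np.sqrt(n + m) - np.sqrt(d)) ** 2
upper = (np.sqrt(n + m) + np.sqrt(d)) ** 2

print(f"lambda_min = {lam_min:.1f} (reference {lower:.1f})")
print(f"lambda_max = {lam_max:.1f} (reference {upper:.1f})")
```

When $n+m>d$ the smallest eigenvalue stays bounded away from zero at scale $n$, which is what makes the ridge estimator well conditioned uniformly in $\alpha$ on the good event.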

Further, on $\mathcal{G}$, $\hat{\boldsymbol{\theta}}_{n,m}(\alpha)$ is bounded (in $\ell_{2}$ norm) and Lipschitz continuous in $\alpha$. As a consequence, for a sufficiently large constant $L$,

$$\mathbb{P}\Big(\big|R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha_{1}))-R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha_{2}))\big|\le L|\alpha_{1}-\alpha_{2}|\;\;\forall\alpha_{1},\alpha_{2}\in[\varepsilon_{0},1-\varepsilon_{0}]\Big)\ge 1-2e^{-cn}\,. \tag{81}$$

The claim follows by using this estimate together with Eq. (78).
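For concreteness, this last step can be sketched as a three-term triangle inequality. The sketch assumes, in addition to Eqs. (78) and (81), that $\alpha\mapsto\mathscr{R}^{*}(\alpha)$ is Lipschitz on $[\varepsilon_{0},1-\varepsilon_{0}]$ with some constant $L_{*}$, a property not stated explicitly above. Here $\alpha'$ denotes a point of $N(\varepsilon_{0},\eta)$ with $|\alpha-\alpha'|\le\eta$, and both $L_{*}$ and $\alpha'$ are notation introduced only for this illustration.

```latex
\big|R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))-\mathscr{R}^{*}(\alpha)\big|
\le
\big|R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha))-R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha'))\big|
+\big|R(\hat{\boldsymbol{\theta}}_{n,m}(\alpha'))-\mathscr{R}^{*}(\alpha')\big|
+\big|\mathscr{R}^{*}(\alpha')-\mathscr{R}^{*}(\alpha)\big|
\le L\eta+\varepsilon+L_{*}\eta\,.
```

This holds for all $\alpha\in[\varepsilon_{0},1-\varepsilon_{0}]$ simultaneously on the event where the net bound of Eq. (78) and the Lipschitz bound of Eq. (81) both hold, which by a union bound has probability at least $1-4e^{-cn}$. Choosing $\eta\le\varepsilon/(L+L_{*})$ makes the right-hand side at most $2\varepsilon$.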