
Uncertainty quantification for iterative algorithms in linear models with application to early stopping

Abstract: This paper investigates the iterates $\hbb^1,\dots,\hbb^T$ obtained from iterative algorithms in high-dimensional linear regression problems, in the regime where the feature dimension $p$ is comparable with the sample size $n$, i.e., $p \asymp n$. The analysis and proposed estimators are applicable to Gradient Descent (GD), proximal GD and their accelerated variants such as Fast Iterative Soft-Thresholding (FISTA). The paper proposes novel estimators for the generalization error of the iterate $\hbb^t$ for any fixed iteration $t$ along the trajectory. These estimators are proved to be $\sqrt n$-consistent under Gaussian designs. Applications to early stopping are provided: when the generalization error of the iterates is a U-shaped function of the iteration $t$, the estimates allow one to select from the data an iteration $\hat t$ that achieves the smallest generalization error along the trajectory. Additionally, we provide a technique for developing debiasing corrections and valid confidence intervals for the components of the true coefficient vector from the iterate $\hbb^t$ at any finite iteration $t$. Extensive simulations on synthetic data illustrate the theoretical results.

Submitted 27 April, 2024; originally announced April 2024.

arXiv:2310.01374 [pdf, other]

Corrected generalized cross-validation for finite ensembles of penalized estimators

Authors: Pierre C. Bellec, Jin-Hong Du, Takuya Koriyama, Pratik Patil, Kai Tan

Abstract: Generalized cross-validation (GCV) is a widely used method for estimating the squared out-of-sample prediction risk that employs a scalar degrees of freedom adjustment (in a multiplicative sense) to the squared training error. In this paper, we examine the consistency of GCV for estimating the prediction risk of arbitrary ensembles of penalized least-squares estimators. We show that GCV is inconsistent for any finite ensemble of size greater than one. Towards repairing this shortcoming, we identify a correction that involves an additional scalar correction (in an additive sense) based on degrees of freedom adjusted training errors from each ensemble component. The proposed estimator (termed CGCV) maintains the computational advantages of GCV and requires neither sample splitting, model refitting, nor out-of-bag risk estimation. The estimator stems from a finer inspection of the ensemble risk decomposition and two intermediate risk estimators for the components in this decomposition. We provide a non-asymptotic analysis of the CGCV and the two intermediate risk estimators for ensembles of convex penalized estimators under Gaussian features and a linear response model. Furthermore, in the special case of ridge regression, we extend the analysis to general feature and response distributions using random matrix theory, which establishes model-free uniform consistency of CGCV.

Submitted 21 April, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

Comments: 91 pages, 34 figures; this version adds general proof outlines (in Sections 4.3 and 5.3), adds more experiments with non-Gaussian data (in Sections D and E), relaxes an assumption (in Section A.7), clarifies explanations in several places, and corrects minor typos in several places

arXiv:2305.17825 [pdf, other]

Multinomial Logistic Regression: Asymptotic Normality on Null Covariates in High-Dimensions

Authors: Kai Tan, Pierre C. Bellec

Abstract: This paper investigates the asymptotic distribution of the maximum-likelihood estimate (MLE) in multinomial logistic models in the high-dimensional regime where dimension and sample size are of the same order. While classical large-sample theory provides asymptotic normality of the MLE under certain conditions, such classical results are expected to fail in high dimensions as documented for the binary logistic case in the seminal work of Sur and Candès [2019]. We address this issue in classification problems with 3 or more classes, by developing asymptotic normality and asymptotic chi-square results for the multinomial logistic MLE (also known as cross-entropy minimizer) on null covariates. Our theory leads to a new methodology to test the significance of a given feature. Extensive simulation studies on synthetic data corroborate these asymptotic results and confirm the validity of proposed p-values for testing the significance of a given feature.

Submitted 28 May, 2023; originally announced May 2023.

arXiv:2206.07256 [pdf, other]

Noise Covariance Estimation in Multi-Task High-dimensional Linear Models

Authors: Kai Tan, Gabriel Romon, Pierre C Bellec

Abstract: This paper studies the multi-task high-dimensional linear regression models where the noise among different tasks is correlated, in the moderately high dimensional regime where sample size $n$ and dimension $p$ are of the same order. Our goal is to estimate the covariance matrix of the noise random vectors, or equivalently the correlation of the noise variables on any pair of two tasks. Treating the regression coefficients as a nuisance parameter, we leverage the multi-task elastic-net and multi-task lasso estimators to estimate the nuisance. By precisely understanding the bias of the squared residual matrix and by correcting this bias, we develop a novel estimator of the noise covariance that converges in Frobenius norm at the rate $n^{-1/2}$ when the covariates are Gaussian. This novel estimator is efficiently computable. Under suitable conditions, the proposed estimator of the noise covariance attains the same rate of convergence as the "oracle" estimator that knows in advance the regression coefficients of the multi-task model. The Frobenius error bounds obtained in this paper also illustrate the advantage of this new estimator compared to a method-of-moments estimator that does not attempt to estimate the nuisance. As a byproduct of our techniques, we obtain an estimate of the generalization error of the multi-task elastic-net and multi-task lasso estimators. Extensive simulation studies are carried out to illustrate the numerical performance of the proposed method.

Submitted 14 June, 2022; originally announced June 2022.

arXiv:2204.06990 [pdf, other]

Observable adjustments in single-index models for regularized M-estimators

Authors: Pierre C Bellec

Abstract: We consider observations $(X,y)$ from single index models with unknown link function, Gaussian covariates and a regularized M-estimator $\hatβ$ constructed from convex loss function and regularizer. In the regime where sample size $n$ and dimension $p$ are both increasing such that $p/n$ has a finite limit, the behavior of the empirical distribution of $\hatβ$ and the predicted values $X\hatβ$ has been previously characterized in a number of models: The empirical distributions are known to converge to proximal operators of the loss and penalty in a related Gaussian sequence model, which captures the interplay between ratio $p/n$, loss, regularization and the data generating process. This connection between $(\hatβ,X\hatβ)$ and the corresponding proximal operators requires solving fixed-point equations that typically involve unobservable quantities such as the prior distribution on the index or the link function. This paper develops a different theory to describe the empirical distribution of $\hatβ$ and $X\hatβ$: Approximations of $(\hatβ,X\hatβ)$ in terms of proximal operators are provided that only involve observable adjustments. These proposed observable adjustments are data-driven, e.g., do not require prior knowledge of the index or the link function. These new adjustments yield confidence intervals for individual components of the index, as well as estimators of the correlation of $\hatβ$ with the index.
The interplay between loss, regularization and the model is thus captured in a data-driven manner, without solving the fixed-point equations studied in previous works. The results apply to both strongly convex regularizers and unregularized M-estimation. Simulations are provided for the square and logistic loss in single index models including logistic regression and 1-bit compressed sensing with 20\% corrupted bits.

Submitted 3 January, 2024; v1 submitted 14 April, 2022; originally announced April 2022.

arXiv:2107.07828 [pdf, other]

Chi-square and normal inference in high-dimensional multi-task regression

Authors: Pierre C Bellec, Gabriel Romon

Abstract: The paper proposes chi-square and normal inference methodologies for the unknown coefficient matrix $B^*$ of size $p\times T$ in a Multi-Task (MT) linear model with $p$ covariates, $T$ tasks and $n$ observations under a row-sparse assumption on $B^*$. The row-sparsity $s$, dimension $p$ and number of tasks $T$ are allowed to grow with $n$. In the high-dimensional regime $p\ggg n$, in order to leverage row-sparsity, the MT Lasso is considered. We build upon the MT Lasso with a de-biasing scheme to correct for the bias induced by the penalty. This scheme requires the introduction of a new data-driven object, coined the interaction matrix, that captures effective correlations between noise vector and residuals on different tasks. This matrix is psd, of size $T\times T$ and can be computed efficiently. The interaction matrix lets us derive asymptotic normal and $χ^2_T$ results under Gaussian design and $\frac{sT+s\log(p/s)}{n}\to0$ which corresponds to consistency in Frobenius norm. These asymptotic distribution results yield valid confidence intervals for single entries of $B^*$ and valid confidence ellipsoids for single rows of $B^*$, for both known and unknown design covariance $Σ$.
While previous proposals in grouped-variables regression require row-sparsity $s\lesssim\sqrt n$ up to constants depending on $T$ and logarithmic factors in $n,p$, the de-biasing scheme using the interaction matrix provides confidence intervals and $χ^2_T$ confidence ellipsoids under the conditions ${\min(T^2,\log^8p)}/{n}\to 0$ and $$ \frac{sT+s\log(p/s)+\|Σ^{-1}e_j\|_0\log p}{n}\to0, \quad \frac{\min(s,\|Σ^{-1}e_j\|_0)}{\sqrt n} \sqrt{[T+\log(p/s)]\log p}\to 0, $$ allowing row-sparsity $s\ggg\sqrt n$ when $\|Σ^{-1}e_j\|_0 \sqrt T\lll \sqrt{n}$ up to logarithmic factors.

Submitted 16 July, 2021; originally announced July 2021.

arXiv:2107.05143 [pdf, other]

Derivatives and residual distribution of regularized M-estimators with application to adaptive tuning

Authors: Pierre C Bellec, Yiwei Shen

Abstract: This paper studies M-estimators with gradient-Lipschitz loss function regularized with convex penalty in linear models with Gaussian design matrix and arbitrary noise distribution. A practical example is the robust M-estimator constructed with the Huber loss and the Elastic-Net penalty when the noise distribution has heavy tails. Our main contributions are three-fold. (i) We provide general formulae for the derivatives of regularized M-estimators $\hatβ(y,X)$ where differentiation is taken with respect to both $y$ and $X$; this reveals a simple differentiability structure shared by all convex regularized M-estimators. (ii) Using these derivatives, we characterize the distribution of the residual $r_i = y_i-x_i^\top\hatβ$ in the intermediate high-dimensional regime where dimension and sample size are of the same order. (iii) Motivated by the distribution of the residuals, we propose a novel adaptive criterion to select tuning parameters of regularized M-estimators. The criterion approximates the out-of-sample error up to an additive constant independent of the estimator, so that minimizing the criterion provides a proxy for minimizing the out-of-sample error. The proposed adaptive criterion does not require the knowledge of the noise distribution or of the covariance of the design. Simulated data confirm the theoretical findings, regarding both the distribution of the residuals and the success of the criterion as a proxy of the out-of-sample error.
Finally, our results reveal new relationships between the derivatives of $\hatβ(y,X)$ and the effective degrees of freedom of the M-estimator, which are of independent interest.

Submitted 11 July, 2021; originally announced July 2021.

arXiv:2107.03826 [pdf, other]

Asymptotic normality of robust $M$-estimators with convex penalty

Authors: Pierre C Bellec, Yiwei Shen, Cun-Hui Zhang

Abstract: This paper develops asymptotic normality results for individual coordinates of robust M-estimators with convex penalty in high dimensions, where the dimension $p$ is at most of the same order as the sample size $n$, i.e., $p/n\leγ$ for some fixed constant $γ>0$. The asymptotic normality requires a bias correction and holds for most coordinates of the M-estimator for a large class of loss functions including the Huber loss and its smoothed versions regularized with a strongly convex penalty. The asymptotic variance that characterizes the width of the resulting confidence intervals is estimated with data-driven quantities. This estimate of the variance adapts automatically to low ($p/n\to0$) or high ($p/n \le γ$) dimensions and does not involve the proximal operators seen in previous works on asymptotic normality of M-estimators. For the Huber loss, the estimated variance has a simple expression involving an effective degrees-of-freedom as well as an effective sample size. The case of the Huber loss with Elastic-Net penalty is studied in detail and a simulation study confirms the theoretical findings. The asymptotic normality results follow from Stein formulae for high-dimensional random vectors on the sphere developed in the paper which are of independent interest.

Submitted 8 July, 2021; originally announced July 2021.

arXiv:2008.11840 [pdf, other]

Out-of-sample error estimate for robust M-estimators with convex penalty

Authors: Pierre C Bellec

Abstract: A generic out-of-sample error estimate is proposed for robust $M$-estimators regularized with a convex penalty in high-dimensional linear regression where $(X,y)$ is observed and $p,n$ are of the same order. If $ψ$ is the derivative of the robust data-fitting loss $ρ$, the estimate depends on the observed data only through the quantities $\hatψ= ψ(y-X\hatβ)$, $X^\top \hatψ$ and the derivatives $(\partial/\partial y) \hatψ$ and $(\partial/\partial y) X\hatβ$ for fixed $X$. The out-of-sample error estimate enjoys a relative error of order $n^{-1/2}$ in a linear model with Gaussian covariates and independent noise, either non-asymptotically when $p/n\le γ$ or asymptotically in the high-dimensional asymptotic regime $p/n\toγ'\in(0,\infty)$. General differentiable loss functions $ρ$ are allowed provided that $ψ=ρ'$ is 1-Lipschitz. The validity of the out-of-sample error estimate holds either under a strong convexity assumption, or for the $\ell_1$-penalized Huber M-estimator if the number of corrupted observations and sparsity of the true $β$ are bounded from above by $s_*n$ for some small enough constant $s_*\in(0,1)$ independent of $n,p$. For the square loss and in the absence of corruption in the response, the results additionally yield $n^{-1/2}$-consistent estimates of the noise variance and of the generalization error. This generalizes, to arbitrary convex penalty, estimates that were previously known for the Lasso.

Submitted 30 March, 2023; v1 submitted 26 August, 2020; originally announced August 2020.

Comments: This version adds simulations for the nuclear norm penalty

arXiv:1910.05480 [pdf, ps, other]

First order expansion of convex regularized estimators

Authors: Pierre C Bellec, Arun K Kuchibhotla

Abstract: We consider first order expansions of convex penalized estimators in high-dimensional regression problems with random designs. Our setting includes linear regression and logistic regression as special cases. For a given penalty function $h$ and the corresponding penalized estimator $\hatβ$, we construct a quantity $η$, the first order expansion of $\hatβ$, such that the distance between $\hatβ$ and $η$ is an order of magnitude smaller than the estimation error $\|\hatβ - β^*\|$. In this sense, the first order expansion $η$ can be thought of as a generalization of influence functions from the mathematical statistics literature to regularized estimators in high-dimensions. Such first order expansion implies that the risk of $\hatβ$ is asymptotically the same as the risk of $η$ which leads to a precise characterization of the MSE of $\hatβ$; this characterization takes a particularly simple form for isotropic design. Such first order expansion also leads to inference results based on $\hatβ$. We provide sufficient conditions for the existence of such first order expansion for three regularizers: the Lasso in its constrained form, the lasso in its penalized form, and the Group-Lasso. The results apply to general loss functions under some conditions and those conditions are satisfied for the squared loss in linear regression and for the logistic loss in the logistic model.

Submitted 8 March, 2020; v1 submitted 11 October, 2019; originally announced October 2019.

Comments: Accepted at NeurIPS 2019 and published at https://papers.nips.cc/paper/8606-first-order-expansion-of-convex-regularized-estimators . The version here includes the supplementary material

arXiv:1905.12517 [pdf, ps, other]

The cost-free nature of optimally tuning Tikhonov regularizers and other ordered smoothers

Authors: Pierre C Bellec, Dana Yang

Abstract: We consider the problem of selecting the best estimator among a family of Tikhonov regularized estimators, or, alternatively, to select a linear combination of these regularizers that is as good as the best regularizer in the family. Our theory reveals that if the Tikhonov regularizers share the same penalty matrix with different tuning parameters, a convex procedure based on $Q$-aggregation achieves the mean square error of the best estimator, up to a small error term no larger than $Cσ^2$, where $σ^2$ is the noise level and $C>0$ is an absolute constant. Remarkably, the error term does not depend on the penalty matrix or the number of estimators as long as they share the same penalty matrix, i.e., it applies to any grid of tuning parameters, no matter how large the cardinality of the grid is. This reveals the surprising "cost-free" nature of optimally tuning Tikhonov regularizers, in striking contrast with the existing literature on aggregation of estimators where one typically has to pay a cost of $σ^2\log(M)$ where $M$ is the number of estimators in the family. The result holds, more generally, for any family of ordered linear smoothers. This encompasses Ridge regression as well as Principal Component Regression. The result is extended to the problem of tuning Tikhonov regularizers with different penalty matrices.

Submitted 29 May, 2019; originally announced May 2019.

arXiv:1902.08885 [pdf, other]

De-Biasing The Lasso With Degrees-of-Freedom Adjustment

Authors: Pierre C. Bellec, Cun-Hui Zhang

Abstract: This paper studies schemes to de-bias the Lasso in a linear model $y=Xβ+ε$ where the goal is to construct confidence intervals for $a_0^Tβ$ in a direction $a_0$, where $X$ has iid $N(0,Σ)$ rows. We show that previously analyzed propositions to de-bias the Lasso require a modification in order to enjoy efficiency in a full range of sparsity. This modification takes the form of a degrees-of-freedom adjustment that accounts for the dimension of the model selected by Lasso. Let $s_0$ be the true sparsity. If $Σ$ is known and the ideal score vector proportional to $XΣ^{-1}a_0$ is used, the unadjusted de-biasing schemes proposed previously enjoy efficiency if $s_0\lll n^{2/3}$. However, if $s_0\ggg n^{2/3}$, the unadjusted schemes cannot be efficient in certain $a_0$: then it is necessary to modify existing procedures by a degrees-of-freedom adjustment. This modification grants asymptotic efficiency for any $a_0$ when $s_0/p\to 0$ and $s_0\log(p/s_0)/n \to 0$. If $Σ$ is unknown, efficiency is granted for general $a_0$ when $$\frac{s_0\log p}{n}+\min\Big\{\frac{s_Ω\log p}{n},\frac{\|Σ^{-1}a_0\|_1\sqrt{\log p}}{\|Σ^{-1/2}a_0\|_2 \sqrt n}\Big\}+\frac{\min(s_Ω,s_0)\log p}{\sqrt n}\to0$$ where $s_Ω=\|Σ^{-1}a_0\|_0$, provided that the de-biased estimate is modified with the degrees-of-freedom adjustment. The dependence in $s_0,s_Ω$ and $\|Σ^{-1}a_0\|_1$ is optimal. Our estimated score vector provides a novel methodology to handle dense $a_0$.
Our analysis shows that the degrees-of-freedom adjustment is not needed when the initial bias in direction $a_0$ is small, which is granted under stringent conditions on $Σ^{-1}$. The main proof argument is an interpolation path similar to that typically used to derive Slepian's lemma. It yields a new $\ell_\infty$ error bound for the Lasso which is of independent interest.

Submitted 8 July, 2021; v1 submitted 23 February, 2019; originally announced February 2019.

arXiv:1606.06179 [pdf, ps, other]

On the prediction loss of the lasso in the partially labeled setting

Authors: Pierre C. Bellec, Arnak S. Dalalyan, Edwin Grappin, Quentin Paris

Abstract: In this paper we revisit the risk bounds of the lasso estimator in the context of transductive and semi-supervised learning. In other terms, the setting under consideration is that of regression with random design under partial labeling. The main goal is to obtain user-friendly bounds on the off-sample prediction risk. To this end, the simple setting of bounded response variable and bounded (high-dimensional) covariates is considered. We propose some new adaptations of the lasso to these settings and establish oracle inequalities both in expectation and in deviation. These results provide non-asymptotic upper bounds on the risk that highlight the interplay between the bias due to the mis-specification of the linear model, the bias due to the approximate sparsity and the variance. They also demonstrate that the presence of a large number of unlabeled features may have a significant positive impact in situations where the restricted eigenvalue of the design matrix vanishes or is very small.

Submitted 8 November, 2016; v1 submitted 20 June, 2016; originally announced June 2016.

Comments: 25 pages

Showing 1–13 of 13 results for author: Bellec, P C