Search | arXiv e-print repository

Uncertainty quantification for iterative algorithms in linear models with application to early stopping

Abstract: This paper investigates the iterates $\hat{b}^1,\dots,\hat{b}^T$ obtained from iterative algorithms in high-dimensional linear regression problems, in the regime where the feature dimension $p$ is comparable with the sample size $n$, i.e., $p \asymp n$. The analysis and proposed estimators are applicable to Gradient Descent (GD), proximal GD and their accelerated variants such as Fast Iterative Soft-Thresholding (FISTA). The paper proposes novel estimators for the generalization error of the iterate $\hat{b}^t$ for any fixed iteration $t$ along the trajectory. These estimators are proved to be $\sqrt n$-consistent under Gaussian designs. Applications to early stopping are provided: when the generalization error of the iterates is a U-shaped function of the iteration $t$, the estimates make it possible to select from the data an iteration $\hat t$ that achieves the smallest generalization error along the trajectory. Additionally, we provide a technique for developing debiasing corrections and valid confidence intervals for the components of the true coefficient vector from the iterate $\hat{b}^t$ at any finite iteration $t$. Extensive simulations on synthetic data illustrate the theoretical results.
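
The early-stopping idea in this abstract can be illustrated with a minimal sketch. The abstract does not give the formula of the proposed generalization-error estimators, so the sketch below swaps in a plain held-out validation error as a stand-in. The function names `gd_iterates` and `select_iteration`, the step size, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def gd_iterates(X, y, step, T):
    """Return the gradient-descent iterates b^1, ..., b^T for the loss ||y - Xb||^2 / (2n)."""
    n, p = X.shape
    b = np.zeros(p)
    iterates = []
    for _ in range(T):
        # Gradient of the loss at b is -X^T (y - X b) / n
        b = b + step * X.T @ (y - X @ b) / n
        iterates.append(b.copy())
    return iterates

def select_iteration(iterates, X_val, y_val):
    """Pick the iteration with the smallest held-out squared error.

    Stand-in only: the paper's estimators use the training data alone,
    and their formula is not stated in the abstract.
    """
    errs = [float(np.mean((y_val - X_val @ b) ** 2)) for b in iterates]
    t_hat = int(np.argmin(errs)) + 1  # iterations are numbered from 1
    return t_hat, errs

# Simulated data in the regime where p is comparable with n (assumed setup)
rng = np.random.default_rng(0)
n, p, T = 200, 100, 50
beta = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)
X_val = rng.normal(size=(n, p))
y_val = X_val @ beta + rng.normal(size=n)

iterates = gd_iterates(X, y, step=0.1, T=T)
t_hat, errs = select_iteration(iterates, X_val, y_val)
print("selected iteration:", t_hat)
```

The held-out set costs extra data, which is the expense that a data-driven estimate computed from the training sample would avoid.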

Submitted 27 April, 2024; originally announced April 2024.

arXiv:2404.02070 [pdf, other]

Asymptotics of resampling without replacement in robust and logistic regression

Authors: Pierre C Bellec, Takuya Koriyama

Abstract: This paper studies the asymptotics of resampling without replacement in the proportional regime where dimension $p$ and sample size $n$ are of the same order. For a given dataset $(X,y)\in \mathbb{R}^{n\times p}\times \mathbb{R}^n$ and fixed subsample ratio $q\in(0,1)$, the practitioner samples independently of $(X,y)$ iid subsets $I_1,...,I_M$ of $\{1,...,n\}$ of size $q n$ and trains estimators $\hatβ(I_1),...,\hatβ(I_M)$ on the corresponding subsets of rows of $(X, y)$. Understanding the performance of the bagged estimate $\barβ = \frac1M\sum_{m=1}^M \hatβ(I_m)$, for instance its squared error, requires us to understand correlations between two distinct $\hatβ(I_m)$ and $\hatβ(I_{m'})$ trained on different subsets $I_m$ and $I_{m'}$. In robust linear regression and logistic regression, we characterize the limit in probability of the correlation between two estimates trained on different subsets of the data. The limit is characterized as the unique solution of a simple nonlinear equation. We further provide data-driven estimators that are consistent for estimating this limit. These estimators of the limiting correlation allow us to estimate the squared error of the bagged estimate $\barβ$, and for instance perform parameter tuning to choose the optimal subsample ratio $q$. As a by-product of the proof argument, we obtain the limiting distribution of the bivariate pair $(x_i^T \hatβ(I_m), x_i^T \hatβ(I_{m'}))$ for observations $i\in I_m\cap I_{m'}$, i.e., for observations used to train both estimates.
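
A short sketch may help show why the cross-correlations between subsample estimates matter for the bagged estimate. It draws subsets without replacement, averages the resulting estimates, and checks the algebraic identity $\|\barβ-β\|^2 = \frac{1}{M^2}\sum_{m,m'} (\hatβ(I_m)-β)^T(\hatβ(I_{m'})-β)$. Ridge regression is used purely as a convenient stand-in for the robust and logistic M-estimators studied in the paper. The helper names and the simulated data are assumptions.

```python
import numpy as np

def subsample_indices(n, q, M, rng):
    """Draw M subsets of {0, ..., n-1}, each of size int(q * n), without replacement."""
    k = int(q * n)
    return [rng.choice(n, size=k, replace=False) for _ in range(M)]

def fit_ridge(X, y, lam=1e-2):
    """Stand-in estimator (ridge); the paper studies robust and logistic M-estimators."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

# Simulated data (assumed setup)
rng = np.random.default_rng(0)
n, p, q, M = 400, 100, 0.5, 10
beta = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)

subsets = subsample_indices(n, q, M, rng)
estimates = np.array([fit_ridge(X[I], y[I]) for I in subsets])  # shape (M, p)
beta_bar = estimates.mean(axis=0)                               # bagged estimate

# Squared error of the bagged estimate, computed directly ...
bagged_sq_error = float(np.sum((beta_bar - beta) ** 2))

# ... and through the Gram matrix of the individual estimation errors.
errors = estimates - beta      # row m is beta_hat(I_m) - beta
gram = errors @ errors.T       # entry (m, m') is the inner product of two errors
via_gram = float(gram.sum() / M**2)

print(bagged_sq_error, via_gram)
```

The off-diagonal entries of `gram` are the quantities whose limits the abstract describes; the diagonal entries are the squared errors of the individual estimates.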

Submitted 16 April, 2024; v1 submitted 2 April, 2024; originally announced April 2024.

Comments: 25 pages, 10 figures

arXiv:2312.13257 [pdf, other]

Error estimation and adaptive tuning for unregularized robust M-estimator

Authors: Pierre C. Bellec, Takuya Koriyama

Abstract: We consider unregularized robust M-estimators for linear models under Gaussian design and heavy-tailed noise, in the proportional asymptotics regime where the sample size $n$ and the number of features $p$ are both increasing such that $p/n \to γ\in (0,1)$. An estimator of the out-of-sample error of a robust M-estimator is analysed and proved to be consistent for a large family of loss functions that includes the Huber loss. As an application of this result, we propose an adaptive tuning procedure for the scale parameter $λ>0$ of a given loss function $ρ$: choosing $\hat λ$ in a given interval $I$ that minimizes the out-of-sample error estimate of the M-estimator constructed with loss $ρ_λ(\cdot) = λ^2 ρ(\cdot/λ)$ leads to the optimal out-of-sample error over $I$. The proof relies on a smoothing argument: the unregularized M-estimation objective function is perturbed, or smoothed, with a Ridge penalty that vanishes as $n\to+\infty$, and we show that the unregularized M-estimator of interest inherits properties of its smoothed version.
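
The scaling $ρ_λ(\cdot) = λ^2 ρ(\cdot/λ)$ can be made concrete with a small sketch. It assumes $ρ$ is the Huber loss with threshold 1 (the abstract only says the family includes the Huber loss), in which case $ρ_λ$ is the Huber loss with threshold $λ$ and its derivative clips the residual to $[-λ, λ]$. The function names are hypothetical.

```python
import numpy as np

def huber(u):
    """Huber loss with threshold 1: u^2/2 if |u| <= 1, else |u| - 1/2."""
    u = np.asarray(u, dtype=float)
    return np.where(np.abs(u) <= 1.0, 0.5 * u**2, np.abs(u) - 0.5)

def huber_scaled(x, lam):
    """Scaled loss rho_lam(x) = lam^2 * rho(x / lam)."""
    x = np.asarray(x, dtype=float)
    return lam**2 * huber(x / lam)

def psi_scaled(x, lam):
    """Derivative of rho_lam: lam * psi(x / lam), which clips x to [-lam, lam]."""
    return np.clip(np.asarray(x, dtype=float), -lam, lam)

x = np.linspace(-5.0, 5.0, 11)
lam = 2.0

# Closed form implied by the scaling: quadratic on [-lam, lam], linear outside
closed_form = np.where(np.abs(x) <= lam, 0.5 * x**2, lam * np.abs(x) - 0.5 * lam**2)
print(np.allclose(huber_scaled(x, lam), closed_form))
```

Under this parametrization a large $λ$ makes the loss behave like the square loss on most residuals, while a small $λ$ makes it behave like a multiple of the absolute loss, which is why $λ$ is a natural tuning parameter.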

Submitted 20 December, 2023; originally announced December 2023.

Comments: 33 pages, 10 figures

arXiv:2312.13254 [pdf, ps, other]

Existence of solutions to the nonlinear equations characterizing the precise error of M-estimators

Authors: Pierre C Bellec, Takuya Koriyama

Abstract: Major progress has been made in the previous decade to characterize the asymptotic behavior of regularized M-estimators in high-dimensional regression problems in the proportional asymptotic regime where the sample size $n$ and the number of features $p$ are increasing simultaneously such that $n/p\to δ\in(0,\infty)$, using powerful tools such as Approximate Message Passing or the Convex Gaussian Min-Max Theorem (CGMT). The asymptotic error and behavior of the regularized M-estimator is then typically described by a system of nonlinear equations with a few scalar unknowns, and the solution to this system precisely characterizes the asymptotic error. Application of the CGMT and related machinery requires the existence of a solution to this low-dimensional system of equations. This paper resolves the question of existence of a solution to this low-dimensional system for the case of linear models with independent additive noise, when both the data-fitting loss function and regularization penalty are separable and convex. Such existence results for solutions to the nonlinear system were previously known under strong convexity for specific estimators such as the Lasso.
The main idea behind this existence result is inspired by an argument developed in Montanari et al. (2019) and Celentano et al. (2020) in different contexts: by constructing an ad-hoc convex minimization problem in an infinite-dimensional Hilbert space, the existence of the Lagrange multiplier for this optimization problem makes it possible to construct explicitly solutions to the low-dimensional system of interest. The conditions under which we derive this existence result exactly correspond to the side of the phase transition where perfect recovery $\hat x= x_0$ fails, so that these conditions are optimal.

Submitted 3 January, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

arXiv:2310.01374 [pdf, other]

Corrected generalized cross-validation for finite ensembles of penalized estimators

Authors: Pierre C. Bellec, Jin-Hong Du, Takuya Koriyama, Pratik Patil, Kai Tan

Abstract: Generalized cross-validation (GCV) is a widely-used method for estimating the squared out-of-sample prediction risk that employs a scalar degrees of freedom adjustment (in a multiplicative sense) to the squared training error. In this paper, we examine the consistency of GCV for estimating the prediction risk of arbitrary ensembles of penalized least-squares estimators. We show that GCV is inconsistent for any finite ensemble of size greater than one. Towards repairing this shortcoming, we identify a correction that involves an additional scalar correction (in an additive sense) based on degrees of freedom adjusted training errors from each ensemble component. The proposed estimator (termed CGCV) maintains the computational advantages of GCV and requires no sample splitting, model refitting, or out-of-bag risk estimation. The estimator stems from a finer inspection of the ensemble risk decomposition and two intermediate risk estimators for the components in this decomposition. We provide a non-asymptotic analysis of the CGCV and the two intermediate risk estimators for ensembles of convex penalized estimators under Gaussian features and a linear response model. Furthermore, in the special case of ridge regression, we extend the analysis to general feature and response distributions using random matrix theory, which establishes model-free uniform consistency of CGCV.
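
For readers unfamiliar with the multiplicative degrees-of-freedom adjustment mentioned in this abstract, here is a minimal sketch of ordinary GCV for a single ridge estimator: the training error divided by $(1-\mathrm{df}/n)^2$, with $\mathrm{df}$ the trace of the hat matrix. It does not implement the additive CGCV correction, whose formula is not given in the abstract. The function name `ridge_gcv`, the penalty scaling $nλ$, and the simulated data are assumptions.

```python
import numpy as np

def ridge_gcv(X, y, lam):
    """Ordinary GCV for ridge regression with penalty n * lam * ||b||^2.

    Returns (gcv, training_error, degrees_of_freedom).
    """
    n, p = X.shape
    # Hat matrix H = X (X^T X + n*lam*I)^{-1} X^T, so the fitted values are H y
    H = X @ np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T)
    df = float(np.trace(H))
    train_err = float(np.mean((y - H @ y) ** 2))
    gcv = train_err / (1.0 - df / n) ** 2
    return gcv, train_err, df

# Simulated data (assumed setup)
rng = np.random.default_rng(0)
n, p = 300, 150
beta = rng.normal(size=p) / np.sqrt(p)
X = rng.normal(size=(n, p))
y = X @ beta + rng.normal(size=n)

lams = [0.01, 0.1, 1.0, 10.0]
results = [ridge_gcv(X, y, lam) for lam in lams]
best_lam = lams[int(np.argmin([r[0] for r in results]))]
print("lambda minimizing GCV:", best_lam)
```

Because the adjustment only needs the fitted residuals and a trace, no refitting or sample splitting is involved, which is the computational advantage the abstract says CGCV preserves.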

Submitted 21 April, 2024; v1 submitted 2 October, 2023; originally announced October 2023.

Comments: 91 pages, 34 figures; this version adds general proof outlines (in Sections 4.3 and 5.3), add more experiments with non-Gaussian data (in Sections D and E), relaxes an assumption (in Section A.7), clarifies explanations at several places, and corrects minor typos at several places

arXiv:2305.17825 [pdf, other]

Multinomial Logistic Regression: Asymptotic Normality on Null Covariates in High-Dimensions

Authors: Kai Tan, Pierre C. Bellec

Abstract: This paper investigates the asymptotic distribution of the maximum-likelihood estimate (MLE) in multinomial logistic models in the high-dimensional regime where dimension and sample size are of the same order. While classical large-sample theory provides asymptotic normality of the MLE under certain conditions, such classical results are expected to fail in high-dimensions as documented for the binary logistic case in the seminal work of Sur and Candès [2019]. We address this issue in classification problems with 3 or more classes, by developing asymptotic normality and asymptotic chi-square results for the multinomial logistic MLE (also known as cross-entropy minimizer) on null covariates. Our theory leads to a new methodology to test the significance of a given feature. Extensive simulation studies on synthetic data corroborate these asymptotic results and confirm the validity of proposed p-values for testing the significance of a given feature.

Submitted 28 May, 2023; originally announced May 2023.

arXiv:2206.07256 [pdf, other]

Noise Covariance Estimation in Multi-Task High-dimensional Linear Models

Authors: Kai Tan, Gabriel Romon, Pierre C Bellec

Abstract: This paper studies multi-task high-dimensional linear regression models where the noise among different tasks is correlated, in the moderately high-dimensional regime where sample size $n$ and dimension $p$ are of the same order. Our goal is to estimate the covariance matrix of the noise random vectors, or equivalently the correlation of the noise variables on any pair of tasks. Treating the regression coefficients as a nuisance parameter, we leverage the multi-task elastic-net and multi-task lasso estimators to estimate the nuisance. By precisely understanding the bias of the squared residual matrix and by correcting this bias, we develop a novel estimator of the noise covariance that converges in Frobenius norm at the rate $n^{-1/2}$ when the covariates are Gaussian. This novel estimator is efficiently computable. Under suitable conditions, the proposed estimator of the noise covariance attains the same rate of convergence as the "oracle" estimator that knows in advance the regression coefficients of the multi-task model. The Frobenius error bounds obtained in this paper also illustrate the advantage of this new estimator compared to a method-of-moments estimator that does not attempt to estimate the nuisance. As a byproduct of our techniques, we obtain an estimate of the generalization error of the multi-task elastic-net and multi-task lasso estimators. Extensive simulation studies are carried out to illustrate the numerical performance of the proposed method.

Submitted 14 June, 2022; originally announced June 2022.

arXiv:2204.06990 [pdf, other]

Observable adjustments in single-index models for regularized M-estimators

Authors: Pierre C Bellec

Abstract: We consider observations $(X,y)$ from single index models with unknown link function, Gaussian covariates and a regularized M-estimator $\hatβ$ constructed from convex loss function and regularizer. In the regime where sample size $n$ and dimension $p$ are both increasing such that $p/n$ has a finite limit, the behavior of the empirical distribution of $\hatβ$ and the predicted values $X\hatβ$ has been previously characterized in a number of models: The empirical distributions are known to converge to proximal operators of the loss and penalty in a related Gaussian sequence model, which captures the interplay between ratio $p/n$, loss, regularization and the data generating process. This connection between $(\hatβ,X\hatβ)$ and the corresponding proximal operators requires solving fixed-point equations that typically involve unobservable quantities such as the prior distribution on the index or the link function. This paper develops a different theory to describe the empirical distribution of $\hatβ$ and $X\hatβ$: Approximations of $(\hatβ,X\hatβ)$ in terms of proximal operators are provided that only involve observable adjustments. These proposed observable adjustments are data-driven, e.g., they do not require prior knowledge of the index or the link function. These new adjustments yield confidence intervals for individual components of the index, as well as estimators of the correlation of $\hatβ$ with the index.
The interplay between loss, regularization and the model is thus captured in a data-driven manner, without solving the fixed-point equations studied in previous works. The results apply to both strongly convex regularizers and unregularized M-estimation. Simulations are provided for the square and logistic loss in single index models including logistic regression and 1-bit compressed sensing with 20% corrupted bits.

Submitted 3 January, 2024; v1 submitted 14 April, 2022; originally announced April 2022.

arXiv:2107.07828 [pdf, other]

Chi-square and normal inference in high-dimensional multi-task regression

Authors: Pierre C Bellec, Gabriel Romon

Abstract: The paper proposes chi-square and normal inference methodologies for the unknown coefficient matrix $B^*$ of size $p\times T$ in a Multi-Task (MT) linear model with $p$ covariates, $T$ tasks and $n$ observations under a row-sparse assumption on $B^*$. The row-sparsity $s$, dimension $p$ and number of tasks $T$ are allowed to grow with $n$. In the high-dimensional regime $p\ggg n$, in order to leverage row-sparsity, the MT Lasso is considered. We build upon the MT Lasso with a de-biasing scheme to correct for the bias induced by the penalty. This scheme requires the introduction of a new data-driven object, coined the interaction matrix, that captures effective correlations between noise vector and residuals on different tasks. This matrix is psd, of size $T\times T$ and can be computed efficiently. The interaction matrix lets us derive asymptotic normal and $χ^2_T$ results under Gaussian design and $\frac{sT+s\log(p/s)}{n}\to0$ which corresponds to consistency in Frobenius norm. These asymptotic distribution results yield valid confidence intervals for single entries of $B^*$ and valid confidence ellipsoids for single rows of $B^*$, for both known and unknown design covariance $Σ$.
While previous proposals in grouped-variables regression require row-sparsity $s\lesssim\sqrt n$ up to constants depending on $T$ and logarithmic factors in $n,p$, the de-biasing scheme using the interaction matrix provides confidence intervals and $χ^2_T$ confidence ellipsoids under the conditions ${\min(T^2,\log^8p)}/{n}\to 0$ and $$ \frac{sT+s\log(p/s)+\|Σ^{-1}e_j\|_0\log p}{n}\to0, \quad \frac{\min(s,\|Σ^{-1}e_j\|_0)}{\sqrt n} \sqrt{[T+\log(p/s)]\log p}\to 0, $$ allowing row-sparsity $s\ggg\sqrt n$ when $\|Σ^{-1}e_j\|_0 \sqrt T\lll \sqrt{n}$ up to logarithmic factors.

Submitted 16 July, 2021; originally announced July 2021.

arXiv:2107.05143 [pdf, other]

Derivatives and residual distribution of regularized M-estimators with application to adaptive tuning

Authors: Pierre C Bellec, Yiwei Shen

Abstract: This paper studies M-estimators with gradient-Lipschitz loss function regularized with convex penalty in linear models with Gaussian design matrix and arbitrary noise distribution. A practical example is the robust M-estimator constructed with the Huber loss and the Elastic-Net penalty when the noise distribution has heavy tails. Our main contributions are three-fold. (i) We provide general formulae for the derivatives of regularized M-estimators $\hatβ(y,X)$ where differentiation is taken with respect to both $y$ and $X$; this reveals a simple differentiability structure shared by all convex regularized M-estimators. (ii) Using these derivatives, we characterize the distribution of the residual $r_i = y_i-x_i^\top\hatβ$ in the intermediate high-dimensional regime where dimension and sample size are of the same order. (iii) Motivated by the distribution of the residuals, we propose a novel adaptive criterion to select tuning parameters of regularized M-estimators. The criterion approximates the out-of-sample error up to an additive constant independent of the estimator, so that minimizing the criterion provides a proxy for minimizing the out-of-sample error. The proposed adaptive criterion does not require the knowledge of the noise distribution or of the covariance of the design. Simulated data confirm the theoretical findings, regarding both the distribution of the residuals and the success of the criterion as a proxy of the out-of-sample error.
Finally, our results reveal new relationships between the derivatives of $\hatβ(y,X)$ and the effective degrees of freedom of the M-estimator, which are of independent interest.

Submitted 11 July, 2021; originally announced July 2021.

arXiv:2107.03826 [pdf, other]

Asymptotic normality of robust $M$-estimators with convex penalty

Authors: Pierre C Bellec, Yiwei Shen, Cun-Hui Zhang

Abstract: This paper develops asymptotic normality results for individual coordinates of robust M-estimators with convex penalty in high-dimensions, where the dimension $p$ is at most of the same order as the sample size $n$, i.e., $p/n\leγ$ for some fixed constant $γ>0$. The asymptotic normality requires a bias correction and holds for most coordinates of the M-estimator for a large class of loss functions including the Huber loss and its smoothed versions regularized with a strongly convex penalty. The asymptotic variance that characterizes the width of the resulting confidence intervals is estimated with data-driven quantities. This estimate of the variance adapts automatically to low ($p/n\to0$) or high ($p/n \le γ$) dimensions and does not involve the proximal operators seen in previous works on asymptotic normality of M-estimators. For the Huber loss, the estimated variance has a simple expression involving an effective degrees-of-freedom as well as an effective sample size. The case of the Huber loss with Elastic-Net penalty is studied in detail and a simulation study confirms the theoretical findings. The asymptotic normality results follow from Stein formulae for high-dimensional random vectors on the sphere developed in the paper which are of independent interest.

Submitted 8 July, 2021; originally announced July 2021.

arXiv:2008.11840 [pdf, other]

Out-of-sample error estimate for robust M-estimators with convex penalty

Authors: Pierre C Bellec

Abstract: A generic out-of-sample error estimate is proposed for robust $M$-estimators regularized with a convex penalty in high-dimensional linear regression where $(X,y)$ is observed and $p,n$ are of the same order. If $ψ$ is the derivative of the robust data-fitting loss $ρ$, the estimate depends on the observed data only through the quantities $\hatψ= ψ(y-X\hatβ)$, $X^\top \hatψ$ and the derivatives $(\partial/\partial y) \hatψ$ and $(\partial/\partial y) X\hatβ$ for fixed $X$. The out-of-sample error estimate enjoys a relative error of order $n^{-1/2}$ in a linear model with Gaussian covariates and independent noise, either non-asymptotically when $p/n\le γ$ or asymptotically in the high-dimensional asymptotic regime $p/n\toγ'\in(0,\infty)$. General differentiable loss functions $ρ$ are allowed provided that $ψ=ρ'$ is 1-Lipschitz. The validity of the out-of-sample error estimate holds either under a strong convexity assumption, or for the $\ell_1$-penalized Huber M-estimator if the number of corrupted observations and sparsity of the true $β$ are bounded from above by $s_*n$ for some small enough constant $s_*\in(0,1)$ independent of $n,p$. For the square loss and in the absence of corruption in the response, the results additionally yield $n^{-1/2}$-consistent estimates of the noise variance and of the generalization error. This generalizes, to arbitrary convex penalty, estimates that were previously known for the Lasso.

Submitted 30 March, 2023; v1 submitted 26 August, 2020; originally announced August 2020.

Comments: This version adds simulations for the nuclear norm penalty

arXiv:1912.11943 [pdf, other]

De-biasing convex regularized estimators and interval estimation in linear models

Authors: Pierre C Bellec, Cun-Hui Zhang

Abstract: New upper bounds are developed for the $L_2$ distance between $ξ/\text{Var}[ξ]^{1/2}$ and linear and quadratic functions of $z\sim N(0,I_n)$ for random variables of the form $ξ=z^\top f(z) - \text{div} f(z)$. The linear approximation yields a central limit theorem when the squared norm of $f(z)$ dominates the squared Frobenius norm of $\nabla f(z)$ in expectation. Applications of this normal approximation are given for the asymptotic normality of de-biased estimators in linear regression with correlated design and convex penalty in the regime $p/n \le γ$ for constant $γ\in(0,{\infty})$. For the estimation of linear functions $\langle a_0,β\rangle$ of the unknown coefficient vector $β$, this analysis leads to asymptotic normality of the de-biased estimate for most normalized directions $a_0$, where "most" is quantified in a precise sense. This asymptotic normality holds for any convex penalty if $γ<1$ and for any strongly convex penalty if $γ\ge 1$. In particular the penalty need not be separable or permutation invariant. By allowing arbitrary regularizers, the results vastly broaden the scope of applicability of de-biasing methodologies to obtain confidence intervals in high-dimensions. In the absence of strong convexity for $p>n$, asymptotic normality of the de-biased estimate is obtained for the Lasso and the group Lasso under additional conditions. For general convex penalties, our analysis also provides prediction and estimation error bounds of independent interest.

Submitted 28 September, 2021; v1 submitted 26 December, 2019; originally announced December 2019.

Comments: Manuscript title was updated; see former title at arXiv:1912.11943v3

arXiv:1910.05480 [pdf, ps, other]

First order expansion of convex regularized estimators

Authors: Pierre C Bellec, Arun K Kuchibhotla

Abstract: We consider first order expansions of convex penalized estimators in high-dimensional regression problems with random designs. Our setting includes linear regression and logistic regression as special cases. For a given penalty function $h$ and the corresponding penalized estimator $\hatβ$, we construct a quantity $η$, the first order expansion of $\hatβ$, such that the distance between $\hatβ$ and $η$ is an order of magnitude smaller than the estimation error $\|\hatβ - β^*\|$. In this sense, the first order expansion $η$ can be thought of as a generalization of influence functions from the mathematical statistics literature to regularized estimators in high-dimensions. Such first order expansion implies that the risk of $\hatβ$ is asymptotically the same as the risk of $η$ which leads to a precise characterization of the MSE of $\hatβ$; this characterization takes a particularly simple form for isotropic design. Such first order expansion also leads to inference results based on $\hatβ$. We provide sufficient conditions for the existence of such first order expansion for three regularizers: the Lasso in its constrained form, the Lasso in its penalized form, and the Group-Lasso. The results apply to general loss functions under some conditions and those conditions are satisfied for the squared loss in linear regression and for the logistic loss in the logistic model.

Submitted 8 March, 2020; v1 submitted 11 October, 2019; originally announced October 2019.

Comments: Accepted at NeurIPS 2019 and published at https://papers.nips.cc/paper/8606-first-order-expansion-of-convex-regularized-estimators . The version here includes the supplementary material

arXiv:1905.12517 [pdf, ps, other]

The cost-free nature of optimally tuning Tikhonov regularizers and other ordered smoothers

Authors: Pierre C Bellec, Dana Yang

Abstract: We consider the problem of selecting the best estimator among a family of Tikhonov regularized estimators, or, alternatively, to select a linear combination of these regularizers that is as good as the best regularizer in the family. Our theory reveals that if the Tikhonov regularizers share the same penalty matrix with different tuning parameters, a convex procedure based on $Q$-aggregation achieves the mean square error of the best estimator, up to a small error term no larger than $Cσ^2$, where $σ^2$ is the noise level and $C>0$ is an absolute constant. Remarkably, the error term does not depend on the penalty matrix or the number of estimators as long as they share the same penalty matrix, i.e., it applies to any grid of tuning parameters, no matter how large the cardinality of the grid is. This reveals the surprising "cost-free" nature of optimally tuning Tikhonov regularizers, in striking contrast with the existing literature on aggregation of estimators where one typically has to pay a cost of $σ^2\log(M)$ where $M$ is the number of estimators in the family. The result holds, more generally, for any family of ordered linear smoothers. This encompasses Ridge regression as well as Principal Component Regression. The result is extended to the problem of tuning Tikhonov regularizers with different penalty matrices.

Submitted 29 May, 2019; originally announced May 2019.

arXiv:1902.08885 [pdf, other]

De-Biasing The Lasso With Degrees-of-Freedom Adjustment

Authors: Pierre C. Bellec, Cun-Hui Zhang

Abstract: This paper studies schemes to de-bias the Lasso in a linear model $y=Xβ+ε$ where the goal is to construct confidence intervals for $a_0^Tβ$ in a direction $a_0$, where $X$ has iid $N(0,Σ)$ rows. We show that previously analyzed propositions to de-bias the Lasso require a modification in order to enjoy efficiency in a full range of sparsity. This modification takes the form of a degrees-of-freedom adjustment that accounts for the dimension of the model selected by Lasso. Let $s_0$ be the true sparsity. If $Σ$ is known and the ideal score vector proportional to $XΣ^{-1}a_0$ is used, the unadjusted de-biasing schemes proposed previously enjoy efficiency if $s_0\lll n^{2/3}$. However, if $s_0\ggg n^{2/3}$, the unadjusted schemes cannot be efficient in certain $a_0$: then it is necessary to modify existing procedures by a degrees-of-freedom adjustment. This modification grants asymptotic efficiency for any $a_0$ when $s_0/p\to 0$ and $s_0\log(p/s_0)/n \to 0$. If $Σ$ is unknown, efficiency is granted for general $a_0$ when $$\frac{s_0\log p}{n}+\min\Big\{\frac{s_Ω\log p}{n},\frac{\|Σ^{-1}a_0\|_1\sqrt{\log p}}{\|Σ^{-1/2}a_0\|_2 \sqrt n}\Big\}+\frac{\min(s_Ω,s_0)\log p}{\sqrt n}\to0$$ where $s_Ω=\|Σ^{-1}a_0\|_0$, provided that the de-biased estimate is modified with the degrees-of-freedom adjustment. The dependence in $s_0,s_Ω$ and $\|Σ^{-1}a_0\|_1$ is optimal. Our estimated score vector provides a novel methodology to handle dense $a_0$.
Our analysis shows that the degrees-of-freedom adjustment is not needed when the initial bias in direction $a_0$ is small, which is granted under stringent conditions on $Σ^{-1}$. The main proof argument is an interpolation path similar to that typically used to derive Slepian's lemma. It yields a new $\ell_\infty$ error bound for the Lasso which is of independent interest.

Submitted 8 July, 2021; v1 submitted 23 February, 2019; originally announced February 2019.

arXiv:1901.08736 [pdf, ps, other]

Concentration of quadratic forms under a Bernstein moment assumption

Authors: Pierre C Bellec

Abstract: A concentration result for quadratic forms of independent subgaussian random variables is derived. If the moments of the random variables satisfy a "Bernstein condition", then the variance term of the Hanson-Wright inequality can be improved. The Bernstein condition is satisfied, for instance, by all log-concave subgaussian distributions.

Submitted 24 January, 2019; originally announced January 2019.

Comments: This short note presents a result that initially appeared in arXiv:1410.0346v1 (see Assumption 3.3). The result was later removed from arXiv:1410.0346 and the published version https://projecteuclid.org/euclid.aos/1519268423 due to space constraints

arXiv:1811.04121 [pdf, other]

Second order Stein: SURE for SURE and other applications in high-dimensional inference

Authors: Pierre C Bellec, Cun-Hui Zhang

Abstract: Stein's formula states that a random variable of the form $z^\top f(z) - \text{div} f(z)$ is mean-zero for functions $f$ with integrable gradient. Here, $\text{div} f$ is the divergence of the function $f$ and $z$ is a standard normal vector. This paper aims to propose a Second Order Stein formula to characterize the variance of such random variables for all functions $f(z)$ with square integrable gradient, and to demonstrate the usefulness of this formula in various applications. In the Gaussian sequence model, a consequence of Stein's formula is Stein's Unbiased Risk Estimate (SURE), an unbiased estimate of the mean squared risk for almost any estimator $\hatμ$ of the unknown mean. A first application of the Second Order Stein formula is an Unbiased Risk Estimate for SURE itself (SURE for SURE): an unbiased estimate providing information about the squared distance between SURE and the squared estimation error of $\hatμ$. SURE for SURE has a simple form as a function of the data and is applicable to all $\hatμ$ with square integrable gradient, e.g. the Lasso and the Elastic Net.
In addition to SURE for SURE, the following applications are developed: (1) Upper bounds on the risk of SURE when the estimation target is the mean squared error; (2) Confidence regions based on SURE; (3) Oracle inequalities satisfied by SURE-tuned estimates; (4) An upper bound on the variance of the size of the model selected by the Lasso; (5) Explicit expressions of SURE for SURE for the Lasso and the Elastic-Net; (6) In the linear model, a general semi-parametric scheme to de-bias a differentiable initial estimator for inference of a low-dimensional projection of the unknown $β$, with a characterization of the variance after de-biasing; and (7) An accuracy analysis of a Gaussian Monte Carlo scheme to approximate the divergence of functions $f: R^n\to R^n$.
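The first-order identity quoted in this abstract, that $z^\top f(z) - \text{div} f(z)$ is mean-zero for standard normal $z$, can be illustrated by simulation. The sketch below is not from the paper: it assumes componentwise soft-thresholding as the function $f$ (an example chosen here), for which the divergence is the number of coordinates with $|z_i|$ above the threshold. The names `soft_threshold`, `stein_statistic`, `lam` and `reps` are hypothetical.

```python
import numpy as np

def soft_threshold(z, lam):
    """Componentwise soft-thresholding, the example choice of f."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def stein_statistic(z, lam):
    """z^T f(z) - div f(z) for f = soft-thresholding at level lam."""
    f = soft_threshold(z, lam)
    # The derivative of each coordinate map is 1 where |z_i| > lam, else 0.
    div = np.count_nonzero(np.abs(z) > lam)
    return z @ f - div

# Monte Carlo average over independent standard normal vectors.
rng = np.random.default_rng(0)
n, lam, reps = 20, 1.0, 20000
samples = np.array(
    [stein_statistic(rng.standard_normal(n), lam) for _ in range(reps)]
)
mean_estimate = samples.mean()
std_error = samples.std(ddof=1) / np.sqrt(reps)
```

If the identity holds, `mean_estimate` should be within a few multiples of `std_error` of zero. The spread of `samples` is the kind of quantity that a second-order formula describes, although this simulation does not implement that formula.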

Submitted 6 February, 2020; v1 submitted 9 November, 2018; originally announced November 2018.

arXiv:1804.01230 [pdf, ps, other]

The noise barrier and the large signal bias of the Lasso and other convex estimators

Authors: Pierre C Bellec

Abstract: Convex estimators such as the Lasso, the matrix Lasso and the group Lasso have been studied extensively in the last two decades, demonstrating great success in both theory and practice. Two quantities are introduced, the noise barrier and the large signal bias, that provide insights on the performance of these convex regularized estimators. It is now well understood that the Lasso achieves fast prediction rates, provided that the correlations of the design satisfy some Restricted Eigenvalue or Compatibility condition, and provided that the tuning parameter is large enough. Using the two quantities introduced in the paper, we show that the compatibility condition on the design matrix is actually unavoidable to achieve fast prediction rates with the Lasso. The Lasso must incur a loss due to the correlations of the design matrix, measured in terms of the compatibility constant. This result holds for any design matrix, any active subset of covariates, and any tuning parameter. It is now well known that the Lasso enjoys a dimension reduction property: the prediction error is of order $λ\sqrt k$ where $k$ is the sparsity, even if the ambient dimension $p$ is much larger than $k$. Such results require that the tuning parameter is greater than some universal threshold. We characterize sharp phase transitions for the tuning parameter of the Lasso around a critical threshold dependent on $k$. If $λ$ is equal to or larger than this critical threshold, the Lasso is minimax over $k$-sparse target vectors.
If $λ$ is equal to or smaller than the critical threshold, the Lasso incurs a loss of order $σ\sqrt k$ (which corresponds to a model of size $k$) even if the target vector has fewer than $k$ nonzero coefficients. Remarkably, the lower bounds obtained in the paper also apply to random, data-driven tuning parameters. The results extend to convex penalties beyond the Lasso.

Submitted 27 October, 2018; v1 submitted 3 April, 2018; originally announced April 2018.

Comments: This paper supersedes the previous article arXiv:1703.01332

arXiv:1706.06977 [pdf, other]

A sharp oracle inequality for Graph-Slope

Authors: Pierre C Bellec, Joseph Salmon, Samuel Vaiter

Abstract: Following recent success on the analysis of the Slope estimator, we provide a sharp oracle inequality in terms of prediction error for Graph-Slope, a generalization of Slope to signals observed over a graph. In addition to improving upon best results obtained so far for the Total Variation denoiser (also referred to as Graph-Lasso or Generalized Lasso), we propose an efficient algorithm to compute Graph-Slope. The proposed algorithm is obtained by applying the forward-backward method to the dual formulation of the Graph-Slope optimization problem. We also provide experiments showing the interest of the method.

Submitted 20 November, 2017; v1 submitted 21 June, 2017; originally announced June 2017.

arXiv:1705.10696 [pdf, ps, other]

Localized Gaussian width of $M$-convex hulls with applications to Lasso and convex aggregation

Authors: Pierre C Bellec

Abstract: Upper and lower bounds are derived for the Gaussian mean width of the intersection of a convex hull of $M$ points with a Euclidean ball of a given radius. The upper bound holds for any collection of extreme points bounded in Euclidean norm. The upper bound and the lower bound match up to a multiplicative constant whenever the extreme points satisfy a one-sided Restricted Isometry Property. This bound is then applied to study the Lasso estimator in fixed-design regression, the Empirical Risk Minimizer in the anisotropic persistence problem, and the convex aggregation problem in density estimation.

Submitted 26 September, 2017; v1 submitted 30 May, 2017; originally announced May 2017.

arXiv:1703.01332 [pdf, ps, other]

Optimistic lower bounds for convex regularized least-squares

Authors: Pierre C Bellec

Abstract: Minimax lower bounds are pessimistic in nature: for any given estimator, minimax lower bounds yield the existence of a worst-case target vector $β^*_{worst}$ for which the prediction error of the given estimator is bounded from below. However, minimax lower bounds shed no light on the prediction error of the given estimator for target vectors different than $β^*_{worst}$. A characterization of the prediction error of any convex regularized least-squares estimator is given. This characterization provides both a lower bound and an upper bound on the prediction error. This produces lower bounds that are applicable for any target vector and not only for a single, worst-case $β^*_{worst}$. Finally, these lower and upper bounds on the prediction error are applied to the Lasso in sparse linear regression. We obtain a lower bound involving the compatibility constant for any tuning parameter, matching upper and lower bounds for the universal choice of the tuning parameter, and a lower bound for the Lasso with small tuning parameter.

Submitted 6 October, 2017; v1 submitted 3 March, 2017; originally announced March 2017.

arXiv:1701.09120 [pdf, ps, other]

Towards the study of least squares estimators with convex penalty

Authors: Pierre C. Bellec, Guillaume Lecué, Alexandre B. Tsybakov

Abstract: Penalized least squares estimation is a popular technique in high-dimensional statistics. It includes such methods as the LASSO, the group LASSO, and the nuclear norm penalized least squares. The existing theory of these methods is not fully satisfying since it allows one to prove oracle inequalities with fixed high probability only for the estimators depending on this probability. Furthermore, the control of compatibility factors appearing in the oracle bounds is often not explicit. Some very recent developments suggest that the theory of oracle inequalities can be revised in an improved way. In this paper, we provide an overview of ideas and tools leading to such an improved theory. We show that, along with overcoming the disadvantages mentioned above, the methodology extends to the Hilbertian framework and it applies to a large class of convex penalties. This paper is partly expository. In particular, we provide adapted proofs of some results from other recent work.

Submitted 7 July, 2017; v1 submitted 31 January, 2017; originally announced January 2017.

arXiv:1609.06675 [pdf, ps, other]

Bounds on the prediction error of penalized least squares estimators with convex penalty

Authors: Pierre C. Bellec, Alexandre B. Tsybakov

Abstract: This paper considers the penalized least squares estimator with arbitrary convex penalty. When the observation noise is Gaussian, we show that the prediction error is a subgaussian random variable concentrated around its median. We apply this concentration property to derive sharp oracle inequalities for the prediction error of the LASSO, the group LASSO and the SLOPE estimators, both in probability and in expectation. In contrast to the previous work on the LASSO type methods, our oracle inequalities in probability are obtained at any confidence level for estimators with tuning parameters that do not depend on the confidence level. This is also the reason why we are able to establish sparsity oracle bounds in expectation for the LASSO type estimators, while the previously known techniques did not allow for the control of the expected risk. In addition, we show that the concentration rate in the oracle inequalities is better than it was commonly understood before.

Submitted 21 September, 2016; originally announced September 2016.

arXiv:1606.06179 [pdf, ps, other]

On the prediction loss of the lasso in the partially labeled setting

Authors: Pierre C. Bellec, Arnak S. Dalalyan, Edwin Grappin, Quentin Paris

Abstract: In this paper we revisit the risk bounds of the lasso estimator in the context of transductive and semi-supervised learning. In other terms, the setting under consideration is that of regression with random design under partial labeling. The main goal is to obtain user-friendly bounds on the off-sample prediction risk. To this end, the simple setting of bounded response variable and bounded (high-dimensional) covariates is considered. We propose some new adaptations of the lasso to these settings and establish oracle inequalities both in expectation and in deviation. These results provide non-asymptotic upper bounds on the risk that highlight the interplay between the bias due to the mis-specification of the linear model, the bias due to the approximate sparsity and the variance. They also demonstrate that the presence of a large number of unlabeled features may have a significant positive impact in situations where the restricted eigenvalue of the design matrix vanishes or is very small.

Submitted 8 November, 2016; v1 submitted 20 June, 2016; originally announced June 2016.

Comments: 25 pages

arXiv:1605.08651 [pdf, ps, other]

Slope meets Lasso: improved oracle bounds and optimality

Authors: Pierre C. Bellec, Guillaume Lecué, Alexandre B. Tsybakov

Abstract: We show that two polynomial time methods, a Lasso estimator with adaptively chosen tuning parameter and a Slope estimator, adaptively achieve the exact minimax prediction and $\ell_2$ estimation rate $(s/n)\log (p/s)$ in high-dimensional linear regression on the class of $s$-sparse target vectors in $\mathbb R^p$. This is done under the Restricted Eigenvalue (RE) condition for the Lasso and under a slightly more constraining assumption on the design for the Slope. The main results have the form of sharp oracle inequalities accounting for the model misspecification error. The minimax optimal bounds are also obtained for the $\ell_q$ estimation errors with $1\le q\le 2$ when the model is well-specified. The results are non-asymptotic, and hold both in probability and in expectation. The assumptions that we impose on the design are satisfied with high probability for a large class of random matrices with independent and possibly anisotropically distributed rows. We give a comparative analysis of conditions, under which oracle bounds for the Lasso and Slope estimators can be obtained. In particular, we show that several known conditions, such as the RE condition and the sparse eigenvalue condition are equivalent if the $\ell_2$-norms of regressors are uniformly bounded.

Submitted 24 May, 2017; v1 submitted 27 May, 2016; originally announced May 2016.

arXiv:1602.03427 [pdf, ps, other]

Aggregation of supports along the Lasso path

Authors: Pierre C. Bellec

Abstract: In linear regression with fixed design, we propose two procedures that aggregate a data-driven collection of supports. The collection is a subset of the $2^p$ possible supports and both its cardinality and its elements can depend on the data. The procedures satisfy oracle inequalities with no assumption on the design matrix. Then we use these procedures to aggregate the supports that appear on the regularization path of the Lasso in order to construct an estimator that mimics the best Lasso estimator. If the restricted eigenvalue condition on the design matrix is satisfied, then this estimator achieves optimal prediction bounds. Finally, we discuss the computational cost of these procedures.

Submitted 31 May, 2016; v1 submitted 10 February, 2016; originally announced February 2016.

arXiv:1601.05766 [pdf, ps, other]

Adaptive confidence sets in shape restricted regression

Authors: Pierre C. Bellec

Abstract: A simple construction of adaptive confidence sets is proposed in isotonic, convex and unimodal regression. In univariate isotonic regression, the proposed confidence set enjoys uniform coverage over all non-decreasing regression functions. Furthermore, the diameter of the proposed confidence set automatically adapts to the unknown number of pieces of the true parameter, in the sense that the diameter is bounded from above by the minimax risk over the class of $k$-piecewise constant functions. The diameter of the confidence set is a simple increasing function of the number of jumps of the isotonic least-squares estimate. A similar construction is proposed in convex regression where the true regression function is convex and piecewise affine. Here, the confidence set enjoys uniform coverage and its diameter automatically adapts to the number of affine pieces of the true regression function. The diameter of the confidence set is an increasing function of the number of affine pieces of the convex least-squares estimate. We explain how to extend this technique to a non-convex set by proposing a similar adaptive confidence set in unimodal regression. The confidence set automatically adapts to the number of jumps of the true unimodal regression function and its diameter is an increasing function of the number of jumps of the unimodal least-squares estimate.
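The data-driven quantity this abstract refers to, the number of jumps of the isotonic least-squares estimate, is straightforward to compute. The sketch below is an illustration only and does not reproduce the paper's confidence set: it assumes the standard pool-adjacent-violators algorithm for the isotonic fit, and the function name `isotonic_least_squares` is hypothetical.

```python
import numpy as np

def isotonic_least_squares(y):
    """Non-decreasing least-squares fit via pool-adjacent-violators.

    Returns the fitted vector and its number of jumps
    (the number of constant pieces minus one).
    """
    blocks = []  # each block stores [sum of values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge while the previous block mean is not strictly smaller,
        # so adjacent blocks always end up with strictly increasing means.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fit = np.concatenate([np.full(count, total / count) for total, count in blocks])
    return fit, len(blocks) - 1

# Small hand-checkable example: the violating pair (3, 2) is pooled to 2.5.
fit, jumps = isotonic_least_squares([1.0, 3.0, 2.0, 4.0])
```

In this example `fit` has three constant pieces and therefore two jumps. How the jump count enters the diameter of the confidence set is specified in the paper and is not attempted here.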

Submitted 9 April, 2019; v1 submitted 21 January, 2016; originally announced January 2016.

arXiv:1510.08029 [pdf, ps, other]

Sharp oracle inequalities for Least Squares estimators in shape restricted regression

Authors: Pierre C. Bellec

Abstract: The performance of Least Squares (LS) estimators is studied in isotonic, unimodal and convex regression. Our results have the form of sharp oracle inequalities that account for the model misspecification error. In isotonic and unimodal regression, the LS estimator achieves the nonparametric rate $n^{-2/3}$ as well as a parametric rate of order $k/n$ up to logarithmic factors, where $k$ is the number of constant pieces of the true parameter. In univariate convex regression, the LS estimator satisfies an adaptive risk bound of order $q/n$ up to logarithmic factors, where $q$ is the number of affine pieces of the true regression function. This adaptive risk bound holds for any design points. While Guntuboyina and Sen (2013) established that the nonparametric rate of convex regression is of order $n^{-4/5}$ for equispaced design points, we show that the nonparametric rate of convex regression can be as slow as $n^{-2/3}$ for some worst-case design points. This phenomenon can be explained as follows: Although convexity brings more structure than unimodality, for some worst-case design points this extra structure is uninformative and the nonparametric rates of unimodal regression and convex regression are both $n^{-2/3}$.

Submitted 7 August, 2016; v1 submitted 27 October, 2015; originally announced October 2015.

arXiv:1506.08724 [pdf, ps, other]

Sharp oracle bounds for monotone and convex regression through aggregation

Authors: Pierre C. Bellec, Alexandre B. Tsybakov

Abstract: We derive oracle inequalities for the problems of isotonic and convex regression using the combination of $Q$-aggregation procedure and sparsity pattern aggregation. This improves upon the previous results including the oracle inequalities for the constrained least squares estimator. One of the improvements is that our oracle inequalities are sharp, i.e., with leading constant 1. It allows us to obtain bounds for the minimax regret thus accounting for model misspecification, which was not possible based on the previous results. Another improvement is that we obtain oracle inequalities both with high probability and in expectation.

Submitted 30 September, 2015; v1 submitted 29 June, 2015; originally announced June 2015.

arXiv:1410.0346 [pdf, other]

doi 10.1214/17-AOS1540

Optimal bounds for aggregation of affine estimators

Authors: Pierre C. Bellec

Abstract: We study the problem of aggregation of estimators when the estimators are not independent of the data used for aggregation and no sample splitting is allowed. If the estimators are deterministic vectors, it is well known that the minimax rate of aggregation is of order $\log(M)$, where $M$ is the number of estimators to aggregate. It is proved that for affine estimators, the minimax rate of aggregation is unchanged: it is possible to handle the linear dependence between the affine estimators and the data used for aggregation at no extra cost. The minimax rate is not impacted either by the variance of the affine estimators, or any other measure of their statistical complexity. The minimax rate is attained with a penalized procedure over the convex hull of the estimators, for a penalty that is inspired from the $Q$-aggregation procedure. The results follow from the interplay between the penalty, strong convexity and concentration.

Submitted 27 February, 2018; v1 submitted 1 October, 2014; originally announced October 2014.

Comments: Published at https://projecteuclid.org/euclid.aos/1519268423 in the Annals of Statistics (http://imstat.org/aos/ ) by the Institute of Mathematical Statistics (http://imstat.org/ )

Journal ref: Ann. Statist. Volume 46, Number 1 (2018), 30-59

arXiv:1405.3907 [pdf, ps, other]

doi 10.3150/15-BEJ742

Optimal exponential bounds for aggregation of density estimators

Authors: Pierre C. Bellec

Abstract: We consider the problem of model selection type aggregation in the context of density estimation. We first show that empirical risk minimization is sub-optimal for this problem and it shares this property with the exponential weights aggregate, empirical risk minimization over the convex hull of the dictionary functions, and all selectors. Using a penalty inspired by recent works on the $Q$-aggregation procedure, we derive a sharp oracle inequality in deviation under a simple boundedness assumption and we show that the rate is optimal in a minimax sense. Unlike the procedures based on exponential weights, this estimator is fully adaptive under the uniform prior. In particular, its construction does not rely on the sup-norm of the unknown density. By providing lower bounds with exponential tails, we show that the deviation term appearing in the sharp oracle inequalities cannot be improved.

Submitted 28 September, 2016; v1 submitted 15 May, 2014; originally announced May 2014.

Comments: Published at http://dx.doi.org/10.3150/15-BEJ742 in the Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm)

Report number: IMS-BEJ-BEJ742

Journal ref: Bernoulli 2017, Vol. 23, No. 1, 219-248

Showing 1–32 of 32 results for author: Bellec, P C