-
One-Shot Safety Alignment for Large Language Models via Optimal Dualization
Authors:
Xinmeng Huang,
Shuo Li,
Edgar Dobriban,
Osbert Bastani,
Hamed Hassani,
Dongsheng Ding
Abstract:
The growing safety concerns surrounding Large Language Models (LLMs) raise an urgent need to align them with diverse human preferences to simultaneously enhance their helpfulness and safety. A promising approach is to enforce safety constraints through Reinforcement Learning from Human Feedback (RLHF). For such constrained RLHF, common Lagrangian-based primal-dual policy optimization methods are computationally expensive and often unstable. This paper presents a dualization perspective that reduces constrained alignment to an equivalent unconstrained alignment problem. We do so by pre-optimizing a smooth and convex dual function that has a closed form. This shortcut eliminates the need for cumbersome primal-dual policy iterations, thus greatly reducing the computational burden and improving training stability. Our strategy leads to two practical algorithms in model-based and preference-based scenarios (MoCAN and PeCAN, respectively). A broad range of experiments demonstrate the effectiveness of our methods.
Submitted 29 May, 2024;
originally announced May 2024.
-
Inference in Randomized Least Squares and PCA via Normality of Quadratic Forms
Authors:
Leda Wang,
Zhixiang Zhang,
Edgar Dobriban
Abstract:
Randomized algorithms can be used to speed up the analysis of large datasets. In this paper, we develop a unified methodology for statistical inference via randomized sketching or projections in two of the most fundamental problems in multivariate statistical analysis: least squares and PCA. The methodology applies to fixed datasets -- i.e., is data-conditional -- and the only randomness is due to the randomized algorithm. We propose statistical inference methods for a broad range of sketching distributions, such as the subsampled randomized Hadamard transform (SRHT), Sparse Sign Embeddings (SSE) and CountSketch, sketching matrices with i.i.d. entries, and uniform subsampling. To our knowledge, no comparable methods are available for SSE and for SRHT in PCA. Our novel theoretical approach rests on showing the asymptotic normality of certain quadratic forms. As a contribution of broader interest, we show central limit theorems for quadratic forms of the SRHT, relying on a novel proof via a dyadic expansion that leverages the recursive structure of the Hadamard transform. Numerical experiments using both synthetic and empirical datasets support the efficacy of our methods, and in particular suggest that sketching methods can have better computation-estimation tradeoffs than recently proposed optimal subsampling methods.
Submitted 1 April, 2024;
originally announced April 2024.
-
Minimax Optimal Fair Classification with Bounded Demographic Disparity
Authors:
Xianli Zeng,
Guang Cheng,
Edgar Dobriban
Abstract:
Mitigating the disparate impact of statistical machine learning methods is crucial for ensuring fairness. While extensive research aims to reduce disparity, the effect of using a \emph{finite dataset} -- as opposed to the entire population -- remains unclear. This paper explores the statistical foundations of fair binary classification with two protected groups, focusing on controlling demographic disparity, defined as the difference in acceptance rates between the groups. Although fairness may come at the cost of accuracy even with infinite data, we show that using a finite sample incurs additional costs due to the need to estimate group-specific acceptance thresholds. We study the minimax optimal classification error while constraining demographic disparity to a user-specified threshold. To quantify the impact of fairness constraints, we introduce a novel measure called \emph{fairness-aware excess risk} and derive a minimax lower bound on this measure that all classifiers must satisfy. Furthermore, we propose FairBayes-DDP+, a group-wise thresholding method with an offset that we show attains the minimax lower bound. Our lower bound proofs involve several innovations. Experiments support that FairBayes-DDP+ controls disparity at the user-specified level, while being faster and having a more favorable fairness-accuracy tradeoff than several baselines.
Submitted 26 March, 2024;
originally announced March 2024.
-
SymmPI: Predictive Inference for Data with Group Symmetries
Authors:
Edgar Dobriban,
Mengxin Yu
Abstract:
Quantifying the uncertainty of predictions is a core problem in modern statistics. Methods for predictive inference have been developed under a variety of assumptions, often -- for instance, in standard conformal prediction -- relying on the invariance of the distribution of the data under special groups of transformations such as permutation groups. Moreover, many existing methods for predictive inference aim to predict unobserved outcomes in sequences of feature-outcome observations. Meanwhile, there is interest in predictive inference under more general observation models (e.g., for partially observed features) and for data satisfying more general distributional symmetries (e.g., rotationally invariant or coordinate-independent observations in physics). Here we propose SymmPI, a methodology for predictive inference when data distributions have general group symmetries in arbitrary observation models. Our methods leverage the novel notion of distributional equivariant transformations, which process the data while preserving their distributional invariances. We show that SymmPI has valid coverage under distributional invariance and characterize its performance under distribution shift, recovering recent results as special cases. We apply SymmPI to predict unobserved values associated to vertices in a network, where the distribution is unchanged under relabelings that keep the network structure unchanged. In several simulations in a two-layer hierarchical model, and in an empirical data analysis example, SymmPI performs favorably compared to existing methods.
Submitted 28 December, 2023; v1 submitted 26 December, 2023;
originally announced December 2023.
-
Statistical Estimation Under Distribution Shift: Wasserstein Perturbations and Minimax Theory
Authors:
Patrick Chao,
Edgar Dobriban
Abstract:
Distribution shifts are a serious concern in modern statistical learning as they can systematically change the properties of the data away from the truth. We focus on Wasserstein distribution shifts, where every data point may undergo a slight perturbation, as opposed to the Huber contamination model where a fraction of observations are outliers. We consider perturbations that are either independent or coordinated joint shifts across data points. We analyze several important statistical problems, including location estimation, linear regression, and non-parametric density estimation. Under a squared loss for mean estimation and prediction error in linear regression, we find the exact minimax risk, a least favorable perturbation, and show that the sample mean and least squares estimators are respectively optimal. For other problems, we provide nearly optimal estimators and precise finite-sample bounds. We also introduce several tools for bounding the minimax risk under general distribution shifts, not just for Wasserstein perturbations, such as a smoothing technique for location families, and generalizations of classical tools including least favorable sequences of priors, the modulus of continuity, as well as Le Cam's, Fano's, and Assouad's methods.
Submitted 9 October, 2023; v1 submitted 3 August, 2023;
originally announced August 2023.
-
A Framework for Statistical Inference via Randomized Algorithms
Authors:
Zhixiang Zhang,
Sokbae Lee,
Edgar Dobriban
Abstract:
Randomized algorithms, such as randomized sketching or projections, are a promising approach to ease the computational burden in analyzing large datasets. However, randomized algorithms also produce non-deterministic outputs, leading to the problem of evaluating their accuracy. In this paper, we develop a statistical inference framework for quantifying the uncertainty of the outputs of randomized algorithms. We develop appropriate statistical methods -- sub-randomization, multi-run plug-in and multi-run aggregation inference -- by using multiple runs of the same randomized algorithm, or by estimating the unknown parameters of the limiting distribution. As an example, we develop methods for statistical inference for least squares parameters via random sketching using matrices with i.i.d. entries, or uniform partial orthogonal matrices. For this, we characterize the limiting distribution of estimators obtained via sketch-and-solve as well as partial sketching methods. The analysis of i.i.d. sketches uses a trigonometric interpolation argument to establish a differential equation for the limiting expected characteristic function and find the dependence on the kurtosis of the entries of the sketching matrix. The results are supported via a broad range of simulations.
Submitted 28 September, 2023; v1 submitted 20 July, 2023;
originally announced July 2023.
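To make the sketch-and-solve and partial sketching estimators named in the abstract above concrete, here is a minimal NumPy sketch. The function names, the Gaussian sketching matrix with entry variance 1/m, and the problem sizes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def gaussian_sketch(m, n, rng):
    """Assumed sketching matrix: i.i.d. N(0, 1/m) entries, shape (m, n)."""
    return rng.normal(scale=1.0 / np.sqrt(m), size=(m, n))

def sketch_and_solve(X, y, S):
    """Least squares on the sketched data (S X, S y)."""
    beta, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)
    return beta

def partial_sketch(X, y, S):
    """Sketch only the Gram matrix; keep X^T y unsketched."""
    SX = S @ X
    return np.linalg.solve(SX.T @ SX, X.T @ y)

# Hypothetical toy problem: n observations, p features, sketch size m.
rng = np.random.default_rng(0)
n, p, m = 500, 5, 100
X = rng.normal(size=(n, p))
beta_true = np.arange(1.0, p + 1.0)
y = X @ beta_true + 0.1 * rng.normal(size=n)

S = gaussian_sketch(m, n, rng)
beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
beta_ss = sketch_and_solve(X, y, S)
beta_ps = partial_sketch(X, y, S)
print("full least squares:", beta_full)
print("sketch-and-solve:  ", beta_ss)
print("partial sketching: ", beta_ps)
```

Rerunning with a fresh `S` gives different estimates each time; that run-to-run variability, conditional on the fixed data, is the uncertainty the abstract's inference methods aim to quantify.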
-
Efficient and Multiply Robust Risk Estimation under General Forms of Dataset Shift
Authors:
Hongxiang Qiu,
Eric Tchetgen Tchetgen,
Edgar Dobriban
Abstract:
Statistical machine learning methods often face the challenge of limited data available from the population of interest. One remedy is to leverage data from auxiliary source populations, which share some conditional distributions or are linked in other ways with the target domain. Techniques leveraging such \emph{dataset shift} conditions are known as \emph{domain adaptation} or \emph{transfer learning}. Despite extensive literature on dataset shift, limited works address how to efficiently use the auxiliary populations to improve the accuracy of risk evaluation for a given machine learning task in the target population.
In this paper, we study the general problem of efficiently estimating target population risk under various dataset shift conditions, leveraging semiparametric efficiency theory. We consider a general class of dataset shift conditions, which includes three popular conditions -- covariate, label and concept shift -- as special cases. We allow for partially non-overlapping support between the source and target populations. We develop efficient and multiply robust estimators along with a straightforward specification test of these dataset shift conditions. We also derive efficiency bounds for two other dataset shift conditions, posterior drift and location-scale shift. Simulation studies support the efficiency gains due to leveraging plausible dataset shift conditions.
Submitted 7 June, 2024; v1 submitted 28 June, 2023;
originally announced June 2023.
-
Sharp-SSL: Selective high-dimensional axis-aligned random projections for semi-supervised learning
Authors:
Tengyao Wang,
Edgar Dobriban,
Milana Gataric,
Richard J. Samworth
Abstract:
We propose a new method for high-dimensional semi-supervised learning problems based on the careful aggregation of the results of a low-dimensional procedure applied to many axis-aligned random projections of the data. Our primary goal is to identify important variables for distinguishing between the classes; existing low-dimensional methods can then be applied for final class assignment. Motivated by a generalized Rayleigh quotient, we score projections according to the traces of the estimated whitened between-class covariance matrices on the projected data. This enables us to assign an importance weight to each variable for a given projection, and to select our signal variables by aggregating these weights over high-scoring projections. Our theory shows that the resulting Sharp-SSL algorithm is able to recover the signal coordinates with high probability when we aggregate over sufficiently many random projections and when the base procedure estimates the whitened between-class covariance matrix sufficiently well. The Gaussian EM algorithm is a natural choice as a base procedure, and we provide a new analysis of its performance in semi-supervised settings that controls the parameter estimation error in terms of the proportion of labeled data in the sample. Numerical results on both simulated data and a real colon tumor dataset support the excellent empirical performance of the method.
Submitted 18 April, 2023;
originally announced April 2023.
-
Conformal Frequency Estimation using Discrete Sketched Data with Coverage for Distinct Queries
Authors:
Matteo Sesia,
Stefano Favaro,
Edgar Dobriban
Abstract:
This paper develops conformal inference methods to construct a confidence interval for the frequency of a queried object in a very large discrete data set, based on a sketch with a lower memory footprint. This approach requires no knowledge of the data distribution and can be combined with any sketching algorithm, including but not limited to the renowned count-min sketch, the count-sketch, and variations thereof. After explaining how to achieve marginal coverage for exchangeable random queries, we extend our solution to provide stronger inferences that can account for the discreteness of the data and for heterogeneous query frequencies, increasing also robustness to possible distribution shifts. These results are facilitated by a novel conformal calibration technique that guarantees valid coverage for a large fraction of distinct random queries. Finally, we show our methods have improved empirical performance compared to existing frequentist and Bayesian alternatives in simulations as well as in examples of text and SARS-CoV-2 DNA data.
Submitted 15 August, 2023; v1 submitted 8 November, 2022;
originally announced November 2022.
-
PAC-Wrap: Semi-Supervised PAC Anomaly Detection
Authors:
Shuo Li,
Xiayan Ji,
Edgar Dobriban,
Oleg Sokolsky,
Insup Lee
Abstract:
Anomaly detection is essential for preventing hazardous outcomes for safety-critical applications like autonomous driving. Given their safety-criticality, these applications benefit from provable bounds on various errors in anomaly detection. To achieve this goal in the semi-supervised setting, we propose to provide Probably Approximately Correct (PAC) guarantees on the false negative and false positive detection rates for anomaly detection algorithms. Our method (PAC-Wrap) can wrap around virtually any existing semi-supervised and unsupervised anomaly detection method, endowing it with rigorous guarantees. Our experiments with various anomaly detectors and datasets indicate that PAC-Wrap is broadly effective.
Submitted 21 June, 2022; v1 submitted 22 May, 2022;
originally announced May 2022.
-
Prediction Sets Adaptive to Unknown Covariate Shift
Authors:
Hongxiang Qiu,
Edgar Dobriban,
Eric Tchetgen Tchetgen
Abstract:
Predicting sets of outcomes -- instead of unique outcomes -- is a promising solution to uncertainty quantification in statistical learning. Despite a rich literature on constructing prediction sets with statistical guarantees, adapting to unknown covariate shift -- a prevalent issue in practice -- poses a serious unsolved challenge. In this paper, we show that prediction sets with finite-sample coverage guarantee are uninformative and propose a novel flexible distribution-free method, PredSet-1Step, to efficiently construct prediction sets with an asymptotic coverage guarantee under unknown covariate shift. We formally show that our method is \textit{asymptotically probably approximately correct}, having well-calibrated coverage error with high confidence for large samples. We illustrate that it achieves nominal coverage in a number of experiments and a data set concerning HIV risk prediction in a South African cohort study. Our theory hinges on a new bound for the convergence rate of the coverage of Wald confidence intervals based on general asymptotically linear estimators.
Submitted 17 June, 2023; v1 submitted 11 March, 2022;
originally announced March 2022.
-
Comparing Classes of Estimators: When does Gradient Descent Beat Ridge Regression in Linear Models?
Authors:
Dominic Richards,
Edgar Dobriban,
Patrick Rebeschini
Abstract:
Methods for learning from data depend on various types of tuning parameters, such as penalization strength or step size. Since performance can depend strongly on these parameters, it is important to compare classes of estimators -- by considering prescribed finite sets of tuning parameters -- not just particularly tuned methods. In this work, we investigate classes of methods via the relative performance of the best method in the class. We consider the central problem of linear regression -- with a random isotropic ground truth -- and investigate the estimation performance of two fundamental methods, gradient descent and ridge regression. We unveil the following phenomena. (1) For general designs, constant stepsize gradient descent outperforms ridge regression when the eigenvalues of the empirical data covariance matrix decay slowly, as a power law with exponent less than unity. If instead the eigenvalues decay quickly, as a power law with exponent greater than unity or exponentially, we show that ridge regression outperforms gradient descent. (2) For orthogonal designs, we compute the exact minimax optimal class of estimators (achieving min-max-min optimality), showing it is equivalent to gradient descent with decaying learning rate. We find the sub-optimality of ridge regression and gradient descent with constant step size. Our results highlight that statistical performance can depend strongly on tuning parameters. In particular, while optimally tuned ridge regression is the best estimator in our setting, it can be outperformed by gradient descent by an arbitrary/unbounded amount when both methods are only tuned over finitely many regularization parameters.
Submitted 12 June, 2022; v1 submitted 26 August, 2021;
originally announced August 2021.
-
Consistency of invariance-based randomization tests
Authors:
Edgar Dobriban
Abstract:
Invariance-based randomization tests -- such as permutation tests, rotation tests, or sign changes -- are an important and widely used class of statistical methods. They allow drawing inferences under weak assumptions on the data distribution. Most work focuses on their type I error control properties, while their consistency properties are much less understood.
We develop a general framework and a set of results on the consistency of invariance-based randomization tests in signal-plus-noise models. Our framework is grounded in the deep mathematical area of representation theory. We allow the transforms to be general compact topological groups, such as rotation groups, acting by general linear group representations. We study test statistics with a generalized sub-additivity property. We apply our framework to a number of fundamental and highly important problems in statistics, including sparse vector detection, testing for low-rank matrices in noise, sparse detection in linear regression, and two-sample testing. Comparing with minimax lower bounds, we find perhaps surprisingly that in some cases, randomization tests detect signals at the minimax optimal rate.
Submitted 20 December, 2021; v1 submitted 25 April, 2021;
originally announced April 2021.
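As an illustration of the kind of invariance-based randomization test the abstract above refers to, the following is a minimal sketch of a sign-flip test. The function name `signflip_test`, the absolute-mean statistic, and the number of random draws are assumptions chosen for this example; the paper's framework covers far more general groups and statistics.

```python
import numpy as np

def signflip_test(x, statistic, n_draws=999, seed=0):
    """Randomization p-value under the sign-flip group {-1, +1}^n."""
    rng = np.random.default_rng(seed)
    observed = statistic(x)
    exceed = 0
    for _ in range(n_draws):
        # Draw a random group element and apply it to the data.
        signs = rng.choice([-1.0, 1.0], size=x.shape)
        if statistic(signs * x) >= observed:
            exceed += 1
    # Counting the identity transform keeps the p-value valid.
    return (1 + exceed) / (1 + n_draws)

def abs_mean(v):
    """Hypothetical test statistic: absolute value of the sample mean."""
    return abs(v.mean())

rng = np.random.default_rng(1)
noise = rng.normal(size=30)          # symmetric noise, no signal
p_null = signflip_test(noise, abs_mean)
p_signal = signflip_test(noise + 5.0, abs_mean)  # strong mean shift
print("p-value without signal:", p_null)
print("p-value with signal:   ", p_signal)
```

Under the null, the noise distribution is unchanged by sign flips, so the observed statistic is exchangeable with the randomized ones. With a strong mean shift the observed statistic dominates nearly every sign-flipped copy and the p-value falls to its minimum of 1/(n_draws + 1).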
-
Selecting the number of components in PCA via random signflips
Authors:
David Hong,
Yue Sheng,
Edgar Dobriban
Abstract:
Principal component analysis (PCA) is a foundational tool in modern data analysis, and a crucial step in PCA is selecting the number of components to keep. However, classical selection methods (e.g., scree plots, parallel analysis, etc.) lack statistical guarantees in the increasingly common setting of large-dimensional data with heterogeneous noise, i.e., where each entry may have a different noise variance. Moreover, it turns out that these methods, which are highly effective for homogeneous noise, can fail dramatically for data with heterogeneous noise. This paper proposes a new method called signflip parallel analysis (FlipPA) for the setting of approximately symmetric noise: it compares the data singular values to those of "empirical null" matrices generated by flipping the sign of each entry randomly with probability one-half. We develop a rigorous theory for FlipPA, showing that it has nonasymptotic type I error control and that it consistently selects the correct rank for signals rising above the noise floor in the large-dimensional limit (even when the noise is heterogeneous). We also rigorously explain why classical permutation-based parallel analysis degrades under heterogeneous noise. Finally, we illustrate that FlipPA compares favorably to state-of-the-art methods via numerical simulations and an illustration on data coming from astronomy.
Submitted 25 May, 2024; v1 submitted 5 December, 2020;
originally announced December 2020.
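The comparison described in the abstract above (data singular values against those of sign-flipped "empirical null" matrices) can be sketched as follows. This is a rough illustration only: the function name `flip_pa`, the number of trials, the 0.95 quantile, and the sequential stopping rule are assumptions made for this example and may differ from the paper's exact procedure.

```python
import numpy as np

def flip_pa(X, n_trials=50, quantile=0.95, seed=0):
    """Pick a rank by comparing singular values of X to sign-flipped nulls."""
    rng = np.random.default_rng(seed)
    data_sv = np.linalg.svd(X, compute_uv=False)
    null_sv = np.empty((n_trials, data_sv.size))
    for t in range(n_trials):
        # Flip the sign of each entry independently with probability 1/2.
        signs = rng.choice([-1.0, 1.0], size=X.shape)
        null_sv[t] = np.linalg.svd(signs * X, compute_uv=False)
    # Per-index threshold: a quantile of the k-th null singular value.
    thresholds = np.quantile(null_sv, quantile, axis=0)
    k = 0
    # Assumed rule: keep components until one fails to exceed its threshold.
    while k < data_sv.size and data_sv[k] > thresholds[k]:
        k += 1
    return k

# Hypothetical example: a rank-2 signal plus standard Gaussian noise.
rng = np.random.default_rng(0)
n, p = 200, 50
u1 = np.ones(n) / np.sqrt(n)
u2 = np.tile([1.0, -1.0], n // 2) / np.sqrt(n)
v1 = np.ones(p) / np.sqrt(p)
v2 = np.tile([1.0, -1.0], p // 2) / np.sqrt(p)
signal = 200.0 * np.outer(u1, v1) + 150.0 * np.outer(u2, v2)
X = signal + rng.normal(size=(n, p))
print("selected number of components:", flip_pa(X))
```

Random sign flips destroy the low-rank signal while leaving symmetric noise unchanged in distribution, even when the noise variance differs from entry to entry. That is why the flipped matrices can serve as a null reference in the heterogeneous setting.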
-
What causes the test error? Going beyond bias-variance via ANOVA
Authors:
Licong Lin,
Edgar Dobriban
Abstract:
Modern machine learning methods are often overparametrized, allowing adaptation to the data at a fine level. This can seem puzzling; in the worst case, such models do not need to generalize. This puzzle inspired a great amount of work, arguing when overparametrization reduces test error, in a phenomenon called "double descent". Recent work aimed to understand in greater depth why overparametrization is helpful for generalization. This leads to discovering the unimodality of variance as a function of the level of parametrization, and to decomposing the variance into that arising from label noise, initialization, and randomness in the training data to understand the sources of the error.
In this work we develop a deeper understanding of this area. Specifically, we propose using the analysis of variance (ANOVA) to decompose the variance in the test error in a symmetric way, for studying the generalization performance of certain two-layer linear and non-linear networks. The advantage of the analysis of variance is that it reveals the effects of initialization, label noise, and training data more clearly than prior approaches. Moreover, we also study the monotonicity and unimodality of the variance components. While prior work studied the unimodality of the overall variance, we study the properties of each term in variance decomposition.
One key insight is that in typical settings, the interaction between training samples and initialization can dominate the variance; surprisingly being larger than their marginal effect. Also, we characterize "phase transitions" where the variance changes from unimodal to monotone. On a technical level, we leverage advanced deterministic equivalent techniques for Haar random matrices, that -- to our knowledge -- have not yet been used in the area. We also verify our results in numerical simulations and on empirical data examples.
Submitted 9 June, 2021; v1 submitted 11 October, 2020;
originally announced October 2020.
-
How to reduce dimension with PCA and random projections?
Authors:
Fan Yang,
Sifan Liu,
Edgar Dobriban,
David P. Woodruff
Abstract:
In our "big data" age, the size and complexity of data are steadily increasing. Methods for dimension reduction are ever more popular and useful. Two distinct types of dimension reduction are "data-oblivious" methods such as random projections and sketching, and "data-aware" methods such as principal component analysis (PCA). Both have their strengths, such as speed for random projections, and data-adaptivity for PCA. In this work, we study how to combine them to get the best of both. We study "sketch and solve" methods that take a random projection (or sketch) first, and compute PCA after. We compute the performance of several popular sketching methods (random i.i.d. projections, random sampling, subsampled Hadamard transform, count sketch, etc.) in a general "signal-plus-noise" (or spiked) data model. Compared to well-known works, our results (1) give asymptotically exact results, and (2) apply when the signal components are only slightly above the noise, but the projection dimension is non-negligible. We also study stronger signals allowing more general covariance structures. We find that (a) signal strength decreases under projection in a delicate way depending on the structure of the data and the sketching method, (b) orthogonal projections are more accurate, (c) randomization does not hurt too much, due to concentration of measure, (d) count sketch can be improved by a normalization method. Our results have implications for statistical learning and data analysis. We also illustrate that the results are highly accurate in simulations and in analyzing empirical data.
Submitted 28 March, 2021; v1 submitted 1 May, 2020;
originally announced May 2020.
-
The Implicit Regularization of Stochastic Gradient Flow for Least Squares
Authors:
Alnur Ali,
Edgar Dobriban,
Ryan J. Tibshirani
Abstract:
We study the implicit regularization of mini-batch stochastic gradient descent, when applied to the fundamental problem of least squares regression. We leverage a continuous-time stochastic differential equation having the same moments as stochastic gradient descent, which we call stochastic gradient flow. We give a bound on the excess risk of stochastic gradient flow at time $t$, over ridge regression with tuning parameter $\lambda = 1/t$. The bound may be computed from explicit constants (e.g., the mini-batch size, step size, number of iterations), revealing precisely how these quantities drive the excess risk. Numerical examples show the bound can be small, indicating a tight relationship between the two estimators. We give a similar result relating the coefficients of stochastic gradient flow and ridge. These results hold under no conditions on the data matrix $X$, and across the entire optimization path (not just at convergence).
Submitted 19 June, 2020; v1 submitted 17 March, 2020;
originally announced March 2020.
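To make the correspondence between time $t$ and the ridge parameter $\lambda = 1/t$ in the abstract above concrete, here is a minimal sketch. It uses the deterministic gradient flow, which has a closed form for least squares, rather than the stochastic gradient flow analyzed in the paper; the function names, the loss scaling $\frac{1}{2n}\|y - Xb\|^2$, and the simulated data are assumptions made for illustration.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge solution for the loss (1/(2n))||y - X b||^2 + (lam/2)||b||^2."""
    n, p = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)

def gradient_flow(X, y, t):
    """Closed-form gradient flow on (1/(2n))||y - X b||^2, started at b = 0."""
    n = X.shape[0]
    s, V = np.linalg.eigh(X.T @ X / n)        # eigenvalues of the Gram matrix
    g = V.T @ (X.T @ y / n)
    safe_s = np.where(s > 1e-12, s, 1.0)
    # Spectral filter (1 - exp(-t s)) / s, with its limit t as s -> 0.
    filt = np.where(s > 1e-12, -np.expm1(-t * safe_s) / safe_s, t)
    return V @ (filt * g)

# Hypothetical simulated data, only to compare the two paths.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
y = X @ np.ones(5) + rng.standard_normal(100)
for t in (0.1, 1.0, 10.0):
    gap = np.linalg.norm(gradient_flow(X, y, t) - ridge(X, y, 1.0 / t))
    print(f"t = {t:5.1f}   ||flow(t) - ridge(1/t)|| = {gap:.4f}")
```

In the eigenbasis of the Gram matrix, the flow applies the filter $(1 - e^{-ts})/s$ while ridge applies $1/(s + 1/t)$; both shrink toward zero for small $t$ and approach least squares as $t$ grows.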
-
Optimal Iterative Sketching with the Subsampled Randomized Hadamard Transform
Authors:
Jonathan Lacotte,
Sifan Liu,
Edgar Dobriban,
Mert Pilanci
Abstract:
Random projections or sketching are widely used in many algorithmic and learning contexts. Here we study the performance of iterative Hessian sketch for least-squares problems. By leveraging and extending recent results from random matrix theory on the limiting spectrum of matrices randomly projected with the subsampled randomized Hadamard transform, and truncated Haar matrices, we can study and compare the resulting algorithms to a level of precision that has not been possible before. Our technical contributions include a novel formula for the second moment of the inverse of projected matrices. We also find simple closed-form expressions for asymptotically optimal step-sizes and convergence rates. These show that the convergence rates for Haar and randomized Hadamard matrices are identical, and asymptotically improve upon Gaussian random projections. These techniques may be applied to other algorithms that employ randomized dimension reduction.
Submitted 23 October, 2020; v1 submitted 3 February, 2020;
originally announced February 2020.
-
Implicit Regularization and Convergence for Weight Normalization
Authors:
Xiaoxia Wu,
Edgar Dobriban,
Tongzheng Ren,
Shanshan Wu,
Zhiyuan Li,
Suriya Gunasekar,
Rachel Ward,
Qiang Liu
Abstract:
Normalization methods such as batch [Ioffe and Szegedy, 2015], weight [Salimans and Kingma, 2016], instance [Ulyanov et al., 2016], and layer normalization [Ba et al., 2016] have been widely used in modern machine learning. Here, we study the weight normalization (WN) method [Salimans and Kingma, 2016] and a variant called reparametrized projected gradient descent (rPGD) for overparametrized least-squares regression. WN and rPGD reparametrize the weights with a scale g and a unit vector w and thus the objective function becomes non-convex. We show that this non-convex formulation has beneficial regularization effects compared to gradient descent on the original objective. These methods adaptively regularize the weights and converge close to the minimum l2 norm solution, even for initializations far from zero. For certain stepsizes of g and w, we show that they can converge close to the minimum norm solution. This is different from the behavior of gradient descent, which converges to the minimum norm solution only when started at a point in the range space of the feature matrix, and is thus more sensitive to initialization.
Submitted 30 August, 2022; v1 submitted 18 November, 2019;
originally announced November 2019.
-
Ridge Regression: Structure, Cross-Validation, and Sketching
Authors:
Sifan Liu,
Edgar Dobriban
Abstract:
We study the following three fundamental problems about ridge regression: (1) what is the structure of the estimator? (2) how to correctly use cross-validation to choose the regularization parameter? and (3) how to accelerate computation without losing too much accuracy? We consider the three problems in a unified large-data linear model. We give a precise representation of ridge regression as a covariance matrix-dependent linear combination of the true parameter and the noise. We study the bias of $K$-fold cross-validation for choosing the regularization parameter, and propose a simple bias-correction. We analyze the accuracy of primal and dual sketching for ridge regression, showing they are surprisingly accurate. Our results are illustrated by simulations and by analyzing empirical data.
Submitted 29 March, 2020; v1 submitted 6 October, 2019;
originally announced October 2019.
-
A Group-Theoretic Framework for Data Augmentation
Authors:
Shuxiao Chen,
Edgar Dobriban,
Jane H Lee
Abstract:
Data augmentation is a widely used trick when training deep neural networks: in addition to the original data, properly transformed data are also added to the training set. However, to the best of our knowledge, a clear mathematical framework to explain the performance benefits of data augmentation is not available. In this paper, we develop such a theoretical framework. We show data augmentation is equivalent to an averaging operation over the orbits of a certain group that keeps the data distribution approximately invariant. We prove that it leads to variance reduction. We study empirical risk minimization, and the examples of exponential families, linear regression, and certain two-layer neural networks. We also discuss how data augmentation could be used in problems with symmetry where other approaches are prevalent, such as in cryo-electron microscopy (cryo-EM).
Submitted 6 November, 2020; v1 submitted 25 July, 2019;
originally announced July 2019.
-
WONDER: Weighted one-shot distributed ridge regression in high dimensions
Authors:
Edgar Dobriban,
Yue Sheng
Abstract:
In many areas, practitioners need to analyze large datasets that challenge conventional single-machine computing. To scale up data analysis, distributed and parallel computing approaches are increasingly needed.
Here we study a fundamental and highly important problem in this area: How to do ridge regression in a distributed computing environment? Ridge regression is an extremely popular method for supervised learning, and has several optimality properties, thus it is important to study. We study one-shot methods that construct weighted combinations of ridge regression estimators computed on each machine. By analyzing the mean squared error in a high dimensional random-effects model where each predictor has a small effect, we discover several new phenomena.
1. Infinite-worker limit: The distributed estimator works well for very large numbers of machines, a phenomenon we call "infinite-worker limit".
2. Optimal weights: The optimal weights for combining local estimators sum to more than unity, due to the downward bias of ridge. Thus, all averaging methods are suboptimal.
We also propose a new Weighted ONe-shot DistributEd Ridge regression (WONDER) algorithm. We test WONDER in simulation studies and using the Million Song Dataset as an example. There it can save at least 100x in computation time, while nearly preserving test accuracy.
Submitted 19 February, 2020; v1 submitted 21 March, 2019;
originally announced March 2019.
-
Asymptotics for Sketching in Least Squares Regression
Authors:
Edgar Dobriban,
Sifan Liu
Abstract:
We consider a least squares regression problem where the data has been generated from a linear model, and we are interested in learning the unknown regression parameters. We consider "sketch-and-solve" methods that randomly project the data first, and do regression after. Previous works have analyzed the statistical and computational performance of such methods. However, the existing analysis is not fine-grained enough to show the fundamental differences between various methods, such as the Subsampled Randomized Hadamard Transform (SRHT) and Gaussian projections. In this paper, we make progress on this problem, working in an asymptotic framework where the number of datapoints and the dimension of features go to infinity. We find the limits of the accuracy loss (for estimation and test error) incurred by popular sketching methods. We show separation between different methods, so that SRHT is better than Gaussian projections. Our theoretical results are verified on both real and synthetic data. The analysis of SRHT relies on novel methods from random matrix theory that may be of independent interest.
Submitted 6 October, 2019; v1 submitted 14 October, 2018;
originally announced October 2018.
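The "sketch-and-solve" idea in the abstract above (project the data randomly, then run regression on the smaller problem) can be sketched as follows. This uses a Gaussian projection only; the SRHT variant is not implemented here. The function name `sketch_and_solve`, the sketch scaling, and the simulated data are assumptions made for illustration.

```python
import numpy as np

def sketch_and_solve(X, y, r, seed=0):
    """Least squares on Gaussian-sketched data (S X, S y), with S of shape (r, n)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    S = rng.standard_normal((r, n)) / np.sqrt(r)   # Gaussian sketching matrix
    beta, *_ = np.linalg.lstsq(S @ X, S @ y, rcond=None)
    return beta

# Hypothetical simulated linear model.
rng = np.random.default_rng(1)
n, p, r = 2000, 10, 200
beta_true = rng.standard_normal(p)
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)

beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_sketch = sketch_and_solve(X, y, r)
print("full-data estimation error:", np.linalg.norm(beta_full - beta_true))
print("sketched estimation error: ", np.linalg.norm(beta_sketch - beta_true))
```

The sketched problem has only `r` rows, so the solve is cheaper; the price is the accuracy loss whose limits the abstract refers to.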
-
Distributed linear regression by averaging
Authors:
Edgar Dobriban,
Yue Sheng
Abstract:
Distributed statistical learning problems arise commonly when dealing with large datasets. In this setup, datasets are partitioned over machines, which compute locally, and communicate short messages. Communication is often the bottleneck. In this paper, we study one-step and iterative weighted parameter averaging in statistical linear models under data parallelism. We do linear regression on each machine, send the results to a central server, and take a weighted average of the parameters. Optionally, we iterate, sending back the weighted average and doing local ridge regressions centered at it. How does this work compared to doing linear regression on the full data? Here we study the performance loss in estimation, test error, and confidence interval length in high dimensions, where the number of parameters is comparable to the training data size. We find the performance loss in one-step weighted averaging, and also give results for iterative averaging. We also find that different problems are affected differently by the distributed framework. Estimation error and confidence interval length increase a lot, while prediction error increases much less. We rely on recent results from random matrix theory, where we develop a new calculus of deterministic equivalents as a tool of broader interest.
Submitted 22 October, 2022; v1 submitted 30 September, 2018;
originally announced October 2018.
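The one-step procedure in the abstract above (local regression on each machine, then a weighted average at a central server) can be sketched as follows. The function name `distributed_ols` and the default weights, proportional to local sample size, are assumptions made for illustration; the paper studies how the weights should be chosen, which is not reproduced here.

```python
import numpy as np

def distributed_ols(X, y, k, weights=None):
    """One-step weighted parameter averaging of local least-squares fits."""
    X_parts = np.array_split(X, k)          # rows partitioned over k "machines"
    y_parts = np.array_split(y, k)
    local = np.array([
        np.linalg.lstsq(Xi, yi, rcond=None)[0]
        for Xi, yi in zip(X_parts, y_parts)
    ])                                       # shape (k, p): one estimate per machine
    if weights is None:
        # Assumption: weight each machine by its share of the samples.
        sizes = np.array([len(yi) for yi in y_parts], dtype=float)
        weights = sizes / sizes.sum()
    return np.asarray(weights) @ local       # weighted average of parameters

# Hypothetical simulated data.
rng = np.random.default_rng(0)
n, p, k = 1000, 5, 10
beta_true = rng.standard_normal(p)
X = rng.standard_normal((n, p))
y = X @ beta_true + rng.standard_normal(n)

beta_full = np.linalg.lstsq(X, y, rcond=None)[0]
beta_dist = distributed_ols(X, y, k)
print("full-data error:  ", np.linalg.norm(beta_full - beta_true))
print("distributed error:", np.linalg.norm(beta_dist - beta_true))
```

Only the `k` local parameter vectors are communicated, which is the "short messages" setting the abstract describes.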
-
Robust Inference Under Heteroskedasticity via the Hadamard Estimator
Authors:
Edgar Dobriban,
Weijie J. Su,
Yachong Yang,
Zhixiang Zhang
Abstract:
Drawing statistical inferences from large datasets in a model-robust way is an important problem in statistics and data science. In this paper, we propose methods that are robust to large and unequal noise in different observational units (i.e., heteroskedasticity) for statistical inference in linear regression. We leverage the Hadamard estimator, which is unbiased for the variances of ordinary least-squares regression. This is in contrast to the popular White's sandwich estimator, which can be substantially biased in high dimensions. We propose to estimate the signal strength, noise level, signal-to-noise ratio, and mean squared error via the Hadamard estimator. We develop a new degrees of freedom adjustment that gives more accurate confidence intervals than variants of White's sandwich estimator. Moreover, we provide conditions ensuring the estimator is well-defined, by studying a new random matrix ensemble in which the entries of a random orthogonal projection matrix are squared. We also show approximate normality, using the second-order Poincaré inequality. Our work provides improved statistical theory and methods for linear regression in high dimensions.
Submitted 9 January, 2024; v1 submitted 1 July, 2018;
originally announced July 2018.
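A minimal sketch of a Hadamard-type variance estimator, as read from the abstract above, is given here. It rests on the identity $E[\hat\varepsilon_i^2] = \sum_j Q_{ij}^2 \sigma_j^2$ for the residual-maker matrix $Q = I - X(X^\top X)^{-1}X^\top$, so that inverting the elementwise square $Q \circ Q$ gives unbiased estimates of the noise variances. The function name and the simulated data are assumptions made for illustration, and the paper's degrees-of-freedom adjustment is not included.

```python
import numpy as np

def hadamard_variance_estimates(X, resid_sq):
    """Estimate Var(beta_hat_j) for OLS under heteroskedastic noise.

    resid_sq holds the squared OLS residuals, shape (n,).
    """
    n, p = X.shape
    A = np.linalg.solve(X.T @ X, X.T)        # (p, n): beta_hat = A @ y
    Q = np.eye(n) - X @ A                    # residual-maker (projection) matrix
    # Solve (Q o Q) sigma2 = resid_sq, where "o" is the elementwise product.
    sigma2_hat = np.linalg.solve(Q * Q, resid_sq)
    return (A * A) @ sigma2_hat              # Var(beta_hat_j) = sum_i A_ji^2 sigma_i^2

# Hypothetical heteroskedastic data.
rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.standard_normal((n, p))
sigma2 = rng.uniform(0.5, 2.0, size=n)       # unequal noise variances
y = X @ np.ones(p) + np.sqrt(sigma2) * rng.standard_normal(n)

beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
var_hat = hadamard_variance_estimates(X, (y - X @ beta_hat) ** 2)
print("estimated variances of the OLS coordinates:", var_hat)
```

The linear solve requires $Q \circ Q$ to be invertible, which is the well-definedness question the abstract mentions; individual estimates can be negative in a given sample even though they are unbiased.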
-
Permutation methods for factor analysis and PCA
Authors:
Edgar Dobriban
Abstract:
Researchers often have datasets measuring features $x_{ij}$ of samples, such as test scores of students. In factor analysis and PCA, these features are thought to be influenced by unobserved factors, such as skills. Can we determine how many components affect the data? This is an important problem, because it has a large impact on all downstream data analysis. Consequently, many approaches have been developed to address it. Parallel Analysis is a popular permutation method. It works by randomly scrambling each feature of the data. It selects components if their singular values are larger than those of the permuted data. Despite widespread use in leading textbooks and scientific publications, as well as empirical evidence for its accuracy, it currently has no theoretical justification.
In this paper, we show that the parallel analysis permutation method consistently selects the large components in certain high-dimensional factor models. However, it does not select the smaller components. The intuition is that permutations keep the noise invariant, while "destroying" the low-rank signal. This provides justification for permutation methods in PCA and factor models under some conditions. Our work uncovers drawbacks of permutation methods, and paves the way to improvements.
Submitted 13 September, 2019; v1 submitted 2 October, 2017;
originally announced October 2017.
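The permutation procedure described in the abstract above (scramble each feature independently, then keep components whose singular values exceed those of the permuted data) can be sketched as follows. The function name `parallel_analysis`, the number of permutations, the 95% quantile threshold, and the sequential stopping rule are assumptions made for illustration; implementations of Parallel Analysis differ on these details.

```python
import numpy as np

def parallel_analysis(X, n_perm=20, quantile=0.95, seed=0):
    """Select the number of components by comparison with column-permuted data."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    sv = np.linalg.svd(X, compute_uv=False)
    perm_sv = np.empty((n_perm, len(sv)))
    for b in range(n_perm):
        # Permute each feature (column) independently of the others.
        Xp = np.column_stack([rng.permutation(X[:, j]) for j in range(p)])
        perm_sv[b] = np.linalg.svd(Xp, compute_uv=False)
    thresh = np.quantile(perm_sv, quantile, axis=0)
    k = 0
    # Keep leading components while they exceed the permutation threshold.
    while k < len(sv) and sv[k] > thresh[k]:
        k += 1
    return k

# Hypothetical two-factor model with strong factors plus unit-variance noise.
rng = np.random.default_rng(1)
n, p = 200, 30
F = rng.standard_normal((n, 2))                                   # latent factors
V = 3.0 * np.column_stack([np.ones(p), np.tile([1.0, -1.0], p // 2)])  # loadings
X = F @ V.T + rng.standard_normal((n, p))
print("selected number of components:", parallel_analysis(X))
```

Permuting the columns preserves each feature's marginal distribution while breaking the correlations that carry the low-rank signal, which matches the intuition stated in the abstract.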
-
Optimal prediction in the linearly transformed spiked model
Authors:
Edgar Dobriban,
William Leeb,
Amit Singer
Abstract:
We consider the linearly transformed spiked model, where observations $Y_i$ are noisy linear transforms of unobserved signals of interest $X_i$: \begin{align*}
Y_i = A_i X_i + \varepsilon_i, \end{align*} for $i=1,\ldots,n$. The transform matrices $A_i$ are also observed. We model $X_i$ as random vectors lying on an unknown low-dimensional space. How should we predict the unobserved signals (regression coefficients) $X_i$?
The naive approach of performing regression for each observation separately is inaccurate due to the large noise. Instead, we develop optimal linear empirical Bayes methods for predicting $X_i$ by "borrowing strength" across the different samples. Our methods are applicable to large datasets and rely on weak moment assumptions. The analysis is based on random matrix theory.
We discuss applications to signal processing, deconvolution, cryo-electron microscopy, and missing data in the high-noise regime. For missing data, we show in simulations that our methods are faster, more robust to noise and to unequal sampling than well-known matrix completion methods.
Submitted 11 July, 2018; v1 submitted 7 September, 2017;
originally announced September 2017.
-
PCA from noisy, linearly reduced data: the diagonal case
Authors:
Edgar Dobriban,
William Leeb,
Amit Singer
Abstract:
Suppose we observe data of the form $Y_i = D_i (S_i + \varepsilon_i) \in \mathbb{R}^p$ or $Y_i = D_i S_i + \varepsilon_i \in \mathbb{R}^p$, $i=1,\ldots,n$, where $D_i \in \mathbb{R}^{p\times p}$ are known diagonal matrices, $\varepsilon_i$ are noise, and we wish to perform principal component analysis (PCA) on the unobserved signals $S_i \in \mathbb{R}^p$. The first model arises in missing data problems, where the $D_i$ are binary. The second model captures noisy deconvolution problems, where the $D_i$ are the Fourier transforms of the convolution kernels. It is often reasonable to assume the $S_i$ lie on an unknown low-dimensional linear space; however, because many coordinates can be suppressed by the $D_i$, this low-dimensional structure can be obscured.
We introduce diagonally reduced spiked covariance models to capture this setting. We characterize the behavior of the singular vectors and singular values of the data matrix under high-dimensional asymptotics where $n,p\to\infty$ such that $p/n\to\gamma>0$. Our results have the most general assumptions to date even without diagonal reduction. Using them, we develop optimal eigenvalue shrinkage methods for covariance matrix estimation and optimal singular value shrinkage methods for data denoising.
Finally, we characterize the error rates of the empirical Best Linear Predictor (EBLP) denoisers. We show that, perhaps surprisingly, their optimal tuning depends on whether we denoise in-sample or out-of-sample, but the optimally tuned mean squared error is the same in the two cases.
Submitted 1 November, 2018; v1 submitted 30 November, 2016;
originally announced November 2016.
-
Sharp detection in PCA under correlations: all eigenvalues matter
Authors:
Edgar Dobriban
Abstract:
Principal component analysis (PCA) is a widely used method for dimension reduction. In high dimensional data, the "signal" eigenvalues corresponding to weak principal components (PCs) do not necessarily separate from the bulk of the "noise" eigenvalues. Therefore, popular tests based on the largest eigenvalue have little power to detect weak PCs. In the special case of the spiked model, certain tests asymptotically equivalent to linear spectral statistics (LSS)---averaging effects over all eigenvalues---were recently shown to achieve some power.
We consider a nonparametric, non-Gaussian generalization of the spiked model to the setting of Marchenko and Pastur (1967). This allows a general bulk of the noise eigenvalues, accommodating correlated variables even under the null hypothesis of no significant PCs.
We develop new tests based on LSS to detect weak PCs in this model. We show using the CLT for LSS that the optimal LSS satisfy a Fredholm integral equation of the first kind. We develop algorithms to solve it, building on our recent method for computing the limit empirical spectrum. In contrast to the standard spiked model, we find that under "widely spread" null eigenvalue distributions, the new tests have a lot of power.
Submitted 22 February, 2016;
originally announced February 2016.
-
High-Dimensional Asymptotics of Prediction: Ridge Regression and Classification
Authors:
Edgar Dobriban,
Stefan Wager
Abstract:
We provide a unified analysis of the predictive risk of ridge regression and regularized discriminant analysis in a dense random effects model. We work in a high-dimensional asymptotic regime where $p, n \to \infty$ and $p/n \to \gamma \in (0, \, \infty)$, and allow for arbitrary covariance among the features. For both methods, we provide an explicit and efficiently computable expression for the limiting predictive risk, which depends only on the spectrum of the feature-covariance matrix, the signal strength, and the aspect ratio $\gamma$. Especially in the case of regularized discriminant analysis, we find that predictive accuracy has a nuanced dependence on the eigenvalue distribution of the covariance matrix, suggesting that analyses based on the operator norm of the covariance matrix may not be sharp. Our results also uncover several qualitative insights about both methods: for example, with ridge regression, there is an exact inverse relation between the limiting predictive risk and the limiting estimation risk given a fixed signal strength. Our analysis builds on recent advances in random matrix theory.
Submitted 4 November, 2015; v1 submitted 10 July, 2015;
originally announced July 2015.
-
Efficient Computation of Limit Spectra of Sample Covariance Matrices
Authors:
Edgar Dobriban
Abstract:
Consider an $n \times p$ data matrix $X$ whose rows are independently sampled from a population with covariance $\Sigma$. When $n,p$ are both large, the eigenvalues of the sample covariance matrix are substantially different from those of the true covariance. Asymptotically, as $n,p \to \infty$ with $p/n \to \gamma$, there is a deterministic mapping from the population spectral distribution (PSD) to the empirical spectral distribution (ESD) of the eigenvalues. The mapping is characterized by a fixed-point equation for the Stieltjes transform.
We propose a new method to compute numerically the output ESD from an arbitrary input PSD. Our method, called Spectrode, finds the support and the density of the ESD to high precision; we prove this for finite discrete distributions. In computational experiments it outperforms existing methods by several orders of magnitude in speed and accuracy. We apply Spectrode to compute expectations and contour integrals of the ESD. These quantities are often central in applications of random matrix theory (RMT).
We illustrate that Spectrode is directly useful in statistical problems, such as estimation and hypothesis testing for covariance matrices. Our proposal may make it more convenient to use asymptotic RMT in aspects of high-dimensional data analysis.
Submitted 6 July, 2015;
originally announced July 2015.
-
Regularity Properties for Sparse Regression
Authors:
Edgar Dobriban,
Jianqing Fan
Abstract:
Statistical and machine learning theory has developed several conditions ensuring that popular estimators such as the Lasso or the Dantzig selector perform well in high-dimensional sparse regression, including the restricted eigenvalue, compatibility, and $\ell_q$ sensitivity properties. However, some of the central aspects of these conditions are not well understood. For instance, it is unknown if these conditions can be checked efficiently on any given data set. This is problematic, because they are at the core of the theory of sparse regression.
Here we provide a rigorous proof that these conditions are NP-hard to check. This shows that the conditions are computationally infeasible to verify, and raises some questions about their practical applications.
However, by taking an average-case perspective instead of the worst-case view of NP-hardness, we show that a particular condition, $\ell_q$ sensitivity, has certain desirable properties. This condition is weaker and more general than the others. We show that it holds with high probability in models where the parent population is well behaved, and that it is robust to certain data processing steps. These results are desirable, as they provide guidance about when the condition, and more generally the theory of sparse regression, may be relevant in the analysis of high-dimensional correlated observational data.
Submitted 5 December, 2015; v1 submitted 22 May, 2013;
originally announced May 2013.
-
Certifying the restricted isometry property is hard
Authors:
Afonso S. Bandeira,
Edgar Dobriban,
Dustin G. Mixon,
William F. Sawin
Abstract:
This paper is concerned with an important matrix condition in compressed sensing known as the restricted isometry property (RIP). We demonstrate that testing whether a matrix satisfies RIP is NP-hard. As a consequence of our result, it is impossible to efficiently test for RIP provided $P \neq NP$.
Submitted 10 September, 2012; v1 submitted 6 April, 2012;
originally announced April 2012.