Showing 1–33 of 33 results for author: Dobriban, E

  1. arXiv:2405.19544

    cs.AI cs.CL cs.LG math.OC stat.ML

    One-Shot Safety Alignment for Large Language Models via Optimal Dualization

    Authors: Xinmeng Huang, Shuo Li, Edgar Dobriban, Osbert Bastani, Hamed Hassani, Dongsheng Ding

    Abstract: The growing safety concerns surrounding Large Language Models (LLMs) raise an urgent need to align them with diverse human preferences to simultaneously enhance their helpfulness and safety. A promising approach is to enforce safety constraints through Reinforcement Learning from Human Feedback (RLHF). For such constrained RLHF, common Lagrangian-based primal-dual policy optimization methods are c…

    Submitted 29 May, 2024; originally announced May 2024.

  2. arXiv:2404.00912

    math.ST stat.CO stat.ME stat.ML

    Inference in Randomized Least Squares and PCA via Normality of Quadratic Forms

    Authors: Leda Wang, Zhixiang Zhang, Edgar Dobriban

    Abstract: Randomized algorithms can be used to speed up the analysis of large datasets. In this paper, we develop a unified methodology for statistical inference via randomized sketching or projections in two of the most fundamental problems in multivariate statistical analysis: least squares and PCA. The methodology applies to fixed datasets -- i.e., is data-conditional -- and the only randomness is due to…

    Submitted 1 April, 2024; originally announced April 2024.

  3. arXiv:2403.18216

    stat.ML cs.CY cs.LG math.ST

    Minimax Optimal Fair Classification with Bounded Demographic Disparity

    Authors: Xianli Zeng, Guang Cheng, Edgar Dobriban

    Abstract: Mitigating the disparate impact of statistical machine learning methods is crucial for ensuring fairness. While extensive research aims to reduce disparity, the effect of using a \emph{finite dataset} -- as opposed to the entire population -- remains unclear. This paper explores the statistical foundations of fair binary classification with two protected groups, focusing on controlling demographic…

    Submitted 26 March, 2024; originally announced March 2024.

  4. arXiv:2312.16160

    stat.ME cs.LG math.ST stat.ML

    SymmPI: Predictive Inference for Data with Group Symmetries

    Authors: Edgar Dobriban, Mengxin Yu

    Abstract: Quantifying the uncertainty of predictions is a core problem in modern statistics. Methods for predictive inference have been developed under a variety of assumptions, often -- for instance, in standard conformal prediction -- relying on the invariance of the distribution of the data under special groups of transformations such as permutation groups. Moreover, many existing methods for predictive…

    Submitted 28 December, 2023; v1 submitted 26 December, 2023; originally announced December 2023.

    Comments: 45 pages

  5. arXiv:2308.01853

    stat.ML cs.LG math.ST

    Statistical Estimation Under Distribution Shift: Wasserstein Perturbations and Minimax Theory

    Authors: Patrick Chao, Edgar Dobriban

    Abstract: Distribution shifts are a serious concern in modern statistical learning as they can systematically change the properties of the data away from the truth. We focus on Wasserstein distribution shifts, where every data point may undergo a slight perturbation, as opposed to the Huber contamination model where a fraction of observations are outliers. We consider perturbations that are either independe…

    Submitted 9 October, 2023; v1 submitted 3 August, 2023; originally announced August 2023.

    Comments: 60 pages, 7 figures

  6. arXiv:2307.11255

    stat.ME math.ST stat.CO

    A Framework for Statistical Inference via Randomized Algorithms

    Authors: Zhixiang Zhang, Sokbae Lee, Edgar Dobriban

    Abstract: Randomized algorithms, such as randomized sketching or projections, are a promising approach to ease the computational burden in analyzing large datasets. However, randomized algorithms also produce non-deterministic outputs, leading to the problem of evaluating their accuracy. In this paper, we develop a statistical inference framework for quantifying the uncertainty of the outputs of randomized…

    Submitted 28 September, 2023; v1 submitted 20 July, 2023; originally announced July 2023.

  7. arXiv:2306.16406

    stat.ME math.ST stat.ML

    Efficient and Multiply Robust Risk Estimation under General Forms of Dataset Shift

    Authors: Hongxiang Qiu, Eric Tchetgen Tchetgen, Edgar Dobriban

    Abstract: Statistical machine learning methods often face the challenge of limited data available from the population of interest. One remedy is to leverage data from auxiliary source populations, which share some conditional distributions or are linked in other ways with the target domain. Techniques leveraging such \emph{dataset shift} conditions are known as \emph{domain adaptation} or \emph{transfer lea…

    Submitted 7 June, 2024; v1 submitted 28 June, 2023; originally announced June 2023.

  8. arXiv:2304.09154

    stat.ME math.ST stat.ML

    Sharp-SSL: Selective high-dimensional axis-aligned random projections for semi-supervised learning

    Authors: Tengyao Wang, Edgar Dobriban, Milana Gataric, Richard J. Samworth

    Abstract: We propose a new method for high-dimensional semi-supervised learning problems based on the careful aggregation of the results of a low-dimensional procedure applied to many axis-aligned random projections of the data. Our primary goal is to identify important variables for distinguishing between the classes; existing low-dimensional methods can then be applied for final class assignment. Motivate…

    Submitted 18 April, 2023; originally announced April 2023.

    Comments: 49 pages, 4 figures

    MSC Class: 62H30

  9. arXiv:2211.04612

    stat.ME math.ST stat.ML

    Conformal Frequency Estimation using Discrete Sketched Data with Coverage for Distinct Queries

    Authors: Matteo Sesia, Stefano Favaro, Edgar Dobriban

    Abstract: This paper develops conformal inference methods to construct a confidence interval for the frequency of a queried object in a very large discrete data set, based on a sketch with a lower memory footprint. This approach requires no knowledge of the data distribution and can be combined with any sketching algorithm, including but not limited to the renowned count-min sketch, the count-sketch, and va…

    Submitted 15 August, 2023; v1 submitted 8 November, 2022; originally announced November 2022.

    Comments: 79 pages, 47 figures, 2 tables. Extended version of arXiv:2204.04270

  10. arXiv:2205.10798

    cs.LG math.ST stat.ML

    PAC-Wrap: Semi-Supervised PAC Anomaly Detection

    Authors: Shuo Li, Xiayan Ji, Edgar Dobriban, Oleg Sokolsky, Insup Lee

    Abstract: Anomaly detection is essential for preventing hazardous outcomes for safety-critical applications like autonomous driving. Given their safety-criticality, these applications benefit from provable bounds on various errors in anomaly detection. To achieve this goal in the semi-supervised setting, we propose to provide Probably Approximately Correct (PAC) guarantees on the false negative and false po…

    Submitted 21 June, 2022; v1 submitted 22 May, 2022; originally announced May 2022.

    Comments: Accepted by SIGKDD 2022

  11. arXiv:2203.06126

    stat.ME math.ST stat.ML

    Prediction Sets Adaptive to Unknown Covariate Shift

    Authors: Hongxiang Qiu, Edgar Dobriban, Eric Tchetgen Tchetgen

    Abstract: Predicting sets of outcomes -- instead of unique outcomes -- is a promising solution to uncertainty quantification in statistical learning. Despite a rich literature on constructing prediction sets with statistical guarantees, adapting to unknown covariate shift -- a prevalent issue in practice -- poses a serious unsolved challenge. In this paper, we show that prediction sets with finite-sample co…

    Submitted 17 June, 2023; v1 submitted 11 March, 2022; originally announced March 2022.

  12. arXiv:2108.11872

    math.ST cs.LG math.OC stat.ML

    Comparing Classes of Estimators: When does Gradient Descent Beat Ridge Regression in Linear Models?

    Authors: Dominic Richards, Edgar Dobriban, Patrick Rebeschini

    Abstract: Methods for learning from data depend on various types of tuning parameters, such as penalization strength or step size. Since performance can depend strongly on these parameters, it is important to compare classes of estimators -- by considering prescribed finite sets of tuning parameters -- not just particularly tuned methods. In this work, we investigate classes of methods via the relative performanc…

    Submitted 12 June, 2022; v1 submitted 26 August, 2021; originally announced August 2021.

  13. arXiv:2104.12260

    math.ST stat.ME

    Consistency of invariance-based randomization tests

    Authors: Edgar Dobriban

    Abstract: Invariance-based randomization tests -- such as permutation tests, rotation tests, or sign changes -- are an important and widely used class of statistical methods. They allow drawing inferences under weak assumptions on the data distribution. Most work focuses on their type I error control properties, while their consistency properties are much less understood. We develop a general framework an…

    Submitted 20 December, 2021; v1 submitted 25 April, 2021; originally announced April 2021.

    Comments: This version improves the results from the previous one

    Journal ref: Annals of Statistics, 2022+

  14. arXiv:2012.02985

    math.ST stat.ME

    Selecting the number of components in PCA via random signflips

    Authors: David Hong, Yue Sheng, Edgar Dobriban

    Abstract: Principal component analysis (PCA) is a foundational tool in modern data analysis, and a crucial step in PCA is selecting the number of components to keep. However, classical selection methods (e.g., scree plots, parallel analysis, etc.) lack statistical guarantees in the increasingly common setting of large-dimensional data with heterogeneous noise, i.e., where each entry may have a different noi…

    Submitted 25 May, 2024; v1 submitted 5 December, 2020; originally announced December 2020.

    Comments: 38 pages, 14 figures

  15. arXiv:2010.05170

    stat.ML cs.LG math.ST

    What causes the test error? Going beyond bias-variance via ANOVA

    Authors: Licong Lin, Edgar Dobriban

    Abstract: Modern machine learning methods are often overparametrized, allowing adaptation to the data at a fine level. This can seem puzzling; in the worst case, such models do not need to generalize. This puzzle inspired a great amount of work, arguing when overparametrization reduces test error, in a phenomenon called "double descent". Recent work aimed to understand in greater depth why overparametrizati…

    Submitted 9 June, 2021; v1 submitted 11 October, 2020; originally announced October 2020.

  16. arXiv:2005.00511

    math.ST math.PR

    How to reduce dimension with PCA and random projections?

    Authors: Fan Yang, Sifan Liu, Edgar Dobriban, David P. Woodruff

    Abstract: In our "big data" age, the size and complexity of data is steadily increasing. Methods for dimension reduction are ever more popular and useful. Two distinct types of dimension reduction are "data-oblivious" methods such as random projections and sketching, and "data-aware" methods such as principal component analysis (PCA). Both have their strengths, such as speed for random projections, and data…

    Submitted 28 March, 2021; v1 submitted 1 May, 2020; originally announced May 2020.

    Comments: 56 pages, 12 figures

  17. arXiv:2003.07802

    stat.ML cs.LG math.OC

    The Implicit Regularization of Stochastic Gradient Flow for Least Squares

    Authors: Alnur Ali, Edgar Dobriban, Ryan J. Tibshirani

    Abstract: We study the implicit regularization of mini-batch stochastic gradient descent, when applied to the fundamental problem of least squares regression. We leverage a continuous-time stochastic differential equation having the same moments as stochastic gradient descent, which we call stochastic gradient flow. We give a bound on the excess risk of stochastic gradient flow at time $t$, over ridge regre…

    Submitted 19 June, 2020; v1 submitted 17 March, 2020; originally announced March 2020.

    Comments: ICML 2020

  18. arXiv:2002.00864

    math.OC cs.LG

    Optimal Iterative Sketching with the Subsampled Randomized Hadamard Transform

    Authors: Jonathan Lacotte, Sifan Liu, Edgar Dobriban, Mert Pilanci

    Abstract: Random projections or sketching are widely used in many algorithmic and learning contexts. Here we study the performance of iterative Hessian sketch for least-squares problems. By leveraging and extending recent results from random matrix theory on the limiting spectrum of matrices randomly projected with the subsampled randomized Hadamard transform, and truncated Haar matrices, we can study and c…

    Submitted 23 October, 2020; v1 submitted 3 February, 2020; originally announced February 2020.

  19. arXiv:1911.07956

    cs.LG cs.CV math.OC stat.ML

    Implicit Regularization and Convergence for Weight Normalization

    Authors: Xiaoxia Wu, Edgar Dobriban, Tongzheng Ren, Shanshan Wu, Zhiyuan Li, Suriya Gunasekar, Rachel Ward, Qiang Liu

    Abstract: Normalization methods such as batch [Ioffe and Szegedy, 2015], weight [Salimans and Kingma, 2016], instance [Ulyanov et al., 2016], and layer normalization [Ba et al., 2016] have been widely used in modern machine learning. Here, we study the weight normalization (WN) method [Salimans and Kingma, 2016] and a variant called reparametrized projected gradient descent (rPGD) for overparametrized least-s…

    Submitted 30 August, 2022; v1 submitted 18 November, 2019; originally announced November 2019.

    Comments: NeurIPS 2020

  20. arXiv:1910.02373

    math.ST stat.ML

    Ridge Regression: Structure, Cross-Validation, and Sketching

    Authors: Sifan Liu, Edgar Dobriban

    Abstract: We study the following three fundamental problems about ridge regression: (1) what is the structure of the estimator? (2) how to correctly use cross-validation to choose the regularization parameter? and (3) how to accelerate computation without losing too much accuracy? We consider the three problems in a unified large-data linear model. We give a precise representation of ridge regression as a c…

    Submitted 29 March, 2020; v1 submitted 6 October, 2019; originally announced October 2019.

    Comments: Published as a conference paper at ICLR 2020

  21. arXiv:1907.10905

    stat.ML cs.LG math.ST

    A Group-Theoretic Framework for Data Augmentation

    Authors: Shuxiao Chen, Edgar Dobriban, Jane H Lee

    Abstract: Data augmentation is a widely used trick when training deep neural networks: in addition to the original data, properly transformed data are also added to the training set. However, to the best of our knowledge, a clear mathematical framework to explain the performance benefits of data augmentation is not available. In this paper, we develop such a theoretical framework. We show data augmentation…

    Submitted 6 November, 2020; v1 submitted 25 July, 2019; originally announced July 2019.

    Comments: To appear in Journal of Machine Learning Research

  22. arXiv:1903.09321

    math.ST cs.DC cs.LG stat.CO

    WONDER: Weighted one-shot distributed ridge regression in high dimensions

    Authors: Edgar Dobriban, Yue Sheng

    Abstract: In many areas, practitioners need to analyze large datasets that challenge conventional single-machine computing. To scale up data analysis, distributed and parallel computing approaches are increasingly needed. Here we study a fundamental and highly important problem in this area: How to do ridge regression in a distributed computing environment? Ridge regression is an extremely popular method…

    Submitted 19 February, 2020; v1 submitted 21 March, 2019; originally announced March 2019.

    Comments: Gave the name "Wonder" to the algorithm, updated title, added algorithm for general non-isotropic design

    Report number: Journal of Machine Learning Research 21(66) p. 1-52 2020. Short version at ICML 2020

  23. arXiv:1810.06089

    math.ST cs.LG math.NA stat.ME stat.ML

    Asymptotics for Sketching in Least Squares Regression

    Authors: Edgar Dobriban, Sifan Liu

    Abstract: We consider a least squares regression problem where the data has been generated from a linear model, and we are interested to learn the unknown regression parameters. We consider "sketch-and-solve" methods that randomly project the data first, and do regression after. Previous works have analyzed the statistical and computational performance of such methods. However, the existing analysis is not…

    Submitted 6 October, 2019; v1 submitted 14 October, 2018; originally announced October 2018.

    Journal ref: Updated manuscript to be consistent with version at NeurIPS 2019

  24. arXiv:1810.00412

    math.ST stat.CO stat.ME stat.ML

    Distributed linear regression by averaging

    Authors: Edgar Dobriban, Yue Sheng

    Abstract: Distributed statistical learning problems arise commonly when dealing with large datasets. In this setup, datasets are partitioned over machines, which compute locally, and communicate short messages. Communication is often the bottleneck. In this paper, we study one-step and iterative weighted parameter averaging in statistical linear models under data parallelism. We do linear regression on each…

    Submitted 22 October, 2022; v1 submitted 30 September, 2018; originally announced October 2018.

    Comments: Fixing a typo

    Journal ref: Ann. Statist. 49(2): 918-943 (April 2021)

  25. arXiv:1807.00347

    math.ST stat.ME

    Robust Inference Under Heteroskedasticity via the Hadamard Estimator

    Authors: Edgar Dobriban, Weijie J. Su, Yachong Yang, Zhixiang Zhang

    Abstract: Drawing statistical inferences from large datasets in a model-robust way is an important problem in statistics and data science. In this paper, we propose methods that are robust to large and unequal noise in different observational units (i.e., heteroskedasticity) for statistical inference in linear regression. We leverage the Hadamard estimator, which is unbiased for the variances of ordinary le…

    Submitted 9 January, 2024; v1 submitted 1 July, 2018; originally announced July 2018.

  26. arXiv:1710.00479

    math.ST stat.ME

    Permutation methods for factor analysis and PCA

    Authors: Edgar Dobriban

    Abstract: Researchers often have datasets measuring features $x_{ij}$ of samples, such as test scores of students. In factor analysis and PCA, these features are thought to be influenced by unobserved factors, such as skills. Can we determine how many components affect the data? This is an important problem, because it has a large impact on all downstream data analysis. Consequently, many approaches have be…

    Submitted 13 September, 2019; v1 submitted 2 October, 2017; originally announced October 2017.

    Comments: To appear in the Annals of Statistics

  27. arXiv:1709.03393

    math.ST

    Optimal prediction in the linearly transformed spiked model

    Authors: Edgar Dobriban, William Leeb, Amit Singer

    Abstract: We consider the linearly transformed spiked model, where observations $Y_i$ are noisy linear transforms of unobserved signals of interest $X_i$: \begin{align*} Y_i = A_i X_i + \varepsilon_i, \end{align*} for $i=1,\ldots,n$. The transform matrices $A_i$ are also observed. We model $X_i$ as random vectors lying on an unknown low-dimensional space. How should we predict the unobserved signals (regr…

    Submitted 11 July, 2018; v1 submitted 7 September, 2017; originally announced September 2017.

    Comments: This paper replaces the preprint "PCA from noisy, linearly reduced data: the diagonal case" by Edgar Dobriban, William Leeb, and Amit Singer (arXiv:1611.10333)

  28. arXiv:1611.10333

    math.ST math.PR

    PCA from noisy, linearly reduced data: the diagonal case

    Authors: Edgar Dobriban, William Leeb, Amit Singer

    Abstract: Suppose we observe data of the form $Y_i = D_i (S_i + \varepsilon_i) \in \mathbb{R}^p$ or $Y_i = D_i S_i + \varepsilon_i \in \mathbb{R}^p$, $i=1,\ldots,n$, where $D_i \in \mathbb{R}^{p\times p}$ are known diagonal matrices, $\varepsilon_i$ are noise, and we wish to perform principal component analysis (PCA) on the unobserved signals $S_i \in \mathbb{R}^p$. The first model arises in missing data pr…

    Submitted 1 November, 2018; v1 submitted 30 November, 2016; originally announced November 2016.

    Comments: This technical report has been largely superseded by our later paper arXiv:1709.03393. Please cite that one instead of this one. This paper has a slightly different approach, so we want to keep it publicly available

  29. Sharp detection in PCA under correlations: all eigenvalues matter

    Authors: Edgar Dobriban

    Abstract: Principal component analysis (PCA) is a widely used method for dimension reduction. In high dimensional data, the "signal" eigenvalues corresponding to weak principal components (PCs) do not necessarily separate from the bulk of the "noise" eigenvalues. Therefore, popular tests based on the largest eigenvalue have little power to detect weak PCs. In the special case of the spiked model, certain te…

    Submitted 22 February, 2016; originally announced February 2016.

    Comments: 46 pages, 9 figures

    MSC Class: 62H25

    Journal ref: Ann. Statist. Volume 45, Number 4 (2017), 1810-1833

  30. arXiv:1507.03003

    math.ST stat.ML

    High-Dimensional Asymptotics of Prediction: Ridge Regression and Classification

    Authors: Edgar Dobriban, Stefan Wager

    Abstract: We provide a unified analysis of the predictive risk of ridge regression and regularized discriminant analysis in a dense random effects model. We work in a high-dimensional asymptotic regime where $p, n \to \infty$ and $p/n \to γ \in (0, \, \infty)$, and allow for arbitrary covariance among the features. For both methods, we provide an explicit and efficiently computable expression for the limitin…

    Submitted 4 November, 2015; v1 submitted 10 July, 2015; originally announced July 2015.

    Comments: Added a section on prediction versus estimation for ridge regression. Rewrote introduction. Other results unchanged

  31. Efficient Computation of Limit Spectra of Sample Covariance Matrices

    Authors: Edgar Dobriban

    Abstract: Consider an $n \times p$ data matrix $X$ whose rows are independently sampled from a population with covariance $Σ$. When $n,p$ are both large, the eigenvalues of the sample covariance matrix are substantially different from those of the true covariance. Asymptotically, as $n,p \to \infty$ with $p/n \to γ$, there is a deterministic mapping from the population spectral distribution (PSD) to the emp…

    Submitted 6 July, 2015; originally announced July 2015.

  32. Regularity Properties for Sparse Regression

    Authors: Edgar Dobriban, Jianqing Fan

    Abstract: Statistical and machine learning theory has developed several conditions ensuring that popular estimators such as the Lasso or the Dantzig selector perform well in high-dimensional sparse regression, including the restricted eigenvalue, compatibility, and $\ell_q$ sensitivity properties. However, some of the central aspects of these conditions are not well understood. For instance, it is unknown i…

    Submitted 5 December, 2015; v1 submitted 22 May, 2013; originally announced May 2013.

    Comments: Manuscript shortened and more motivation added. To appear in Communications in Mathematics and Statistics

    MSC Class: 62J05; 68Q17; 62H12

  33. arXiv:1204.1580

    math.FA cs.CC cs.IT

    Certifying the restricted isometry property is hard

    Authors: Afonso S. Bandeira, Edgar Dobriban, Dustin G. Mixon, William F. Sawin

    Abstract: This paper is concerned with an important matrix condition in compressed sensing known as the restricted isometry property (RIP). We demonstrate that testing whether a matrix satisfies RIP is NP-hard. As a consequence of our result, it is impossible to efficiently test for RIP provided $P \neq NP$.

    Submitted 10 September, 2012; v1 submitted 6 April, 2012; originally announced April 2012.