
Showing 1–16 of 16 results for author: Frei, S

Searching in archive stat.
  1. arXiv:2404.00522  [pdf, other]

    cs.LG stat.ML

    Minimum-Norm Interpolation Under Covariate Shift

    Authors: Neil Mallinar, Austin Zane, Spencer Frei, Bin Yu

    Abstract: Transfer learning is a critical part of real-world machine learning deployments and has been extensively studied in experimental works with overparameterized neural networks. However, even in the simplest setting of linear regression a notable gap still exists in the theoretical understanding of transfer learning. In-distribution research on high-dimensional linear regression has led to the identi…

    Submitted 30 March, 2024; originally announced April 2024.

  2. arXiv:2310.02541  [pdf, other]

    cs.LG stat.ML

    Benign Overfitting and Grokking in ReLU Networks for XOR Cluster Data

    Authors: Zhiwei Xu, Yutong Wang, Spencer Frei, Gal Vardi, Wei Hu

    Abstract: Neural networks trained by gradient descent (GD) have exhibited a number of surprising generalization behaviors. First, they can achieve a perfect fit to noisy training data and still generalize near-optimally, showing that overfitting can sometimes be benign. Second, they can undergo a period of classical, harmful overfitting -- achieving a perfect fit to training data with near-random performanc…

    Submitted 3 October, 2023; originally announced October 2023.

  3. arXiv:2308.03215  [pdf, other]

    stat.ML cs.LG

    The Effect of SGD Batch Size on Autoencoder Learning: Sparsity, Sharpness, and Feature Learning

    Authors: Nikhil Ghosh, Spencer Frei, Wooseok Ha, Bin Yu

    Abstract: In this work, we investigate the dynamics of stochastic gradient descent (SGD) when training a single-neuron autoencoder with linear or ReLU activation on orthogonal data. We show that for this non-convex problem, randomly initialized SGD with a constant step size successfully finds a global minimum for any batch size choice. However, the particular global minimum found depends upon the batch size…

    Submitted 6 August, 2023; originally announced August 2023.

  4. arXiv:2306.09927  [pdf, other]

    stat.ML cs.AI cs.CL cs.LG

    Trained Transformers Learn Linear Models In-Context

    Authors: Ruiqi Zhang, Spencer Frei, Peter L. Bartlett

    Abstract: Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL): Given a short prompt sequence of tokens from an unseen task, they can formulate relevant per-token and next-token predictions without any parameter updates. By embedding a sequence of labeled training data and unlabeled test data as a prompt, this allows for transformer…

    Submitted 19 October, 2023; v1 submitted 16 June, 2023; originally announced June 2023.

    Comments: 50 pages, revised definition 3.2 and corollary 4.3

  5. arXiv:2303.01462  [pdf, ps, other]

    cs.LG stat.ML

    Benign Overfitting in Linear Classifiers and Leaky ReLU Networks from KKT Conditions for Margin Maximization

    Authors: Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro

    Abstract: Linear classifiers and leaky ReLU networks trained by gradient flow on the logistic loss have an implicit bias towards solutions which satisfy the Karush–Kuhn–Tucker (KKT) conditions for margin maximization. In this work we establish a number of settings where the satisfaction of these KKT conditions implies benign overfitting in linear classifiers and in two-layer leaky ReLU networks: the estim…

    Submitted 2 March, 2023; originally announced March 2023.

    Comments: 53 pages

  6. arXiv:2303.01456  [pdf, ps, other]

    cs.LG stat.ML

    The Double-Edged Sword of Implicit Bias: Generalization vs. Robustness in ReLU Networks

    Authors: Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro

    Abstract: In this work, we study the implications of the implicit bias of gradient flow on generalization and adversarial robustness in ReLU networks. We focus on a setting where the data consists of clusters and the correlations between cluster means are small, and show that in two-layer ReLU networks gradient flow is biased towards solutions that generalize well, but are highly vulnerable to adversarial e…

    Submitted 31 October, 2023; v1 submitted 2 March, 2023; originally announced March 2023.

    Comments: 42 pages; NeurIPS 2023 camera ready

  7. arXiv:2210.07082  [pdf, other]

    cs.LG stat.ML

    Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data

    Authors: Spencer Frei, Gal Vardi, Peter L. Bartlett, Nathan Srebro, Wei Hu

    Abstract: The implicit biases of gradient-based optimization algorithms are conjectured to be a major factor in the success of modern deep learning. In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly-orthogonal, a common property of high-dimensional data. For gradient…

    Submitted 13 October, 2022; originally announced October 2022.

    Comments: 54 pages

  8. arXiv:2202.07626  [pdf, other]

    cs.LG math.ST stat.ML

    Random Feature Amplification: Feature Learning and Generalization in Neural Networks

    Authors: Spencer Frei, Niladri S. Chatterji, Peter L. Bartlett

    Abstract: In this work, we provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent on the logistic loss following random initialization. We consider data with binary labels that are generated by an XOR-like function of the input features. We permit a constant fraction of the training labels to be corrupted by an adversary. We show that, although line…

    Submitted 13 September, 2023; v1 submitted 15 February, 2022; originally announced February 2022.

    Comments: 46 pages; JMLR camera ready revision

  9. arXiv:2202.05928  [pdf, ps, other]

    cs.LG math.ST stat.ML

    Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data

    Authors: Spencer Frei, Niladri S. Chatterji, Peter L. Bartlett

    Abstract: Benign overfitting, the phenomenon where interpolating models generalize well in the presence of noisy data, was first observed in neural network models trained with gradient descent. To better understand this empirical observation, we consider the generalization error of two-layer neural networks trained to interpolation by gradient descent on the logistic loss following random initialization. We…

    Submitted 13 September, 2023; v1 submitted 11 February, 2022; originally announced February 2022.

    Comments: 39 pages; minor corrections

  10. arXiv:2106.13805  [pdf, other]

    cs.LG math.OC stat.ML

    Self-training Converts Weak Learners to Strong Learners in Mixture Models

    Authors: Spencer Frei, Difan Zou, Zixiang Chen, Quanquan Gu

    Abstract: We consider a binary classification problem when the data comes from a mixture of two rotationally symmetric distributions satisfying concentration and anti-concentration properties enjoyed by log-concave distributions among others. We show that there exists a universal constant $C_{\mathrm{err}}>0$ such that if a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ can achieve classification error at most…

    Submitted 25 August, 2021; v1 submitted 25 June, 2021; originally announced June 2021.

    Comments: 23 pages. This version has added more detailed comparisons with related work, fixed a technical issue in the original proof, and improved the convergence guarantee to be about the last iterate of stochastic gradient descent

  11. arXiv:2106.13792  [pdf, ps, other]

    cs.LG math.OC stat.ML

    Proxy Convexity: A Unified Framework for the Analysis of Neural Networks Trained by Gradient Descent

    Authors: Spencer Frei, Quanquan Gu

    Abstract: Although the optimization objectives for learning neural networks are highly non-convex, gradient-based methods have been wildly successful at learning neural networks in practice. This juxtaposition has led to a number of recent studies on provable guarantees for neural networks trained by gradient descent. Unfortunately, the techniques in these works are often highly specific to the particular s…

    Submitted 13 September, 2022; v1 submitted 25 June, 2021; originally announced June 2021.

    Comments: 16 pages. Updated presentation, changed results from online SGD to batch GD

  12. arXiv:2104.09437  [pdf, other]

    cs.LG cs.CR math.OC stat.ML

    Provable Robustness of Adversarial Training for Learning Halfspaces with Noise

    Authors: Difan Zou, Spencer Frei, Quanquan Gu

    Abstract: We analyze the properties of adversarial training for learning adversarially robust halfspaces in the presence of agnostic label noise. Denoting $\mathsf{OPT}_{p,r}$ as the best robust classification error achieved by a halfspace that is robust to perturbations of $\ell_{p}$ balls of radius $r$, we show that adversarial training on the standard binary cross-entropy loss yields adversarially robust…

    Submitted 19 April, 2021; originally announced April 2021.

    Comments: 42 pages, 2 figures

  13. arXiv:2101.01152  [pdf, other]

    cs.LG math.OC stat.ML

    Provable Generalization of SGD-trained Neural Networks of Any Width in the Presence of Adversarial Label Noise

    Authors: Spencer Frei, Yuan Cao, Quanquan Gu

    Abstract: We consider a one-hidden-layer leaky ReLU network of arbitrary width trained by stochastic gradient descent (SGD) following an arbitrary initialization. We prove that SGD produces neural networks that have classification accuracy competitive with that of the best halfspace over the distribution for a broad class of distributions that includes log-concave isotropic and hard margin distributions. Eq…

    Submitted 15 February, 2021; v1 submitted 4 January, 2021; originally announced January 2021.

    Comments: 30 pages, 10 figures

  14. arXiv:2010.00539  [pdf, other]

    cs.LG math.OC stat.ML

    Agnostic Learning of Halfspaces with Gradient Descent via Soft Margins

    Authors: Spencer Frei, Yuan Cao, Quanquan Gu

    Abstract: We analyze the properties of gradient descent on convex surrogates for the zero-one loss for the agnostic learning of linear halfspaces. If $\mathsf{OPT}$ is the best classification error achieved by a halfspace, by appealing to the notion of soft margins we are able to show that gradient descent finds halfspaces with classification error $\tilde O(\mathsf{OPT}^{1/2}) + \varepsilon$ in…

    Submitted 13 February, 2021; v1 submitted 1 October, 2020; originally announced October 2020.

    Comments: 25 pages, 1 table

  15. arXiv:2005.14426  [pdf, other]

    cs.LG math.OC stat.ML

    Agnostic Learning of a Single Neuron with Gradient Descent

    Authors: Spencer Frei, Yuan Cao, Quanquan Gu

    Abstract: We consider the problem of learning the best-fitting single neuron as measured by the expected square loss $\mathbb{E}_{(x,y)\sim \mathcal{D}}[(\sigma(w^\top x)-y)^2]$ over some unknown joint distribution $\mathcal{D}$ by using gradient descent to minimize the empirical risk induced by a set of i.i.d. samples $S\sim \mathcal{D}^n$. The activation function $\sigma$ is an arbitrary Lipschitz and non-decreasin…

    Submitted 31 August, 2020; v1 submitted 29 May, 2020; originally announced May 2020.

    Comments: 31 pages, 3 tables. This version improves the risk bound from O(OPT^{1/2}) to O(OPT) for strictly increasing activation functions

  16. arXiv:1910.02934  [pdf, other]

    cs.LG math.OC stat.ML

    Algorithm-Dependent Generalization Bounds for Overparameterized Deep Residual Networks

    Authors: Spencer Frei, Yuan Cao, Quanquan Gu

    Abstract: The skip-connections used in residual networks have become a standard architecture choice in deep learning due to the increased training stability and generalization performance with this architecture, although there has been limited theoretical understanding for this improvement. In this work, we analyze overparameterized deep residual networks trained by gradient descent following random initial…

    Submitted 7 October, 2019; originally announced October 2019.

    Comments: 37 pages. In NeurIPS 2019