Showing 1–23 of 23 results for author: Nakkiran, P

  1. arXiv:2406.08929  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Step-by-Step Diffusion: An Elementary Tutorial

    Authors: Preetum Nakkiran, Arwen Bradley, Hattie Zhou, Madhu Advani

    Abstract: We present an accessible first course on diffusion models and flow matching for machine learning, aimed at a technical audience with no diffusion experience. We try to simplify the mathematical details as much as possible (sometimes heuristically), while retaining enough precision to derive correct algorithms.

    Submitted 23 June, 2024; v1 submitted 13 June, 2024; originally announced June 2024.

    Comments: 35 pages, 11 figures

  2. arXiv:2310.20703  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    Vanishing Gradients in Reinforcement Finetuning of Language Models

    Authors: Noam Razin, Hattie Zhou, Omid Saremi, Vimal Thilak, Arwen Bradley, Preetum Nakkiran, Joshua Susskind, Etai Littwin

    Abstract: Pretrained language models are commonly aligned with human preferences and downstream tasks via reinforcement finetuning (RFT), which refers to maximizing a (possibly learned) reward function using policy gradient algorithms. This work identifies a fundamental optimization obstacle in RFT: we prove that the expected gradient for an input vanishes when its reward standard deviation under the model…

    Submitted 14 March, 2024; v1 submitted 31 October, 2023; originally announced October 2023.

    Comments: Accepted to ICLR 2024

  3. arXiv:2310.16028  [pdf, other]

    cs.LG cs.AI cs.CL stat.ML

    What Algorithms can Transformers Learn? A Study in Length Generalization

    Authors: Hattie Zhou, Arwen Bradley, Etai Littwin, Noam Razin, Omid Saremi, Josh Susskind, Samy Bengio, Preetum Nakkiran

    Abstract: Large language models exhibit surprising emergent generalization properties, yet also struggle on many simple reasoning tasks such as arithmetic and parity. This raises the question of if and when Transformer models can learn the true algorithm for solving a task. We study the scope of Transformers' abilities in the specific setting of length generalization on algorithmic tasks. Here, we propose a…

    Submitted 24 October, 2023; originally announced October 2023.

    Comments: Preprint

  4. arXiv:2305.18764  [pdf, other]

    cs.LG math.ST stat.ML

    When Does Optimizing a Proper Loss Yield Calibration?

    Authors: Jarosław Błasiok, Parikshit Gopalan, Lunjia Hu, Preetum Nakkiran

    Abstract: Optimizing proper loss functions is popularly believed to yield predictors with good calibration properties; the intuition being that for such losses, the global optimum is to predict the ground-truth probabilities, which is indeed calibrated. However, typical machine learning models are trained to approximately minimize loss over restricted families of predictors that are unlikely to contain the…

    Submitted 8 December, 2023; v1 submitted 30 May, 2023; originally announced May 2023.

    Comments: In NeurIPS 2023. Selected for spotlight presentation

  5. arXiv:2304.09424  [pdf, other]

    cs.LG cs.AI stat.ML

    Loss Minimization Yields Multicalibration for Large Neural Networks

    Authors: Jarosław Błasiok, Parikshit Gopalan, Lunjia Hu, Adam Tauman Kalai, Preetum Nakkiran

    Abstract: Multicalibration is a notion of fairness for predictors that requires them to provide calibrated predictions across a large set of protected groups. Multicalibration is known to be a distinct goal from loss minimization, even for simple predictors such as linear functions. In this work, we consider the setting where the protected groups can be represented by neural networks of size $k$, and the…

    Submitted 7 December, 2023; v1 submitted 19 April, 2023; originally announced April 2023.

    Comments: In ITCS 2024

  6. arXiv:2210.01964  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    The Calibration Generalization Gap

    Authors: A. Michael Carrell, Neil Mallinar, James Lucas, Preetum Nakkiran

    Abstract: Calibration is a fundamental property of a good predictive model: it requires that the model predicts correctly in proportion to its confidence. Modern neural networks, however, provide no strong guarantees on their calibration -- and can be either poorly calibrated or well-calibrated depending on the setting. It is currently unclear which factors contribute to good calibration (architecture, data…

    Submitted 6 October, 2022; v1 submitted 4 October, 2022; originally announced October 2022.

    Comments: Appeared at ICML 2022 Workshop on Distribution-Free Uncertainty Quantification

  7. arXiv:2207.06569  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Benign, Tempered, or Catastrophic: A Taxonomy of Overfitting

    Authors: Neil Mallinar, James B. Simon, Amirhesam Abedsoltan, Parthe Pandit, Mikhail Belkin, Preetum Nakkiran

    Abstract: The practical success of overparameterized neural networks has motivated the recent scientific study of interpolating methods, which perfectly fit their training data. Certain interpolating methods, including neural networks, can fit noisy training data without catastrophically bad test performance, in defiance of standard intuitions from statistical learning theory. Aiming to explain this, a body…

    Submitted 20 October, 2022; v1 submitted 13 July, 2022; originally announced July 2022.

    Comments: NM and JS co-first authors

  8. arXiv:2204.03230  [pdf, other]

    cs.LG cs.AI cs.CR cs.CV stat.ML

    What You See is What You Get: Principled Deep Learning via Distributional Generalization

    Authors: Bogdan Kulynych, Yao-Yuan Yang, Yaodong Yu, Jarosław Błasiok, Preetum Nakkiran

    Abstract: Having similar behavior at training time and test time -- what we call a "What You See Is What You Get" (WYSIWYG) property -- is desirable in machine learning. Models trained with standard stochastic gradient descent (SGD), however, do not necessarily have this property, as their complex behaviors such as robustness or subgroup performance can differ drastically between training and test time. I…

    Submitted 17 October, 2022; v1 submitted 7 April, 2022; originally announced April 2022.

    Comments: First two authors contributed equally. To appear in NeurIPS 2022

  9. arXiv:2203.14649  [pdf, other]

    cs.LG cs.AI stat.ML

    Knowledge Distillation: Bad Models Can Be Good Role Models

    Authors: Gal Kaplun, Eran Malach, Preetum Nakkiran, Shai Shalev-Shwartz

    Abstract: Large neural networks trained in the overparameterized regime are able to fit noise to zero train error. Recent work (Nakkiran & Bansal 2020) has empirically observed that such networks behave as "conditional samplers" from the noisy distribution. That is, they replicate the noise in the train data to unseen examples. We give a theoretical framework for studying this conditional sampling…

    Submitted 28 March, 2022; originally announced March 2022.

  10. arXiv:2202.09931  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Deconstructing Distributions: A Pointwise Framework of Learning

    Authors: Gal Kaplun, Nikhil Ghosh, Saurabh Garg, Boaz Barak, Preetum Nakkiran

    Abstract: In machine learning, we traditionally evaluate the performance of a single model, averaged over a collection of test inputs. In this work, we propose a new approach: we measure the performance of a collection of models when evaluated on a $\textit{single input point}$. Specifically, we study a point's $\textit{profile}$: the relationship between models' average performance on the test distribution…

    Submitted 7 June, 2022; v1 submitted 20 February, 2022; originally announced February 2022.

    Comments: GK and NG contributed equally. v2: Added Figures 4, 5

  11. arXiv:2202.08384  [pdf, other]

    cs.LG cs.CV stat.ML

    Limitations of Neural Collapse for Understanding Generalization in Deep Learning

    Authors: Like Hui, Mikhail Belkin, Preetum Nakkiran

    Abstract: The recent work of Papyan, Han, & Donoho (2020) presented an intriguing "Neural Collapse" phenomenon, showing a structural property of interpolating classifiers in the late stage of training. This opened a rich area of exploration studying this phenomenon. Our motivation is to study the upper limits of this research program: How far will understanding Neural Collapse take us in understanding deep…

    Submitted 16 February, 2022; originally announced February 2022.

  12. arXiv:2111.05321  [pdf, ps, other]

    cs.LG cs.AI cs.CC math.ST stat.ML

    Turing-Universal Learners with Optimal Scaling Laws

    Authors: Preetum Nakkiran

    Abstract: For a given distribution, learning algorithm, and performance metric, the rate of convergence (or data-scaling law) is the asymptotic behavior of the algorithm's test performance as a function of number of train samples. Many learning methods in both theory and practice have power-law rates, i.e. performance scales as $n^{-\alpha}$ for some $\alpha > 0$. Moreover, both theoreticians and practitioners are con…

    Submitted 9 November, 2021; originally announced November 2021.

  13. arXiv:2106.07682  [pdf, other]

    cs.LG stat.ML

    Revisiting Model Stitching to Compare Neural Representations

    Authors: Yamini Bansal, Preetum Nakkiran, Boaz Barak

    Abstract: We revisit and extend model stitching (Lenc & Vedaldi 2015) as a methodology to study the internal representations of neural networks. Given two trained and frozen models $A$ and $B$, we consider a "stitched model" formed by connecting the bottom-layers of $A$ to the top-layers of $B$, with a simple trainable layer between them. We argue that model stitching is a powerful and perhaps under-apprec…

    Submitted 14 June, 2021; originally announced June 2021.

  14. arXiv:2010.08127  [pdf, other]

    cs.LG cs.CV cs.NE math.ST stat.ML

    The Deep Bootstrap Framework: Good Online Learners are Good Offline Generalizers

    Authors: Preetum Nakkiran, Behnam Neyshabur, Hanie Sedghi

    Abstract: We propose a new framework for reasoning about generalization in deep learning. The core idea is to couple the Real World, where optimizers take stochastic gradient steps on the empirical loss, to an Ideal World, where optimizers take steps on the population loss. This leads to an alternate decomposition of test error into: (1) the Ideal World test error plus (2) the gap between the two worlds. If…

    Submitted 18 February, 2021; v1 submitted 15 October, 2020; originally announced October 2020.

    Comments: Accepted to ICLR 2021

  15. arXiv:2009.08092  [pdf, other]

    cs.LG cs.NE math.ST stat.ML

    Distributional Generalization: A New Kind of Generalization

    Authors: Preetum Nakkiran, Yamini Bansal

    Abstract: We introduce a new notion of generalization -- Distributional Generalization -- which roughly states that outputs of a classifier at train and test time are close *as distributions*, as opposed to close in just their average error. For example, if we mislabel 30% of dogs as cats in the train set of CIFAR-10, then a ResNet trained to interpolation will in fact mislabel roughly 30% of dogs as cats o…

    Submitted 14 October, 2020; v1 submitted 17 September, 2020; originally announced September 2020.

    Comments: Co-first authors. V2: Intro shortened; no new results

  16. arXiv:2005.07360  [pdf, other]

    cs.LG cs.NE stat.ML

    Learning Rate Annealing Can Provably Help Generalization, Even for Convex Problems

    Authors: Preetum Nakkiran

    Abstract: The learning rate schedule can significantly affect generalization performance in modern neural networks, but the reasons for this are not yet understood. Li-Wei-Ma (2019) recently proved this behavior can exist in a simplified non-convex neural-network setting. In this note, we show that this phenomenon can exist even for convex learning problems -- in particular, linear regression in 2 dimensions.…

    Submitted 15 May, 2020; originally announced May 2020.

    Comments: 4 pages plus appendix

  17. arXiv:2003.01897  [pdf, other]

    cs.LG cs.NE math.ST stat.ML

    Optimal Regularization Can Mitigate Double Descent

    Authors: Preetum Nakkiran, Prayaag Venkat, Sham Kakade, Tengyu Ma

    Abstract: Recent empirical and theoretical studies have shown that many learning algorithms -- from linear regression to neural networks -- can have test performance that is non-monotonic in quantities such as the sample size and model size. This striking phenomenon, often referred to as "double descent", has raised the question of whether we need to re-think our current understanding of generalization. In this work,…

    Submitted 29 April, 2021; v1 submitted 4 March, 2020; originally announced March 2020.

    Comments: v2: Accepted to ICLR 2021. Minor edits to Intro and Appendix

  18. arXiv:1912.07242  [pdf, other]

    stat.ML cs.LG cs.NE math.ST

    More Data Can Hurt for Linear Regression: Sample-wise Double Descent

    Authors: Preetum Nakkiran

    Abstract: In this expository note we describe a surprising phenomenon in overparameterized linear regression, where the dimension exceeds the number of samples: there is a regime where the test risk of the estimator found by gradient descent increases with additional samples. In other words, more data actually hurts the estimator. This behavior is implicit in a recent line of theoretical works analyzing "do…

    Submitted 16 December, 2019; originally announced December 2019.

  19. arXiv:1912.02292  [pdf, other]

    cs.LG cs.CV cs.NE stat.ML

    Deep Double Descent: Where Bigger Models and More Data Hurt

    Authors: Preetum Nakkiran, Gal Kaplun, Yamini Bansal, Tristan Yang, Boaz Barak, Ilya Sutskever

    Abstract: We show that a variety of modern deep learning tasks exhibit a "double-descent" phenomenon where, as we increase model size, performance first gets worse and then gets better. Moreover, we show that double descent occurs not just as a function of model size, but also as a function of the number of training epochs. We unify the above phenomena by defining a new complexity measure we call the effect…

    Submitted 4 December, 2019; originally announced December 2019.

    Comments: G.K. and Y.B. contributed equally

  20. arXiv:1905.11604  [pdf, other]

    cs.LG cs.NE stat.ML

    SGD on Neural Networks Learns Functions of Increasing Complexity

    Authors: Preetum Nakkiran, Gal Kaplun, Dimitris Kalimeris, Tristan Yang, Benjamin L. Edelman, Fred Zhang, Boaz Barak

    Abstract: We perform an experimental study of the dynamics of Stochastic Gradient Descent (SGD) in learning deep neural networks for several real and synthetic classification tasks. We show that in the initial epochs, almost all of the performance improvement of the classifier obtained by SGD can be explained by a linear classifier. More generally, we give evidence for the hypothesis that, as iterations pro…

    Submitted 28 May, 2019; originally announced May 2019.

    Comments: Submitted to NeurIPS 2019

  21. arXiv:1902.01086  [pdf, other]

    stat.ML cs.LG

    Computational Limitations in Robust Classification and Win-Win Results

    Authors: Akshay Degwekar, Preetum Nakkiran, Vinod Vaikuntanathan

    Abstract: We continue the study of statistical/computational tradeoffs in learning robust classifiers, following the recent work of Bubeck, Lee, Price and Razenshteyn who showed examples of classification tasks where (a) an efficient robust classifier exists, in the small-perturbation regime; (b) a non-robust classifier can be learned efficiently; but (c) it is computationally hard to learn a robust classif…

    Submitted 5 June, 2019; v1 submitted 4 February, 2019; originally announced February 2019.

    Comments: Merge of [DegwekarVaikuntanathan19](arXiv:1902.01086) and [Nakkiran19](arXiv:1901.00532)

  22. arXiv:1901.00532  [pdf, ps, other]

    cs.LG cs.CC stat.ML

    Adversarial Robustness May Be at Odds With Simplicity

    Authors: Preetum Nakkiran

    Abstract: Current techniques in machine learning are so far unable to learn classifiers that are robust to adversarial perturbations. However, they are able to learn non-robust classifiers with very high accuracy, even in the presence of random perturbations. Towards explaining this gap, we highlight the hypothesis that…

    Submitted 2 January, 2019; originally announced January 2019.

    Comments: welcome

  23. arXiv:1809.05596  [pdf, ps, other]

    stat.ME cs.LG math.ST

    The Generic Holdout: Preventing False-Discoveries in Adaptive Data Science

    Authors: Preetum Nakkiran, Jarosław Błasiok

    Abstract: Adaptive data analysis has posed a challenge to science due to its ability to generate false hypotheses on moderately large data sets. In general, with non-adaptive data analyses (where queries to the data are generated without being influenced by answers to previous queries) a data set containing $n$ samples may support exponentially many queries in $n$. This number reduces to linearly many under…

    Submitted 14 September, 2018; originally announced September 2018.