Showing 1–50 of 87 results for author: Sohl-dickstein, J

  1. arXiv:2404.03626  [pdf, other]

    cs.CL cs.LG

    Training LLMs over Neurally Compressed Text

    Authors: Brian Lester, Jaehoon Lee, Alex Alemi, Jeffrey Pennington, Adam Roberts, Jascha Sohl-Dickstein, Noah Constant

    Abstract: In this paper, we explore the idea of training large language models (LLMs) over highly compressed text. While standard subword tokenizers compress text by a small factor, neural text compressors can achieve much higher rates of compression. If it were possible to train LLMs directly over neurally compressed text, this would confer advantages in training and serving efficiency, as well as easier h…

    Submitted 4 April, 2024; originally announced April 2024.

  2. arXiv:2402.06184  [pdf, other]

    cs.LG cs.NE nlin.CD

    The boundary of neural network trainability is fractal

    Authors: Jascha Sohl-Dickstein

    Abstract: Some fractals -- for instance those associated with the Mandelbrot and quadratic Julia sets -- are computed by iterating a function, and identifying the boundary between hyperparameters for which the resulting series diverges or remains bounded. Neural network training similarly involves iterating an update function (e.g. repeated steps of gradient descent), can result in convergent or divergent b…

    Submitted 8 February, 2024; originally announced February 2024.

    Comments: 3 pages, mesmerizing fractals

  3. arXiv:2312.06585  [pdf, other]

    cs.LG

    Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models

    Authors: Avi Singh, John D. Co-Reyes, Rishabh Agarwal, Ankesh Anand, Piyush Patil, Xavier Garcia, Peter J. Liu, James Harrison, Jaehoon Lee, Kelvin Xu, Aaron Parisi, Abhishek Kumar, Alex Alemi, Alex Rizkowsky, Azade Nova, Ben Adlam, Bernd Bohnet, Gamaleldin Elsayed, Hanie Sedghi, Igor Mordatch, Isabelle Simpson, Izzeddin Gur, Jasper Snoek, Jeffrey Pennington, Jiri Hron, et al. (16 additional authors not shown)

    Abstract: Fine-tuning language models (LMs) on human-generated data remains a prevalent practice. However, the performance of such models is often limited by the quantity and diversity of high-quality human data. In this paper, we explore whether we can go beyond human data on tasks where we have access to scalar feedback, for example, on math problems where one can verify correctness. To do so, we investig…

    Submitted 17 April, 2024; v1 submitted 11 December, 2023; originally announced December 2023.

    Comments: Accepted to TMLR. Camera-ready version. First three authors contributed equally

  4. arXiv:2311.07587  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG

    Frontier Language Models are not Robust to Adversarial Arithmetic, or "What do I need to say so you agree 2+2=5?"

    Authors: C. Daniel Freeman, Laura Culp, Aaron Parisi, Maxwell L Bileschi, Gamaleldin F Elsayed, Alex Rizkowsky, Isabelle Simpson, Alex Alemi, Azade Nova, Ben Adlam, Bernd Bohnet, Gaurav Mishra, Hanie Sedghi, Igor Mordatch, Izzeddin Gur, Jaehoon Lee, JD Co-Reyes, Jeffrey Pennington, Kelvin Xu, Kevin Swersky, Kshiteej Mahajan, Lechao Xiao, Rosanne Liu, Simon Kornblith, Noah Constant, et al. (5 additional authors not shown)

    Abstract: We introduce and study the problem of adversarial arithmetic, which provides a simple yet challenging testbed for language model alignment. This problem is comprised of arithmetic questions posed in natural language, with an arbitrary adversarial string inserted before the question is complete. Even in the simple setting of 1-digit addition problems, it is easy to find adversarial prompts that mak…

    Submitted 15 November, 2023; v1 submitted 8 November, 2023; originally announced November 2023.

  5. arXiv:2311.02462  [pdf, ps, other]

    cs.AI

    Levels of AGI for Operationalizing Progress on the Path to AGI

    Authors: Meredith Ringel Morris, Jascha Sohl-dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Farabet, Shane Legg

    Abstract: We propose a framework for classifying the capabilities and behavior of Artificial General Intelligence (AGI) models and their precursors. This framework introduces levels of AGI performance, generality, and autonomy, providing a common language to compare models, assess risks, and measure progress along the path to AGI. To develop our framework, we analyze existing definitions of AGI, and distill…

    Submitted 5 June, 2024; v1 submitted 4 November, 2023; originally announced November 2023.

    Comments: version 4 - Position Paper accepted to ICML 2024. Note that due to ICML position paper titling format requirements, the title has changed slightly from that of the original arXiv pre-print. The original pre-print title was "Levels of AGI: Operationalizing Progress on the Path to AGI" but the official published title for ICML 2024 is "Levels of AGI for Operationalizing Progress on the Path to AGI"

    Journal ref: Proceedings of ICML 2024

  6. arXiv:2309.14322  [pdf, other]

    cs.LG

    Small-scale proxies for large-scale Transformer training instabilities

    Authors: Mitchell Wortsman, Peter J. Liu, Lechao Xiao, Katie Everett, Alex Alemi, Ben Adlam, John D. Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, Simon Kornblith

    Abstract: Teams that have trained large Transformer-based models have reported training instabilities at large scale that did not appear when training with the same hyperparameters at smaller scales. Although the causes of such instabilities are of scientific interest, the amount of resources required to reproduce them has made investigation difficult. In this work, we seek ways to reproduce and study train…

    Submitted 16 October, 2023; v1 submitted 25 September, 2023; originally announced September 2023.

  7. arXiv:2304.12180  [pdf, other]

    cs.NE cs.AI cs.LG

    Variance-Reduced Gradient Estimation via Noise-Reuse in Online Evolution Strategies

    Authors: Oscar Li, James Harrison, Jascha Sohl-Dickstein, Virginia Smith, Luke Metz

    Abstract: Unrolled computation graphs are prevalent throughout machine learning but present challenges to automatic differentiation (AD) gradient estimation methods when their loss functions exhibit extreme local sensitivity, discontinuity, or blackbox characteristics. In such scenarios, online evolution strategies methods are a more capable alternative, while being more parallelizable than vanilla evolutio…

    Submitted 9 December, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

    Comments: NeurIPS 2023. 41 pages. Code available at https://github.com/OscarcarLi/Noise-Reuse-Evolution-Strategies

  8. arXiv:2302.11552  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    Reduce, Reuse, Recycle: Compositional Generation with Energy-Based Diffusion Models and MCMC

    Authors: Yilun Du, Conor Durkan, Robin Strudel, Joshua B. Tenenbaum, Sander Dieleman, Rob Fergus, Jascha Sohl-Dickstein, Arnaud Doucet, Will Grathwohl

    Abstract: Since their introduction, diffusion models have quickly become the prevailing approach to generative modeling in many domains. They can be interpreted as learning the gradients of a time-varying sequence of log-probability density functions. This interpretation has motivated classifier-based and classifier-free guidance as methods for post-hoc control of diffusion models. In this work, we build up…

    Submitted 18 November, 2023; v1 submitted 22 February, 2023; originally announced February 2023.

    Comments: ICML 2023, Project Webpage: https://energy-based-model.github.io/reduce-reuse-recycle/

  9. arXiv:2212.04458  [pdf, other]

    cs.LG cs.AI cs.NE stat.ML

    General-Purpose In-Context Learning by Meta-Learning Transformers

    Authors: Louis Kirsch, James Harrison, Jascha Sohl-Dickstein, Luke Metz

    Abstract: Modern machine learning requires system designers to specify aspects of the learning pipeline, such as losses, architectures, and optimizers. Meta-learning, or learning-to-learn, instead aims to learn those aspects, and promises to unlock greater capabilities with less manual effort. One particularly ambitious goal of meta-learning is to train general-purpose in-context learning algorithms from sc…

    Submitted 9 January, 2024; v1 submitted 8 December, 2022; originally announced December 2022.

    Comments: Published at the NeurIPS 2022 Workshop on Meta-Learning. Full version currently under review

  10. arXiv:2211.09760  [pdf, other]

    cs.LG math.OC stat.ML

    VeLO: Training Versatile Learned Optimizers by Scaling Up

    Authors: Luke Metz, James Harrison, C. Daniel Freeman, Amil Merchant, Lucas Beyer, James Bradbury, Naman Agrawal, Ben Poole, Igor Mordatch, Adam Roberts, Jascha Sohl-Dickstein

    Abstract: While deep learning models have replaced hand-designed features across many domains, these models are still trained with hand-designed optimizers. In this work, we leverage the same scaling approach behind the success of deep learning to learn versatile optimizers. We train an optimizer for deep learning which is itself a small neural network that ingests gradients and outputs parameter updates. M…

    Submitted 17 November, 2022; originally announced November 2022.

  11. arXiv:2209.11208  [pdf, other]

    cs.LG math.OC stat.ML

    A Closer Look at Learned Optimization: Stability, Robustness, and Inductive Biases

    Authors: James Harrison, Luke Metz, Jascha Sohl-Dickstein

    Abstract: Learned optimizers -- neural networks that are trained to act as optimizers -- have the potential to dramatically accelerate training of machine learning models. However, even when meta-trained across thousands of tasks at huge computational expense, blackbox learned optimizers often struggle with stability and generalization when applied to tasks unlike those in their meta-training set. In this p…

    Submitted 22 September, 2022; originally announced September 2022.

    Comments: NeurIPS 2022

  12. arXiv:2207.10342  [pdf, ps, other]

    cs.CL cs.AI

    Language Model Cascades

    Authors: David Dohan, Winnie Xu, Aitor Lewkowycz, Jacob Austin, David Bieber, Raphael Gontijo Lopes, Yuhuai Wu, Henryk Michalewski, Rif A. Saurous, Jascha Sohl-dickstein, Kevin Murphy, Charles Sutton

    Abstract: Prompted models have demonstrated impressive few-shot learning abilities. Repeated interactions at test-time with a single model, or the composition of multiple models together, further expands capabilities. These compositions are probabilistic models, and may be expressed in the language of graphical models with random variables whose values are complex data types such as strings. Cases with cont…

    Submitted 28 July, 2022; v1 submitted 21 July, 2022; originally announced July 2022.

    Comments: Presented as spotlight at the Beyond Bayes workshop at ICML 2022 (https://beyond-bayes.github.io)

  13. arXiv:2206.08720  [pdf, other]

    cs.LG cs.AI stat.ML

    Fast Finite Width Neural Tangent Kernel

    Authors: Roman Novak, Jascha Sohl-Dickstein, Samuel S. Schoenholz

    Abstract: The Neural Tangent Kernel (NTK), defined as $Θ_θ^f(x_1, x_2) = \left[\partial f(θ, x_1)\big/\partial θ\right] \left[\partial f(θ, x_2)\big/\partial θ\right]^T$ where $\left[\partial f(θ, \cdot)\big/\partial θ\right]$ is a neural network (NN) Jacobian, has emerged as a central object of study in deep learning. In the infinite width limit, the NTK can sometimes be computed analytically and is useful…

    Submitted 17 June, 2022; originally announced June 2022.

    Comments: Published as a conference paper at ICML 2022

  14. arXiv:2206.07673  [pdf, other]

    stat.ML cs.LG

    Wide Bayesian neural networks have a simple weight posterior: theory and accelerated sampling

    Authors: Jiri Hron, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein

    Abstract: We introduce repriorisation, a data-dependent reparameterisation which transforms a Bayesian neural network (BNN) posterior to a distribution whose KL divergence to the BNN prior vanishes as layer widths grow. The repriorisation map acts directly on parameters, and its analytic simplicity complements the known neural network Gaussian process (NNGP) behaviour of wide BNNs in function space. Exploit…

    Submitted 15 June, 2022; originally announced June 2022.

    Comments: ICML 2022

  15. arXiv:2206.04615  [pdf, other]

    cs.CL cs.AI cs.CY cs.LG stat.ML

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

    Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-futur…

    Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

    Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

    Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

  16. arXiv:2203.11860  [pdf, other]

    cs.LG cs.NE math.OC stat.ML

    Practical tradeoffs between memory, compute, and performance in learned optimizers

    Authors: Luke Metz, C. Daniel Freeman, James Harrison, Niru Maheswaranathan, Jascha Sohl-Dickstein

    Abstract: Optimization plays a costly and crucial role in developing machine learning systems. In learned optimizers, the few hyperparameters of commonly used hand-designed optimizers, e.g. Adam or SGD, are replaced with flexible parametric functions. The parameters of these functions are then optimized so that the resulting learned optimizer minimizes a target loss on a chosen class of models. Learned opti…

    Submitted 16 July, 2022; v1 submitted 22 March, 2022; originally announced March 2022.

  17. arXiv:2112.13835  [pdf, other]

    cs.LG stat.ML

    Unbiased Gradient Estimation in Unrolled Computation Graphs with Persistent Evolution Strategies

    Authors: Paul Vicol, Luke Metz, Jascha Sohl-Dickstein

    Abstract: Unrolled computation graphs arise in many scenarios, including training RNNs, tuning hyperparameters through unrolled optimization, and training learned optimizers. Current approaches to optimizing parameters in such computation graphs suffer from high variance gradients, bias, slow updates, or large memory usage. We introduce a method called Persistent Evolution Strategies (PES), which divides th…

    Submitted 27 December, 2021; originally announced December 2021.

    Comments: ICML 2021

  18. arXiv:2112.02721  [pdf, other]

    cs.CL cs.AI cs.LG

    NL-Augmenter: A Framework for Task-Sensitive Natural Language Augmentation

    Authors: Kaustubh D. Dhole, Varun Gangal, Sebastian Gehrmann, Aadesh Gupta, Zhenhao Li, Saad Mahamood, Abinaya Mahendiran, Simon Mille, Ashish Shrivastava, Samson Tan, Tongshuang Wu, Jascha Sohl-Dickstein, Jinho D. Choi, Eduard Hovy, Ondrej Dusek, Sebastian Ruder, Sajant Anand, Nagender Aneja, Rabin Banjade, Lisa Barthe, Hanna Behnke, Ian Berlot-Attwell, Connor Boyle, Caroline Brun, Marco Antonio Sobrevilla Cabezudo, et al. (101 additional authors not shown)

    Abstract: Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Python-based natural language augmentation framework which supports the creation of both transformations (modifications to the data) and filters (data split…

    Submitted 11 October, 2022; v1 submitted 5 December, 2021; originally announced December 2021.

    Comments: 39 pages, repository at https://github.com/GEM-benchmark/NL-Augmenter

  19. arXiv:2110.01765  [pdf, other]

    cs.LG cs.AI cs.NE

    Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping

    Authors: James Martens, Andy Ballard, Guillaume Desjardins, Grzegorz Swirszcz, Valentin Dalibard, Jascha Sohl-Dickstein, Samuel S. Schoenholz

    Abstract: Using an extended and formalized version of the Q/C map analysis of Poole et al. (2016), along with Neural Tangent Kernel theory, we identify the main pathologies present in deep networks that prevent them from training fast and generalizing to unseen data, and show how these can be avoided by carefully controlling the "shape" of the network's initialization-time kernel function. We then develop a…

    Submitted 4 October, 2021; originally announced October 2021.

  20. arXiv:2101.07367  [pdf, other]

    cs.LG cs.NE

    Training Learned Optimizers with Randomly Initialized Learned Optimizers

    Authors: Luke Metz, C. Daniel Freeman, Niru Maheswaranathan, Jascha Sohl-Dickstein

    Abstract: Learned optimizers are increasingly effective, with performance exceeding that of hand designed optimizers such as Adam (Kingma and Ba, 2014) on specific tasks (Metz et al., 2019). Despite the potential gains available, in current work the meta-training (or 'outer-training') of the learned optimizer is performed by a hand-designed optimizer, or by an optimizer trained by a hand-designed…

    Submitted 14 January, 2021; originally announced January 2021.

  21. arXiv:2012.03837  [pdf, other]

    cs.LG cs.AI cs.NE

    Parallel Training of Deep Networks with Local Updates

    Authors: Michael Laskin, Luke Metz, Seth Nabarro, Mark Saroufim, Badreddine Noune, Carlo Luschi, Jascha Sohl-Dickstein, Pieter Abbeel

    Abstract: Deep learning models trained on large data sets have been widely successful in both vision and language domains. As state-of-the-art deep learning architectures have continued to grow in parameter count so have the compute budgets and times required to train them, increasing the need for compute-efficient methods that parallelize training. Two common approaches to parallelize the training of deep…

    Submitted 15 June, 2021; v1 submitted 7 December, 2020; originally announced December 2020.

    Comments: First two authors - Michael Laskin and Luke Metz - contributed equally. Order was determined by a coin flip

  22. arXiv:2011.13456  [pdf, other]

    cs.LG stat.ML

    Score-Based Generative Modeling through Stochastic Differential Equations

    Authors: Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole

    Abstract: Creating noise from data is easy; creating data from noise is generative modeling. We present a stochastic differential equation (SDE) that smoothly transforms a complex data distribution to a known prior distribution by slowly injecting noise, and a corresponding reverse-time SDE that transforms the prior distribution back into the data distribution by slowly removing the noise. Crucially, the re…

    Submitted 10 February, 2021; v1 submitted 26 November, 2020; originally announced November 2020.

    Comments: ICLR 2021 (Oral)

  23. arXiv:2011.06006  [pdf, other]

    cs.LG

    Towards NNGP-guided Neural Architecture Search

    Authors: Daniel S. Park, Jaehoon Lee, Daiyi Peng, Yuan Cao, Jascha Sohl-Dickstein

    Abstract: The predictions of wide Bayesian neural networks are described by a Gaussian process, known as the Neural Network Gaussian Process (NNGP). Analytic forms for NNGP kernels are known for many models, but computing the exact kernel for convolutional architectures is prohibitively expensive. One can obtain effective approximations of these kernels through Monte-Carlo estimation using finite networks a…

    Submitted 11 November, 2020; originally announced November 2020.

    Comments: 13 + 6 pages, 19 figures; open-source code available at https://github.com/google-research/google-research/tree/master/nngp_nas

  24. arXiv:2011.02159  [pdf, other]

    cs.LG cs.NE stat.ML

    Reverse engineering learned optimizers reveals known and novel mechanisms

    Authors: Niru Maheswaranathan, David Sussillo, Luke Metz, Ruoxi Sun, Jascha Sohl-Dickstein

    Abstract: Learned optimizers are algorithms that can themselves be trained to solve optimization problems. In contrast to baseline optimizers (such as momentum or Adam) that use simple update rules derived from theoretical principles, learned optimizers use flexible, high-dimensional, nonlinear parameterizations. Although this can lead to better performance in certain settings, their inner workings remain a…

    Submitted 7 December, 2021; v1 submitted 4 November, 2020; originally announced November 2020.

    Comments: Thirty-Fifth Conference on Neural Information Processing Systems. 2021

  25. arXiv:2010.10687  [pdf, other]

    cs.LG cs.NE

    Is Batch Norm unique? An empirical investigation and prescription to emulate the best properties of common normalizers without batch dependence

    Authors: Vinay Rao, Jascha Sohl-Dickstein

    Abstract: We perform an extensive empirical study of the statistical properties of Batch Norm and other common normalizers. This includes an examination of the correlation between representations of minibatches, gradient norms, and Hessian spectra both at initialization and over the course of training. Through this analysis, we identify several statistical properties which appear linked to Batch Norm's supe…

    Submitted 20 October, 2020; originally announced October 2020.

  26. arXiv:2009.11243  [pdf, other]

    cs.LG cs.NE stat.ML

    Tasks, stability, architecture, and compute: Training more effective learned optimizers, and using them to train themselves

    Authors: Luke Metz, Niru Maheswaranathan, C. Daniel Freeman, Ben Poole, Jascha Sohl-Dickstein

    Abstract: Much as replacing hand-designed features with learned functions has revolutionized how we solve perceptual tasks, we believe learned algorithms will transform how we train models. In this work we focus on general-purpose learned optimizers capable of training a wide variety of problems with no user-specified hyperparameters. We introduce a new, neural network parameterized, hierarchical optimizer…

    Submitted 23 September, 2020; originally announced September 2020.

  27. arXiv:2008.07545  [pdf, other]

    cs.LG stat.ML

    Whitening and second order optimization both make information in the dataset unusable during training, and can reduce or prevent generalization

    Authors: Neha S. Wadia, Daniel Duckworth, Samuel S. Schoenholz, Ethan Dyer, Jascha Sohl-Dickstein

    Abstract: Machine learning is predicated on the concept of generalization: a model achieving low error on a sufficiently large training set should also perform well on novel samples from the same distribution. We show that both data whitening and second order optimization can harm or entirely prevent generalization. In general, model training harnesses information contained in the sample-sample second momen…

    Submitted 19 July, 2021; v1 submitted 17 August, 2020; originally announced August 2020.

    Comments: 13+10 pages, 10 figures; minor textual changes and some reorganization, one new figure and a new proof of main theorem added

  28. arXiv:2007.15801  [pdf, other]

    cs.LG stat.ML

    Finite Versus Infinite Neural Networks: an Empirical Study

    Authors: Jaehoon Lee, Samuel S. Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, Jascha Sohl-Dickstein

    Abstract: We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods. By doing so, we resolve a variety of open questions related to the study of infinitely wide neural networks. Our experimental results include: kernel methods outperform fully-connected finite-width networks, but underperform convolutional finite width networks; neu…

    Submitted 8 September, 2020; v1 submitted 30 July, 2020; originally announced July 2020.

    Comments: 17+11 pages; v2 references added, minor improvements

  29. arXiv:2007.09240  [pdf, other]

    cs.LG stat.ML

    A new method for parameter estimation in probabilistic models: Minimum probability flow

    Authors: Jascha Sohl-Dickstein, Peter Battaglino, Michael R. DeWeese

    Abstract: Fitting probabilistic models to data is often difficult, due to the general intractability of the partition function. We propose a new parameter fitting method, Minimum Probability Flow (MPF), which is applicable to any parametric model. We demonstrate parameter estimation using MPF in two cases: a continuous state space model, and an Ising spin glass. In the latter case it outperforms current tec…

    Submitted 17 July, 2020; originally announced July 2020.

    Comments: Originally published 2011. Uploaded to arXiv 2020. arXiv admin note: text overlap with arXiv:0906.4779, arXiv:1205.4295

  30. arXiv:2006.10541  [pdf, other]

    stat.ML cs.LG

    Exact posterior distributions of wide Bayesian neural networks

    Authors: Jiri Hron, Yasaman Bahri, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein

    Abstract: Recent work has shown that the prior over functions induced by a deep Bayesian neural network (BNN) behaves as a Gaussian process (GP) as the width of all layers becomes large. However, many BNN applications are concerned with the BNN function space posterior. While some empirical evidence of the posterior convergence was provided in the original works of Neal (1996) and Matthews et al. (2018), it…

    Submitted 26 November, 2020; v1 submitted 18 June, 2020; originally announced June 2020.

  31. arXiv:2006.10540  [pdf, other]

    stat.ML cs.LG

    Infinite attention: NNGP and NTK for deep attention networks

    Authors: Jiri Hron, Yasaman Bahri, Jascha Sohl-Dickstein, Roman Novak

    Abstract: There is a growing amount of literature on the relationship between wide neural networks (NNs) and Gaussian processes (GPs), identifying an equivalence between the two for a variety of NN architectures. This equivalence enables, for instance, accurate approximation of the behaviour of wide Bayesian NNs without MCMC or variational approximations, or characterisation of the distribution of randomly…

    Submitted 18 June, 2020; originally announced June 2020.

    Comments: ICML 2020

  32. arXiv:2005.06553  [pdf, other]

    stat.CO

    Two equalities expressing the determinant of a matrix in terms of expectations over matrix-vector products

    Authors: Jascha Sohl-Dickstein

    Abstract: We introduce two equations expressing the inverse determinant of a full rank matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ in terms of expectations over matrix-vector products. The first relationship is $|\mathrm{det} (\mathbf{A})|^{-1} = \mathbb{E}_{\mathbf{s} \sim \mathcal{S}^{n-1}}\bigl[\, \Vert \mathbf{As}\Vert^{-n} \bigr]$, where expectations are over vectors drawn uniformly on the surface…

    Submitted 19 June, 2020; v1 submitted 13 May, 2020; originally announced May 2020.

  33. arXiv:2003.06060  [pdf, other]

    cs.LG cs.AI stat.ML

    Your GAN is Secretly an Energy-based Model and You Should use Discriminator Driven Latent Sampling

    Authors: Tong Che, Ruixiang Zhang, Jascha Sohl-Dickstein, Hugo Larochelle, Liam Paull, Yuan Cao, Yoshua Bengio

    Abstract: We show that the sum of the implicit generator log-density $\log p_g$ of a GAN with the logit score of the discriminator defines an energy function which yields the true data density when the generator is imperfect but the discriminator is optimal, thus making it possible to improve on the typical generator (with implicit density $p_g$). To make that practical, we show that sampling from this modi…

    Submitted 7 July, 2021; v1 submitted 12 March, 2020; originally announced March 2020.

  34. arXiv:2003.02218  [pdf, other]

    stat.ML cs.LG

    The large learning rate phase of deep learning: the catapult mechanism

    Authors: Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, Guy Gur-Ari

    Abstract: The choice of initial learning rate can have a profound effect on the performance of deep networks. We present a class of neural networks with solvable training dynamics, and confirm their predictions empirically in practical deep learning settings. The networks exhibit sharply distinct behaviors at small and large learning rates. The two regimes are separated by a phase transition. In the small l…

    Submitted 4 March, 2020; originally announced March 2020.

    Comments: 25 pages, 19 figures

  35. arXiv:2002.11887  [pdf, other]

    cs.LG stat.ML

    Using a thousand optimization tasks to learn hyperparameter search strategies

    Authors: Luke Metz, Niru Maheswaranathan, Ruoxi Sun, C. Daniel Freeman, Ben Poole, Jascha Sohl-Dickstein

    Abstract: We present TaskSet, a dataset of tasks for use in training and evaluating optimizers. TaskSet is unique in its size and diversity, containing over a thousand tasks ranging from image classification with fully connected or convolutional neural networks, to variational autoencoders, to non-volume preserving flows on a variety of datasets. As an example application of such a dataset we explore meta-l…

    Submitted 31 March, 2020; v1 submitted 26 February, 2020; originally announced February 2020.

  36. arXiv:2001.07301  [pdf, other]

    cs.LG stat.ML

    On the infinite width limit of neural networks with a standard parameterization

    Authors: Jascha Sohl-Dickstein, Roman Novak, Samuel S. Schoenholz, Jaehoon Lee

    Abstract: There are currently two parameterizations used to derive fixed kernels corresponding to infinite width neural networks, the NTK (Neural Tangent Kernel) parameterization and the naive standard parameterization. However, the extrapolation of both of these parameterizations to infinite width is problematic. The standard parameterization leads to a divergent neural tangent kernel while the NTK paramet…

    Submitted 18 April, 2020; v1 submitted 20 January, 2020; originally announced January 2020.

  37. arXiv:1912.02803  [pdf, other]

    stat.ML cs.LG

    Neural Tangents: Fast and Easy Infinite Neural Networks in Python

    Authors: Roman Novak, Lechao Xiao, Jiri Hron, Jaehoon Lee, Alexander A. Alemi, Jascha Sohl-Dickstein, Samuel S. Schoenholz

    Abstract: Neural Tangents is a library designed to enable research into infinite-width neural networks. It provides a high-level API for specifying complex and hierarchical neural network architectures. These networks can then be trained and evaluated either at finite-width as usual or in their infinite-width limit. Infinite-width networks can be trained analytically using exact Bayesian inference or using…

    Submitted 5 December, 2019; originally announced December 2019.

  38. arXiv:1909.04240  [pdf, other]

    cs.LG cs.NE stat.ML

    Neural reparameterization improves structural optimization

    Authors: Stephan Hoyer, Jascha Sohl-Dickstein, Sam Greydanus

    Abstract: Structural optimization is a popular method for designing objects such as bridge trusses, airplane wings, and optical devices. Unfortunately, the quality of solutions depends heavily on how the problem is parameterized. In this paper, we propose using the implicit bias over functions induced by neural networks to improve the parameterization of structural optimization. Rather than directly optimiz…

    Submitted 13 September, 2019; v1 submitted 9 September, 2019; originally announced September 2019.

  39. arXiv:1906.03367  [pdf, other]

    cs.LG stat.ML

    Using learned optimizers to make models robust to input noise

    Authors: Luke Metz, Niru Maheswaranathan, Jonathon Shlens, Jascha Sohl-Dickstein, Ekin D. Cubuk

    Abstract: State-of-the-art vision models can achieve superhuman performance on image classification tasks when testing and training data come from the same distribution. However, when models are tested on corrupted images (e.g. due to scale changes, translations, or shifts in brightness or contrast), performance degrades significantly. Here, we explore the possibility of meta-training a learned optimizer th…

    Submitted 7 June, 2019; originally announced June 2019.

  40. arXiv:1905.03776  [pdf, other]

    cs.LG cs.AI cs.CV stat.ML

    The Effect of Network Width on Stochastic Gradient Descent and Generalization: an Empirical Study

    Authors: Daniel S. Park, Jascha Sohl-Dickstein, Quoc V. Le, Samuel L. Smith

    Abstract: We investigate how the final parameters found by stochastic gradient descent are influenced by over-parameterization. We generate families of models by increasing the number of channels in a base network, and then perform a large hyper-parameter search to study how the test error depends on learning rate, batch size, and network width. We find that the optimal SGD hyper-parameters are determined b…

    Submitted 9 May, 2019; originally announced May 2019.

    Comments: 17 pages, 3 tables, 17 figures; accepted to ICML 2019

  41. arXiv:1903.07714  [pdf, other]

    cs.LG stat.ML

    A RAD approach to deep mixture models

    Authors: Laurent Dinh, Jascha Sohl-Dickstein, Hugo Larochelle, Razvan Pascanu

    Abstract: Flow based models such as Real NVP are an extremely powerful approach to density estimation. However, existing flow based models are restricted to transforming continuous densities over a continuous input space into similarly continuous distributions over continuous latent variables. This makes them poorly suited for modeling and representing discrete structures in data distributions, for example…

    Submitted 25 August, 2020; v1 submitted 18 March, 2019; originally announced March 2019.

    Comments: 18.5 pages of main content, 3 pages of appendices

  42. arXiv:1902.08129  [pdf, other]

    cs.NE cond-mat.dis-nn cs.LG math.DS

    A Mean Field Theory of Batch Normalization

    Authors: Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, Samuel S. Schoenholz

    Abstract: We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. Our theory shows that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initi…

    Submitted 5 March, 2019; v1 submitted 21 February, 2019; originally announced February 2019.

    Comments: To appear in ICLR 2019

  43. Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

    Authors: Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, Jeffrey Pennington

    Abstract: A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained…

    Submitted 8 December, 2019; v1 submitted 18 February, 2019; originally announced February 2019.

    Comments: 12+16 pages; open-source code available at https://github.com/google/neural-tangents; accepted to NeurIPS 2019

  44. arXiv:1901.03909  [pdf, other]

    stat.ML cs.LG

    Eliminating all bad Local Minima from Loss Landscapes without even adding an Extra Unit

    Authors: Jascha Sohl-Dickstein, Kenji Kawaguchi

    Abstract: Recent work has noted that all bad local minima can be removed from neural network loss landscapes, by adding a single unit with a particular parameterization. We show that the core technique from these papers can be used to remove all bad local minima from any loss landscape, so long as the global minimum has a loss of zero. This procedure does not require the addition of auxiliary units, or even…

    Submitted 12 January, 2019; originally announced January 2019.

  45. arXiv:1811.03600  [pdf, other]

    cs.LG stat.ML

    Measuring the Effects of Data Parallelism on Neural Network Training

    Authors: Christopher J. Shallue, Jaehoon Lee, Joseph Antognini, Jascha Sohl-Dickstein, Roy Frostig, George E. Dahl

    Abstract: Recent hardware developments have dramatically increased the scale of data parallelism available for neural network training. Among the simplest ways to harness next-generation hardware is to increase the batch size in standard mini-batch neural network training algorithms. In this work, we aim to experimentally characterize the effects of increasing the batch size on training time, as measured by…

    Submitted 18 July, 2019; v1 submitted 8 November, 2018; originally announced November 2018.

    Journal ref: Journal of Machine Learning Research 20 (2019) 1-49

  46. arXiv:1810.10180  [pdf, other]

    cs.NE stat.ML

    Understanding and correcting pathologies in the training of learned optimizers

    Authors: Luke Metz, Niru Maheswaranathan, Jeremy Nixon, C. Daniel Freeman, Jascha Sohl-Dickstein

    Abstract: Deep learning has shown that learned functions can dramatically outperform hand-designed functions on perceptual tasks. Analogously, this suggests that learned optimizers may similarly outperform current hand-designed optimizers, especially for specific problems. However, learned optimizers are notoriously difficult to train and have yet to demonstrate wall-clock speedups over hand-designed optimi…

    Submitted 7 June, 2019; v1 submitted 24 October, 2018; originally announced October 2018.

  47. arXiv:1810.05148  [pdf, other]

    stat.ML cs.AI cs.LG cs.NE

    Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes

    Authors: Roman Novak, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein

    Abstract: There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating the FCN, but by instead evaluating the corresponding GP. In this work, we derive an analogous…

    Submitted 21 August, 2020; v1 submitted 11 October, 2018; originally announced October 2018.

    Comments: Published as a conference paper at ICLR 2019

  48. arXiv:1806.11146  [pdf, other]

    cs.LG cs.CR cs.CV stat.ML

    Adversarial Reprogramming of Neural Networks

    Authors: Gamaleldin F. Elsayed, Ian Goodfellow, Jascha Sohl-Dickstein

    Abstract: Deep neural networks are susceptible to adversarial attacks. In computer vision, well-crafted perturbations to images can cause neural networks to make mistakes such as confusing a cat with a computer. Previous adversarial attacks have been designed to degrade performance of models or cause machine learning models to produce specific outputs chosen ahead of time by the attacker. We introduc…

    Submitted 29 November, 2018; v1 submitted 28 June, 2018; originally announced June 2018.

    Journal ref: International Conference on Learning Representations 2019

  49. arXiv:1806.10230  [pdf, other]

    cs.NE cs.LG stat.ML

    Guided evolutionary strategies: Augmenting random search with surrogate gradients

    Authors: Niru Maheswaranathan, Luke Metz, George Tucker, Dami Choi, Jascha Sohl-Dickstein

    Abstract: Many applications in machine learning require optimizing a function whose true gradient is unknown, but where surrogate gradient information (directions that may be correlated with, but not necessarily identical to, the true gradient) is available instead. This arises when an approximate gradient is easier to compute than the full gradient (e.g. in meta-learning or unrolled optimization), or when…

    Submitted 10 June, 2019; v1 submitted 26 June, 2018; originally announced June 2018.

    Comments: Published at ICML 2019

  50. arXiv:1806.09597  [pdf, other]

    cs.LG cs.AI stat.ML

    Stochastic natural gradient descent draws posterior samples in function space

    Authors: Samuel L. Smith, Daniel Duckworth, Semon Rezchikov, Quoc V. Le, Jascha Sohl-Dickstein

    Abstract: Recent work has argued that stochastic gradient descent can approximate the Bayesian uncertainty in model parameters near local minima. In this work we develop a similar correspondence for minibatch natural gradient descent (NGD). We prove that for sufficiently small learning rates, if the model predictions on the training set approach the true conditional distribution of labels given inputs, the…

    Submitted 28 November, 2018; v1 submitted 25 June, 2018; originally announced June 2018.

    Comments: Workshop on Bayesian Deep Learning (NeurIPS 2018)