Inferring interaction potentials from stochastic particle trajectories

Authors: Ella M. King, Megan C. Engel, Caroline Martin, Alp M. Sunol, Qian-Ze Zhu, Sam S. Schoenholz, Vinothan N. Manoharan, Michael P. Brenner

Abstract: Accurate interaction potentials between microscopic components such as colloidal particles or cells are crucial to understanding a range of processes, including colloidal crystallization, bacterial colony formation, and cancer metastasis. Even in systems where the precise interaction mechanisms are unknown, effective interactions can be measured to inform simulation and design. However, these measurements are difficult and time-intensive, and often require conditions that are drastically different from in situ conditions of the system of interest. Moreover, existing methods of measuring interparticle potentials rely on constraining a small number of particles at equilibrium, placing limits on which interactions can be measured. We introduce a method for inferring interaction potentials directly from trajectory data of interacting particles. We explicitly solve the equations of motion to find a form of the potential that maximizes the probability of observing a known trajectory. Our method is valid for systems both in and out of equilibrium, is well-suited to large numbers of particles interacting in typical system conditions, and does not assume a functional form of the interaction potential. We apply our method to infer the interactions of colloidal spheres from experimental data, successfully extracting the range and strength of a depletion interaction from the motion of the particles.

Submitted 3 June, 2024; originally announced June 2024.

arXiv:2312.05360 [pdf, other]

doi 10.1073/pnas.2311891121

Programmable patchy particles for materials design

Authors: Ella M. King, Chrisy Xiyu Du, Qian-Ze Zhu, Samuel S. Schoenholz, Michael P. Brenner

Abstract: Direct design of complex functional materials would revolutionize technologies ranging from printable organs to novel clean energy devices. However, even incremental steps towards designing functional materials have proven challenging. If the material is constructed from highly complex components, the design space of materials properties rapidly becomes too computationally expensive to search. On the other hand, very simple components such as uniform spherical particles are not powerful enough to capture rich functional behavior. Here, we introduce a differentiable materials design model with components that are simple enough to design yet powerful enough to capture complex materials properties: rigid bodies composed of spherical particles with directional interactions (patchy particles). We showcase the method with self-assembly designs ranging from open lattices to self-limiting clusters, all of which are notoriously challenging design goals to achieve using purely isotropic particles. By directly optimizing over the location and interaction of the patches on patchy particles using gradient descent, we dramatically reduce the computation time for finding the optimal building blocks.

Submitted 8 December, 2023; originally announced December 2023.

arXiv:2210.05546 [pdf, other]

What does a deep neural network confidently perceive? The effective dimension of high certainty class manifolds and their low confidence boundaries

Authors: Stanislav Fort, Ekin Dogus Cubuk, Surya Ganguli, Samuel S. Schoenholz

Abstract: Deep neural network classifiers partition input space into high confidence regions for each class. The geometry of these class manifolds (CMs) is widely studied and intimately related to model performance; for example, the margin depends on CM boundaries. We exploit the notions of Gaussian width and Gordon's escape theorem to tractably estimate the effective dimension of CMs and their boundaries through tomographic intersections with random affine subspaces of varying dimension. We show several connections between the dimension of CMs, generalization, and robustness. In particular we investigate how CM dimension depends on 1) the dataset, 2) architecture (including ResNet, WideResNet & Vision Transformer), 3) initialization, 4) stage of training, 5) class, 6) network width, 7) ensemble size, 8) label randomization, 9) training set size, and 10) robustness to data corruption. Together a picture emerges that higher performing and more robust models have higher dimensional CMs. Moreover, we offer a new perspective on ensembling via intersections of CMs. Our code is at https://github.com/stanislavfort/slice-dice-optimize/

Submitted 11 October, 2022; originally announced October 2022.

Comments: An extended version of /Slice, Dice, and Optimize: Measuring the Dimension of Neural Network Class Manifolds/

arXiv:2207.09432 [pdf, other]

Deep equilibrium networks are sensitive to initialization statistics

Authors: Atish Agarwala, Samuel S. Schoenholz

Abstract: Deep equilibrium networks (DEQs) are a promising way to construct models which trade off memory for compute. However, theoretical understanding of these models is still lacking compared to traditional networks, in part because of the repeated application of a single set of weights. We show that DEQs are sensitive to the higher order statistics of the matrix families from which they are initialized. In particular, initializing with orthogonal or symmetric matrices allows for greater stability in training. This gives us a practical prescription for initializations which allow for training with a broader range of initial weight scales.

Submitted 19 July, 2022; originally announced July 2022.

arXiv:2206.08720 [pdf, other]

Fast Finite Width Neural Tangent Kernel

Authors: Roman Novak, Jascha Sohl-Dickstein, Samuel S. Schoenholz

Abstract: The Neural Tangent Kernel (NTK), defined as $\Theta_\theta^f(x_1, x_2) = \left[\partial f(\theta, x_1)\big/\partial \theta\right] \left[\partial f(\theta, x_2)\big/\partial \theta\right]^T$ where $\left[\partial f(\theta, \cdot)\big/\partial \theta\right]$ is a neural network (NN) Jacobian, has emerged as a central object of study in deep learning. In the infinite width limit, the NTK can sometimes be computed analytically and is useful for understanding training and generalization of NN architectures. At finite widths, the NTK is also used to better initialize NNs, compare the conditioning across models, perform architecture search, and do meta-learning. Unfortunately, the finite width NTK is notoriously expensive to compute, which severely limits its practical utility. We perform the first in-depth analysis of the compute and memory requirements for NTK computation in finite width networks. Leveraging the structure of neural networks, we further propose two novel algorithms that change the exponent of the compute and memory requirements of the finite width NTK, dramatically improving efficiency. Our algorithms can be applied in a black box fashion to any differentiable function, including those implementing neural networks. We open-source our implementations within the Neural Tangents package (arXiv:1912.02803) at https://github.com/google/neural-tangents.

Submitted 17 June, 2022; originally announced June 2022.

Comments: Published as a conference paper at ICML 2022
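
Illustrative note on the definition quoted in the abstract above: the finite width NTK is the inner product of two parameter Jacobians. The sketch below is not the paper's algorithm and does not use the Neural Tangents package; it is a minimal pure-Python illustration that assumes a hypothetical two-parameter model `f(theta, x) = a * tanh(w * x)` and approximates the Jacobian with central finite differences instead of automatic differentiation. The names `f`, `jacobian`, and `empirical_ntk` are made up for this example.

```python
import math


def f(theta, x):
    """Hypothetical toy model with parameters theta = [w, a] (an assumption)."""
    w, a = theta
    return a * math.tanh(w * x)


def jacobian(theta, x, eps=1e-6):
    """Approximate d f(theta, x) / d theta by central finite differences."""
    grads = []
    for i in range(len(theta)):
        up = list(theta)
        down = list(theta)
        up[i] += eps
        down[i] -= eps
        grads.append((f(up, x) - f(down, x)) / (2.0 * eps))
    return grads


def empirical_ntk(theta, x1, x2):
    """Theta(x1, x2) = J(x1) J(x2)^T, which is a scalar for a scalar-output model."""
    j1 = jacobian(theta, x1)
    j2 = jacobian(theta, x2)
    return sum(g1 * g2 for g1, g2 in zip(j1, j2))


theta = [0.5, 2.0]
print(empirical_ntk(theta, 1.0, -0.3))
```

Even in this toy setting the cost is one Jacobian per input and one contraction per input pair, which gives a rough sense of why the naive computation becomes expensive for real networks with many parameters and outputs.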

arXiv:2206.04615 [pdf, other]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza , et al. (426 additional authors not shown)

Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline. Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

arXiv:2111.05803 [pdf, other]

Gradients are Not All You Need

Authors: Luke Metz, C. Daniel Freeman, Samuel S. Schoenholz, Tal Kachman

Abstract: Differentiable programming techniques are widely used in the community and are responsible for the machine learning renaissance of the past several decades. While these methods are powerful, they have limits. In this short report, we discuss a common chaos based failure mode which appears in a variety of differentiable circumstances, ranging from recurrent neural networks and numerical physics simulation to training learned optimizers. We trace this failure to the spectrum of the Jacobian of the system under study, and provide criteria for when a practitioner might expect this failure to spoil their differentiation based optimization algorithms.

Submitted 20 January, 2022; v1 submitted 10 November, 2021; originally announced November 2021.
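
Illustrative note on the failure mode described in the abstract above: when a system is iterated many times, the gradient with respect to the initial state is a product of one-step Jacobians, so its size is governed by their spectrum. The sketch below is not taken from the paper; it assumes the logistic map as a stand-in dynamical system and compares a parameter value with a stable fixed point (`r=2.5`) against a commonly used chaotic one (`r=3.9`). The function name `logistic_gradient` is hypothetical.

```python
def logistic_gradient(r, x0, steps):
    """Return (x_T, d x_T / d x_0) for the map x_{t+1} = r * x_t * (1 - x_t).

    The derivative is accumulated with the chain rule, i.e. as the product
    of the one-step Jacobians r * (1 - 2 * x_t) along the trajectory.
    """
    x, grad = x0, 1.0
    for _ in range(steps):
        grad *= r * (1.0 - 2.0 * x)  # one-step Jacobian at the current state
        x = r * x * (1.0 - x)        # advance the state
    return x, grad


_, stable_grad = logistic_gradient(r=2.5, x0=0.2, steps=100)
_, chaotic_grad = logistic_gradient(r=3.9, x0=0.2, steps=100)
print(abs(stable_grad), abs(chaotic_grad))
```

In the stable regime the one-step Jacobian has magnitude below one near the fixed point and the gradient shrinks toward zero; in the chaotic regime the product grows roughly exponentially with the number of steps, which is the kind of behavior that can spoil gradient-based optimization through long unrolls.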

arXiv:2110.01765 [pdf, other]

Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping

Authors: James Martens, Andy Ballard, Guillaume Desjardins, Grzegorz Swirszcz, Valentin Dalibard, Jascha Sohl-Dickstein, Samuel S. Schoenholz

Abstract: Using an extended and formalized version of the Q/C map analysis of Poole et al. (2016), along with Neural Tangent Kernel theory, we identify the main pathologies present in deep networks that prevent them from training fast and generalizing to unseen data, and show how these can be avoided by carefully controlling the "shape" of the network's initialization-time kernel function. We then develop a method called Deep Kernel Shaping (DKS), which accomplishes this using a combination of precise parameter initialization, activation function transformations, and small architectural tweaks, all of which preserve the model class. In our experiments we show that DKS enables SGD training of residual networks without normalization layers on Imagenet and CIFAR-10 classification tasks at speeds comparable to standard ResNetV2 and Wide-ResNet models, with only a small decrease in generalization performance. And when using K-FAC as the optimizer, we achieve similar results for networks without skip connections. Our results apply for a large variety of activation functions, including those which traditionally perform very badly, such as the logistic sigmoid. In addition to DKS, we contribute a detailed analysis of skip connections, normalization layers, special activation functions like RELU and SELU, and various initialization schemes, explaining their effectiveness as alternative (and ultimately incomplete) ways of "shaping" the network's initialization-time kernel.

Submitted 4 October, 2021; originally announced October 2021.

arXiv:2105.06305 [pdf, other]

doi 10.1016/j.cpc.2021.108232

dPV: An End-to-End Differentiable Solar-Cell Simulator

Authors: Sean Mann, Eric Fadel, Samuel S. Schoenholz, Ekin D. Cubuk, Steven G. Johnson, Giuseppe Romano

Abstract: We introduce dPV, an end-to-end differentiable photovoltaic (PV) cell simulator based on the drift-diffusion model and Beer-Lambert law for optical absorption. dPV is programmed in Python using JAX, an automatic differentiation (AD) library for scientific computing. Using AD coupled with the implicit function theorem, dPV computes the power conversion efficiency (PCE) of an input PV design as well as the derivative of the PCE with respect to any input parameters, all within comparable time of solving the forward problem. We show an example of perovskite solar-cell optimization and multi-parameter discovery, and compare results with random search and finite differences. The simulator can be integrated with optimization algorithms and neural networks, opening up possibilities for data-efficient optimization and parameter discovery.

Submitted 9 December, 2021; v1 submitted 13 May, 2021; originally announced May 2021.

arXiv:2102.03793 [pdf, other]

Tilting the playing field: Dynamical loss functions for machine learning

Authors: Miguel Ruiz-Garcia, Ge Zhang, Samuel S. Schoenholz, Andrea J. Liu

Abstract: We show that learning can be improved by using loss functions that evolve cyclically during training to emphasize one class at a time. In underparameterized networks, such dynamical loss functions can lead to successful training for networks that fail to find a deep minima of the standard cross-entropy loss. In overparameterized networks, dynamical loss functions can lead to better generalization. Improvement arises from the interplay of the changing loss landscape with the dynamics of the system as it evolves to minimize the loss. In particular, as the loss function oscillates, instabilities develop in the form of bifurcation cascades, which we study using the Hessian and Neural Tangent Kernel. Valleys in the landscape widen and deepen, and then narrow and rise as the loss landscape changes during a cycle. As the landscape narrows, the learning rate becomes too large and the network becomes unstable and bounces around the valley. This process ultimately pushes the system into deeper and wider regions of the loss landscape and is characterized by decreasing eigenvalues of the Hessian. This results in better regularized models with improved generalization performance.

Submitted 23 June, 2021; v1 submitted 7 February, 2021; originally announced February 2021.

arXiv:2010.15175 [pdf, other]

Self-assembling kinetics: Accessing a new design space via differentiable statistical-physics models

Authors: Carl P. Goodrich, Ella M. King, Samuel S. Schoenholz, Ekin D. Cubuk, Michael Brenner

Abstract: The inverse problem of designing component interactions to target emergent structure is fundamental to numerous applications in biotechnology, materials science, and statistical physics. Equally important is the inverse problem of designing emergent kinetics, but this has received considerably less attention. Using recent advances in automatic differentiation, we show how kinetic pathways can be precisely designed by directly differentiating through statistical-physics models, namely free energy calculations and molecular dynamics simulations. We consider two systems that are crucial to our understanding of structural self-assembly: bulk crystallization and small nanoclusters. In each case we are able to assemble precise dynamical features. Using gradient information, we manipulate interactions among constituent particles to tune the rate at which these systems yield specific structures of interest. Moreover, we use this approach to learn non-trivial features about the high-dimensional design space, allowing us to accurately predict when multiple kinetic features can be simultaneously and independently controlled. These results provide a concrete and generalizable foundation for studying non-structural self-assembly, including kinetic properties as well as other complex emergent properties, in a vast array of systems.

Submitted 18 November, 2020; v1 submitted 28 October, 2020; originally announced October 2020.

Comments: 5 figures

arXiv:2008.09681 [pdf, other]

Unifying framework for strong and fragile liquids via machine learning: a study of liquid silica

Authors: Ekin D. Cubuk, Andrea J. Liu, Efthimios Kaxiras, Samuel S. Schoenholz

Abstract: The fragility of a glassforming liquid characterizes how rapidly its relaxation dynamics slow down with cooling. The viscosity of strong liquids follows an Arrhenius law with a temperature-independent barrier height to rearrangements responsible for relaxation, whereas fragile liquids experience a much faster increase in their dynamics, suggesting a barrier height that increases with decreasing temperature. Strong glassformers are typically network glasses, while fragile glassformers are typically molecular or hard-sphere-like. As a result of these differences at the microscopic level, strong and fragile glassformers are usually treated separately from a theoretical point of view. Silica is the archetypal strong glassformer at low temperatures, but also exhibits a mysterious strong-to-fragile crossover at higher temperatures. Here we show that softness, a structure-based machine learned parameter that has previously been applied to fragile glassformers provides a useful description of model liquid silica in the strong and fragile regimes, and through the strong-to-fragile crossover. Just as for fragile glassformers, the relationship between softness and dynamics is invariant and Arrhenius in all regimes, but the average softness changes with temperature. The strong-to-fragile crossover in silica is not due to a sudden, qualitative change in structure, but can be explained by a simple Arrhenius form with a continuously and linearly changing local structure. Our results unify the study of liquid silica under a single simple conceptual picture.

Submitted 21 August, 2020; originally announced August 2020.

Comments: 6 pages, 4 figures

arXiv:2008.07545 [pdf, other]

Whitening and second order optimization both make information in the dataset unusable during training, and can reduce or prevent generalization

Authors: Neha S. Wadia, Daniel Duckworth, Samuel S. Schoenholz, Ethan Dyer, Jascha Sohl-Dickstein

Abstract: Machine learning is predicated on the concept of generalization: a model achieving low error on a sufficiently large training set should also perform well on novel samples from the same distribution. We show that both data whitening and second order optimization can harm or entirely prevent generalization. In general, model training harnesses information contained in the sample-sample second moment matrix of a dataset. For a general class of models, namely models with a fully connected first layer, we prove that the information contained in this matrix is the only information which can be used to generalize. Models trained using whitened data, or with certain second order optimization schemes, have less access to this information, resulting in reduced or nonexistent generalization ability. We experimentally verify these predictions for several architectures, and further demonstrate that generalization continues to be harmed even when theoretical requirements are relaxed. However, we also show experimentally that regularized second order optimization can provide a practical tradeoff, where training is accelerated but less information is lost, and generalization can in some circumstances even improve.

Submitted 19 July, 2021; v1 submitted 17 August, 2020; originally announced August 2020.

Comments: 13+10 pages, 10 figures; minor textual changes and some reorganization, one new figure and a new proof of main theorem added

arXiv:2007.15801 [pdf, other]

Finite Versus Infinite Neural Networks: an Empirical Study

Authors: Jaehoon Lee, Samuel S. Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, Jascha Sohl-Dickstein

Abstract: We perform a careful, thorough, and large scale empirical study of the correspondence between wide neural networks and kernel methods. By doing so, we resolve a variety of open questions related to the study of infinitely wide neural networks. Our experimental results include: kernel methods outperform fully-connected finite-width networks, but underperform convolutional finite width networks; neural network Gaussian process (NNGP) kernels frequently outperform neural tangent (NT) kernels; centered and ensembled finite networks have reduced posterior variance and behave more similarly to infinite networks; weight decay and the use of a large learning rate break the correspondence between finite and infinite networks; the NTK parameterization outperforms the standard parameterization for finite width networks; diagonal regularization of kernels acts similarly to early stopping; floating point precision limits kernel performance beyond a critical dataset size; regularized ZCA whitening improves accuracy; finite network performance depends non-monotonically on width in ways not captured by double descent phenomena; equivariance of CNNs is only beneficial for narrow networks far from the kernel regime. Our experiments additionally motivate an improved layer-wise scaling for weight decay which improves generalization in finite-width networks. Finally, we develop improved best practices for using NNGP and NT kernels for prediction, including a novel ensembling technique. Using these best practices we achieve state-of-the-art results on CIFAR-10 classification for kernels corresponding to each architecture class we consider.

Submitted 8 September, 2020; v1 submitted 30 July, 2020; originally announced July 2020.

Comments: 17+11 pages; v2 references added, minor improvements

arXiv:2001.07301 [pdf, other]

On the infinite width limit of neural networks with a standard parameterization

Authors: Jascha Sohl-Dickstein, Roman Novak, Samuel S. Schoenholz, Jaehoon Lee

Abstract: There are currently two parameterizations used to derive fixed kernels corresponding to infinite width neural networks, the NTK (Neural Tangent Kernel) parameterization and the naive standard parameterization. However, the extrapolation of both of these parameterizations to infinite width is problematic. The standard parameterization leads to a divergent neural tangent kernel while the NTK parameterization fails to capture crucial aspects of finite width networks such as: the dependence of training dynamics on relative layer widths, the relative training dynamics of weights and biases, and overall learning rate scale. Here we propose an improved extrapolation of the standard parameterization that preserves all of these properties as width is taken to infinity and yields a well-defined neural tangent kernel. We show experimentally that the resulting kernels typically achieve similar accuracy to those resulting from an NTK parameterization, but with better correspondence to the parameterization of typical finite width networks. Additionally, with careful tuning of width parameters, the improved standard parameterization kernels can outperform those stemming from an NTK parameterization. We release code implementing this improved standard parameterization as part of the Neural Tangents library at https://github.com/google/neural-tangents.

Submitted 18 April, 2020; v1 submitted 20 January, 2020; originally announced January 2020.

arXiv:1912.13053 [pdf, other]

Disentangling Trainability and Generalization in Deep Neural Networks

Authors: Lechao Xiao, Jeffrey Pennington, Samuel S. Schoenholz

Abstract: A longstanding goal in the theory of deep learning is to characterize the conditions under which a given neural network architecture will be trainable, and if so, how well it might generalize to unseen data. In this work, we provide such a characterization in the limit of very wide and very deep networks, for which the analysis simplifies considerably. For wide networks, the trajectory under gradient descent is governed by the Neural Tangent Kernel (NTK), and for deep networks the NTK itself maintains only weak data dependence. By analyzing the spectrum of the NTK, we formulate necessary conditions for trainability and generalization across a range of architectures, including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs). We identify large regions of hyperparameter space for which networks can memorize the training set but completely fail to generalize. We find that CNNs without global average pooling behave almost identically to FCNs, but that CNNs with pooling have markedly different and often better generalization performance. These theoretical results are corroborated experimentally on CIFAR10 for a variety of network architectures and we include a colab notebook that reproduces the essential results of the paper.

Submitted 13 July, 2020; v1 submitted 30 December, 2019; originally announced December 2019.

Comments: 22 pages, 3 figures, ICML 2020. Associated Colab notebook at https://colab.research.google.com/github/google/neural-tangents/blob/master/notebooks/Disentangling_Trainability_and_Generalization.ipynb

arXiv:1912.04232 [pdf, other]

JAX, M.D.: A Framework for Differentiable Physics

Authors: Samuel S. Schoenholz, Ekin D. Cubuk

Abstract: We introduce JAX MD, a software package for performing differentiable physics simulations with a focus on molecular dynamics. JAX MD includes a number of physics simulation environments, as well as interaction potentials and neural networks that can be integrated into these environments without writing any additional code. Since the simulations themselves are differentiable functions, entire trajectories can be differentiated to perform meta-optimization. These features are built on primitive operations, such as spatial partitioning, that allow simulations to scale to hundreds-of-thousands of particles on a single GPU. These primitives are flexible enough that they can be used to scale up workloads outside of molecular dynamics. We present several examples that highlight the features of JAX MD including: integration of graph neural networks into traditional simulations, meta-optimization through minimization of particle packings, and a multi-agent flocking simulation. JAX MD is available at www.github.com/google/jax-md.

Submitted 3 December, 2020; v1 submitted 9 December, 2019; originally announced December 2019.

Journal ref: Advances in Neural Information Processing Systems 33 (2020)

arXiv:1912.02803 [pdf, other]

Neural Tangents: Fast and Easy Infinite Neural Networks in Python

Authors: Roman Novak, Lechao Xiao, Jiri Hron, Jaehoon Lee, Alexander A. Alemi, Jascha Sohl-Dickstein, Samuel S. Schoenholz

Abstract: Neural Tangents is a library designed to enable research into infinite-width neural networks. It provides a high-level API for specifying complex and hierarchical neural network architectures. These networks can then be trained and evaluated either at finite-width as usual or in their infinite-width limit. Infinite-width networks can be trained analytically using exact Bayesian inference or using gradient descent via the Neural Tangent Kernel. Additionally, Neural Tangents provides tools to study gradient descent training dynamics of wide but finite networks in either function space or weight space. The entire library runs out-of-the-box on CPU, GPU, or TPU. All computations can be automatically distributed over multiple accelerators with near-linear scaling in the number of devices. Neural Tangents is available at www.github.com/google/neural-tangents. We also provide an accompanying interactive Colab notebook.

Submitted 5 December, 2019; originally announced December 2019.

arXiv:1902.08129 [pdf, other]

A Mean Field Theory of Batch Normalization

Authors: Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, Samuel S. Schoenholz

Abstract: We develop a mean field theory for batch normalization in fully-connected feedforward neural networks. In so doing, we provide a precise characterization of signal propagation and gradient backpropagation in wide batch-normalized networks at initialization. Our theory shows that gradient signals grow exponentially in depth and that these exploding gradients cannot be eliminated by tuning the initial weight variances or by adjusting the nonlinear activation function. Indeed, batch normalization itself is the cause of gradient explosion. As a result, vanilla batch-normalized networks without skip connections are not trainable at large depths for common initialization schemes, a prediction that we verify with a variety of empirical simulations. While gradient explosion cannot be eliminated, it can be reduced by tuning the network close to the linear regime, which improves the trainability of deep batch-normalized networks without residual connections. Finally, we investigate the learning dynamics of batch-normalized networks and observe that after a single step of optimization the networks achieve a relatively stable equilibrium in which gradients have dramatically smaller dynamic range. Our theory leverages Laplace, Fourier, and Gegenbauer transforms and we derive new identities that may be of independent interest.

Submitted 5 March, 2019; v1 submitted 21 February, 2019; originally announced February 2019.

Comments: To appear in ICLR 2019

arXiv:1902.06720 [pdf, other]

doi 10.1088/1742-5468/abc62b

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

Authors: Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, Jeffrey Pennington

Abstract: A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.
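
As a sketch of the linear model this abstract refers to, the first-order Taylor expansion of a network's output around its initial parameters can be written as below. The symbols $f$ (network function), $\theta_0$ (initial parameters), and $\theta_t$ (parameters at training time $t$) are notation assumed here for illustration; they do not appear in this listing.

```latex
f^{\mathrm{lin}}(x;\theta_t) \;=\; f(x;\theta_0) \;+\; \nabla_\theta f(x;\theta)\big|_{\theta=\theta_0}\,(\theta_t - \theta_0)
```

Under this notation the linearized model is linear in the parameters $\theta_t$ but generally remains nonlinear in the input $x$.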

Submitted 8 December, 2019; v1 submitted 18 February, 2019; originally announced February 2019.

Comments: 12+16 pages; open-source code available at https://github.com/google/neural-tangents; accepted to NeurIPS 2019

arXiv:1901.08987 [pdf, other]

Dynamical Isometry and a Mean Field Theory of LSTMs and GRUs

Authors: Dar Gilboa, Bo Chang, Minmin Chen, Greg Yang, Samuel S. Schoenholz, Ed H. Chi, Jeffrey Pennington

Abstract: Training recurrent neural networks (RNNs) on long sequence tasks is plagued with difficulties arising from the exponential explosion or vanishing of signals as they propagate forward or backward through the network. Many techniques have been proposed to ameliorate these issues, including various algorithmic and architectural modifications. Two of the most successful RNN architectures, the LSTM and the GRU, do exhibit modest improvements over vanilla RNN cells, but they still suffer from instabilities when trained on very long sequences. In this work, we develop a mean field theory of signal propagation in LSTMs and GRUs that enables us to calculate the time scales for signal propagation as well as the spectral properties of the state-to-state Jacobians. By optimizing these quantities in terms of the initialization hyperparameters, we derive a novel initialization scheme that eliminates or reduces training instabilities. We demonstrate the efficacy of our initialization scheme on multiple sequence tasks, on which it enables successful training while a standard initialization either fails completely or is orders of magnitude slower. We also observe a beneficial effect on generalization performance using this new initialization.

Submitted 23 May, 2019; v1 submitted 25 January, 2019; originally announced January 2019.

arXiv:1808.06576 [pdf, other]

Peptide-Spectra Matching from Weak Supervision

Authors: Samuel S. Schoenholz, Sean Hackett, Laura Deming, Eugene Melamud, Navdeep Jaitly, Fiona McAllister, Jonathon O'Brien, George Dahl, Bryson Bennett, Andrew M. Dai, Daphne Koller

Abstract: As in many other scientific domains, we face a fundamental problem when using machine learning to identify proteins from mass spectrometry data: large ground truth datasets mapping inputs to correct outputs are extremely difficult to obtain. Instead, we have access to imperfect hand-coded models crafted by domain experts. In this paper, we apply deep neural networks to an important step of the protein identification problem, the pairing of mass spectra with short sequences of amino acids called peptides. We train our model to differentiate between top scoring results from a state-of-the-art classical system and hard-negative second and third place results. Our resulting model is much better at identifying peptides with spectra than the model used to generate its training data. In particular, we achieve a 43% improvement over standard matching methods and a 10% improvement over a combination of the matching method and an industry standard cross-spectra reranking tool. Importantly, in a more difficult experimental regime that reflects current challenges facing biologists, our advantage over the previous state-of-the-art grows to 15% even after reranking. We believe this approach will generalize to other challenging scientific problems.

Submitted 22 August, 2018; v1 submitted 20 August, 2018; originally announced August 2018.

arXiv:1806.05394 [pdf, other]

Dynamical Isometry and a Mean Field Theory of RNNs: Gating Enables Signal Propagation in Recurrent Neural Networks

Authors: Minmin Chen, Jeffrey Pennington, Samuel S. Schoenholz

Abstract: Recurrent neural networks have gained widespread use in modeling sequence data across various domains. While many successful recurrent architectures employ a notion of gating, the exact mechanism that enables such remarkable performance is not well understood. We develop a theory for signal propagation in recurrent networks after random initialization using a combination of mean field theory and random matrix theory. To simplify our discussion, we introduce a new RNN cell with a simple gating mechanism that we call the minimalRNN and compare it with vanilla RNNs. Our theory allows us to define a maximum timescale over which RNNs can remember an input. We show that this theory predicts trainability for both recurrent architectures. We show that gated recurrent networks feature a much broader, more robust, trainable region than vanilla RNNs, which corroborates recent experimental findings. Finally, we develop a closed-form critical initialization scheme that achieves dynamical isometry in both vanilla RNNs and minimalRNNs. We show that this results in significant improvement in training dynamics. Finally, we demonstrate that the minimalRNN achieves comparable performance to its more complex counterparts, such as LSTMs or GRUs, on a language modeling task.

Submitted 15 August, 2018; v1 submitted 14 June, 2018; originally announced June 2018.

Comments: ICML 2018 Conference Proceedings

arXiv:1806.05393 [pdf, other]

Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

Authors: Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, Jeffrey Pennington

Abstract: In recent years, state-of-the-art methods in computer vision have utilized increasingly deep convolutional neural network architectures (CNNs), with some of the most successful models employing hundreds or even thousands of layers. A variety of pathologies such as vanishing/exploding gradients make training such deep networks challenging. While residual connections and batch normalization do enable training at these depths, it has remained unclear whether such specialized architecture designs are truly necessary to train deep CNNs. In this work, we demonstrate that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme. We derive this initialization scheme theoretically by developing a mean field theory for signal propagation and by characterizing the conditions for dynamical isometry, the equilibration of singular values of the input-output Jacobian matrix. These conditions require that the convolution operator be an orthogonal transformation in the sense that it is norm-preserving. We present an algorithm for generating such random initial orthogonal convolution kernels and demonstrate empirically that they enable efficient training of extremely deep architectures.

Submitted 10 July, 2018; v1 submitted 14 June, 2018; originally announced June 2018.

Comments: ICML 2018 Conference Proceedings

arXiv:1803.01416 [pdf, other]

doi 10.1073/pnas.1807176115

Machine learning determination of atomic dynamics at grain boundaries

Authors: Tristan A. Sharp, Spencer L. Thomas, Ekin D. Cubuk, Samuel S. Schoenholz, David J. Srolovitz, Andrea J. Liu

Abstract: In polycrystalline materials, grain boundaries are sites of enhanced atomic motion, but the complexity of the atomic structures within a grain boundary network makes it difficult to link the structure and atomic dynamics. Here we use a machine learning technique to establish a connection between local structure and dynamics of these materials. Following previous work on bulk glassy materials, we define a purely structural quantity, softness, that captures the propensity of an atom to rearrange. This approach correctly identifies crystalline regions, stacking faults, and twin boundaries as having low likelihood of atomic rearrangements, while finding a large variability within high-energy grain boundaries. As has been found in glasses [9,19,26], the probability that atoms of a given softness will rearrange is nearly Arrhenius. This indicates a well-defined energy barrier as well as a well-defined prefactor for the Arrhenius form for atoms of a given softness. The decrease in the prefactor for low-softness atoms indicates that variations in entropy exhibit a dominant influence on the atomic dynamics in grain boundaries.

Submitted 11 September, 2018; v1 submitted 4 March, 2018; originally announced March 2018.

arXiv:1802.09979 [pdf, other]

The Emergence of Spectral Universality in Deep Networks

Authors: Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli

Abstract: Recent work has shown that tight concentration of the entire spectrum of singular values of a deep network's input-output Jacobian around one at initialization can speed up learning by orders of magnitude. Therefore, to guide important design choices, it is important to build a full theoretical understanding of the spectra of Jacobians at initialization. To this end, we leverage powerful tools from free probability theory to provide a detailed analytic understanding of how a deep network's Jacobian spectrum depends on various hyperparameters including the nonlinearity, the weight and bias distributions, and the depth. For a variety of nonlinearities, our work reveals the emergence of new universal limiting spectral distributions that remain concentrated around one even as the depth goes to infinity.

Submitted 27 February, 2018; originally announced February 2018.

Comments: 17 pages, 4 figures. Appearing at the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018

arXiv:1801.02774 [pdf, other]

Adversarial Spheres

Authors: Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S. Schoenholz, Maithra Raghu, Martin Wattenberg, Ian Goodfellow

Abstract: State of the art computer vision models have been shown to be vulnerable to small adversarial perturbations of the input. In other words, most images in the data distribution are both correctly classified by the model and are very close to a visually similar misclassified image. Despite substantial research interest, the cause of the phenomenon is still poorly understood and remains unsolved. We hypothesize that this counter intuitive behavior is a naturally occurring result of the high dimensional geometry of the data manifold. As a first step towards exploring this hypothesis, we study a simple synthetic dataset of classifying between two concentric high dimensional spheres. For this dataset we show a fundamental tradeoff between the amount of test error and the average distance to nearest error. In particular, we prove that any model which misclassifies a small constant fraction of a sphere will be vulnerable to adversarial perturbations of size $O(1/\sqrt{d})$. Surprisingly, when we train several different architectures on this dataset, all of their error sets naturally approach this theoretical bound. As a result of the theory, the vulnerability of neural networks to small adversarial perturbations is a logical consequence of the amount of test error observed. We hope that our theoretical analysis of this very simple case will point the way forward to explore how the geometry of complex real-world data sets leads to adversarial examples.

Submitted 10 September, 2018; v1 submitted 8 January, 2018; originally announced January 2018.

MSC Class: 68T45 ACM Class: I.2.6

arXiv:1712.08969 [pdf, other]

Mean Field Residual Networks: On the Edge of Chaos

Authors: Greg Yang, Samuel S. Schoenholz

Abstract: We study randomly initialized residual networks using mean field theory and the theory of difference equations. Classical feedforward neural networks, such as those with tanh activations, exhibit exponential behavior on the average when propagating inputs forward or gradients backward. The exponential forward dynamics causes rapid collapsing of the input space geometry, while the exponential backward dynamics causes drastic vanishing or exploding gradients. We show, in contrast, that by adding skip connections, the network will, depending on the nonlinearity, adopt subexponential forward and backward dynamics, and in many cases in fact polynomial. The exponents of these polynomials are obtained through analytic methods and proved and verified empirically to be correct. In terms of the "edge of chaos" hypothesis, these subexponential and polynomial laws allow residual networks to "hover over the boundary between stability and chaos," thus preserving the geometry of the input space and the gradient information flow. In our experiments, for each activation function we study here, we initialize residual networks with different hyperparameters and train them on MNIST. Remarkably, our initialization time theory can accurately predict test time performance of these networks, by tracking either the expected amount of gradient explosion or the expected squared distance between the images of two input vectors.
Importantly, we show, theoretically as well as empirically, that common initializations such as the Xavier or the He schemes are not optimal for residual networks, because the optimal initialization variances depend on the depth. Finally, we have made mathematical contributions by deriving several new identities for the kernels of powers of ReLU functions by relating them to the zeroth Bessel function of the second kind.

Submitted 24 December, 2017; originally announced December 2017.

Comments: NIPS 2017

arXiv:1711.04735 [pdf, other]

Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice

Authors: Jeffrey Pennington, Samuel S. Schoenholz, Surya Ganguli

Abstract: It is well known that the initialization of weights in deep neural networks can have a dramatic impact on learning speed. For example, ensuring the mean squared singular value of a network's input-output Jacobian is $O(1)$ is essential for avoiding the exponential vanishing or explosion of gradients. The stronger condition that all singular values of the Jacobian concentrate near $1$ is a property known as dynamical isometry. For deep linear networks, dynamical isometry can be achieved through orthogonal weight initialization and has been shown to dramatically speed up learning; however, it has remained unclear how to extend these results to the nonlinear setting. We address this question by employing powerful tools from free probability theory to compute analytically the entire singular value distribution of a deep network's input-output Jacobian. We explore the dependence of the singular value distribution on the depth of the network, the weight initialization, and the choice of nonlinearity. Intriguingly, we find that ReLU networks are incapable of dynamical isometry. On the other hand, sigmoidal networks can achieve isometry, but only with orthogonal weight initialization. Moreover, we demonstrate empirically that deep nonlinear networks achieving dynamical isometry learn orders of magnitude faster than networks that do not. Indeed, we show that properly-initialized deep sigmoidal networks consistently outperform deep ReLU networks.
Overall, our analysis reveals that controlling the entire distribution of Jacobian singular values is an important design consideration in deep learning.
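
The deep linear case mentioned in this abstract can be illustrated with a minimal sketch, which is not the authors' code. It assumes a stack of square, bias-free linear layers whose weights are random orthogonal matrices, and checks that an input's norm is preserved through the whole stack, as expected when every singular value of the end-to-end Jacobian equals 1. The helper names (`random_orthogonal`, `propagate`) and the width and depth values are hypothetical choices made for this illustration.

```python
import math
import random


def random_orthogonal(n, rng):
    """Build an n x n matrix with orthonormal rows via Gram-Schmidt on Gaussian vectors."""
    rows = []
    while len(rows) < n:
        v = [rng.gauss(0.0, 1.0) for _ in range(n)]
        # Remove the components of v along the rows accepted so far.
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        length = math.sqrt(sum(a * a for a in v))
        if length > 1e-8:  # skip numerically degenerate draws
            rows.append([a / length for a in v])
    return rows


def matvec(matrix, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in matrix]


def norm(x):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in x))


def propagate(x, layers):
    """Apply each linear layer in turn (no biases, no nonlinearity)."""
    for w in layers:
        x = matvec(w, x)
    return x


rng = random.Random(0)
width, depth = 8, 50  # illustrative sizes, chosen arbitrarily
x0 = [rng.gauss(0.0, 1.0) for _ in range(width)]
orthogonal_layers = [random_orthogonal(width, rng) for _ in range(depth)]

# Ratio of output norm to input norm; orthogonal layers should keep it at 1
# up to floating-point error.
ratio = norm(propagate(x0, orthogonal_layers)) / norm(x0)
print(f"norm ratio after {depth} orthogonal layers: {ratio:.6f}")
```

This sketch covers only the linear setting; the nonlinear analysis described in the abstract relies on free probability theory and is not reproduced here.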

Submitted 13 November, 2017; originally announced November 2017.

Comments: 13 pages, 6 figures. Appearing at the 31st Conference on Neural Information Processing Systems (NIPS 2017)

arXiv:1711.02846 [pdf, other]

Intriguing Properties of Adversarial Examples

Authors: Ekin D. Cubuk, Barret Zoph, Samuel S. Schoenholz, Quoc V. Le

Abstract: It is becoming increasingly clear that many machine learning classifiers are vulnerable to adversarial examples. In attempting to explain the origin of adversarial examples, previous studies have typically focused on the fact that neural networks operate on high dimensional data, they overfit, or they are too linear. Here we argue that the origin of adversarial examples is primarily due to an inherent uncertainty that neural networks have about their predictions. We show that the functional form of this uncertainty is independent of architecture, dataset, and training protocol; and depends only on the statistics of the logit differences of the network, which do not change significantly during training. This leads to adversarial error having a universal scaling, as a power-law, with respect to the size of the adversarial perturbation. We show that this universality holds for a broad range of datasets (MNIST, CIFAR10, ImageNet, and random data), models (including state-of-the-art deep networks, linear models, adversarially trained networks, and networks trained on randomly shuffled labels), and attacks (FGSM, step l.l., PGD). Motivated by these results, we study the effects of reducing prediction entropy on adversarial robustness. Finally, we study the effect of network architectures on adversarial sensitivity. To do this, we use neural architecture search with reinforcement learning to find adversarially robust architectures on CIFAR10. Our resulting architecture is more robust to white and black box attacks compared to previous attempts.

Submitted 8 November, 2017; originally announced November 2017.

Comments: 17 pages

arXiv:1711.00165 [pdf, other]

Deep Neural Networks as Gaussian Processes

Authors: Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein

Abstract: It has long been known that a single-layer fully-connected neural network with an i.i.d. prior over its parameters is equivalent to a Gaussian process (GP), in the limit of infinite network width. This correspondence enables exact Bayesian inference for infinite width neural networks on regression tasks by means of evaluating the corresponding GP. Recently, kernel functions which mimic multi-layer random neural networks have been developed, but only outside of a Bayesian framework. As such, previous work has not identified that these kernels can be used as covariance functions for GPs and allow fully Bayesian prediction with a deep neural network. In this work, we derive the exact equivalence between infinitely wide deep networks and GPs. We further develop a computationally efficient pipeline to compute the covariance function for these GPs. We then use the resulting GPs to perform Bayesian inference for wide deep neural networks on MNIST and CIFAR-10. We observe that trained neural network accuracy approaches that of the corresponding GP with increasing layer width, and that the GP uncertainty is strongly correlated with trained network prediction error. We further find that test performance increases as finite-width trained networks are made wider and more similar to a GP, and thus that GP predictions typically outperform those of finite-width networks. Finally we connect the performance of these GPs to the recent theory of signal propagation in random neural networks.

Submitted 2 March, 2018; v1 submitted 31 October, 2017; originally announced November 2017.

Comments: Published version in ICLR 2018. 10 pages + appendix

arXiv:1710.06570 [pdf, other]

A Correspondence Between Random Neural Networks and Statistical Field Theory

Authors: Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein

Abstract: A number of recent papers have provided evidence that practical design questions about neural networks may be tackled theoretically by studying the behavior of random networks. However, until now the tools available for analyzing random neural networks have been relatively ad-hoc. In this work, we show that the distribution of pre-activations in random neural networks can be exactly mapped onto lattice models in statistical physics. We argue that several previous investigations of stochastic networks actually studied a particular factorial approximation to the full lattice model. For random linear networks and random rectified linear networks we show that the corresponding lattice models in the wide network limit may be systematically approximated by a Gaussian distribution with covariance between the layers of the network. In each case, the approximate distribution can be diagonalized by Fourier transformation. We show that this approximation accurately describes the results of numerical simulations of wide random neural networks. Finally, we demonstrate that in each case the large scale behavior of the random networks can be approximated by an effective field theory.

Submitted 17 October, 2017; originally announced October 2017.

arXiv:1709.08015 [pdf, other]

doi 10.1088/1742-6596/1036/1/012021

Combining Machine Learning and Physics to Understand Glassy Systems

Authors: Samuel S. Schoenholz

Abstract: Our understanding of supercooled liquids and glasses has lagged significantly behind that of simple liquids and crystalline solids. This is in part due to the many possibly relevant degrees of freedom that are present due to the disorder inherent to these systems and in part to non-equilibrium effects which are difficult to treat in the standard context of statistical physics. Together these issues have resulted in a field whose theories are under-constrained by experiment and where fundamental questions are still unresolved. Mean field results have been successful in infinite dimensions but it is unclear to what extent they apply to realistic systems and assume uniform local structure. At odds with this are theories premised on the existence of structural defects. However, until recently it has been impossible to find structural signatures that are predictive of dynamics. Here we summarize and recast the results from several recent papers offering a data driven approach to building a phenomenological theory of disordered materials by combining machine learning with physical intuition.

Submitted 23 September, 2017; originally announced September 2017.

arXiv:1704.01212 [pdf, other]

Neural Message Passing for Quantum Chemistry

Authors: Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, George E. Dahl

Abstract: Supervised learning on molecules has incredible potential to be useful in chemistry, drug discovery, and materials science. Luckily, several promising and closely related neural network models invariant to molecular symmetries have already been described in the literature. These models learn a message passing algorithm and aggregation procedure to compute a function of their entire input graph. At this point, the next step is to find a particularly effective variant of this general approach and apply it to chemical prediction benchmarks until we either solve them or reach the limits of the approach. In this paper, we reformulate existing models into a single common framework we call Message Passing Neural Networks (MPNNs) and explore additional novel variations within this framework. Using MPNNs we demonstrate state of the art results on an important molecular property prediction benchmark; these results are strong enough that we believe future work should focus on datasets with larger molecules or more accurate ground truth labels.

Submitted 12 June, 2017; v1 submitted 4 April, 2017; originally announced April 2017.

Comments: 14 pages

ACM Class: I.2.6

arXiv:1702.05532 [pdf, other]

doi 10.1021/acs.jctc.7b00577

Machine learning prediction errors better than DFT accuracy

Authors: Felix A. Faber, Luke Hutchison, Bing Huang, Justin Gilmer, Samuel S. Schoenholz, George E. Dahl, Oriol Vinyals, Steven Kearnes, Patrick F. Riley, O. Anatole von Lilienfeld

Abstract: We investigate the impact of choosing regressors and molecular representations for the construction of fast machine learning (ML) models of thirteen electronic ground-state properties of organic molecules. The performance of each regressor/representation/property combination is assessed using learning curves which report out-of-sample errors as a function of training set size with up to $\sim$117k distinct molecules. Molecular structures and properties at hybrid density functional theory (DFT) level of theory used for training and testing come from the QM9 database [Ramakrishnan et al., Scientific Data 1, 140022 (2014)] and include dipole moment, polarizability, HOMO/LUMO energies and gap, electronic spatial extent, zero point vibrational energy, enthalpies and free energies of atomization, heat capacity and the highest fundamental vibrational frequency. Various representations from the literature have been studied (Coulomb matrix, bag of bonds, BAML and ECFP4, molecular graphs (MG)), as well as newly developed distribution based variants including histograms of distances (HD), and angles (HDA/MARAD), and dihedrals (HDAD). Regressors include linear models (Bayesian ridge regression (BR) and linear regression with elastic net regularization (EN)), random forest (RF), kernel ridge regression (KRR) and two types of neural networks, graph convolutions (GC) and gated graph networks (GG). We present numerical evidence that ML model predictions deviate from DFT less than DFT deviates from experiment for all properties. Furthermore, our out-of-sample prediction errors with respect to hybrid DFT reference are on par with, or close to, chemical accuracy. Our findings suggest that ML models could be more accurate than hybrid DFT if explicitly electron correlated quantum (or experimental) data was available.

Submitted 4 June, 2017; v1 submitted 17 February, 2017; originally announced February 2017.

arXiv:1611.01232 [pdf, other]

Deep Information Propagation

Authors: Samuel S. Schoenholz, Justin Gilmer, Surya Ganguli, Jascha Sohl-Dickstein

Abstract: We study the behavior of untrained neural networks whose weights and biases are randomly distributed using mean field theory. We show the existence of depth scales that naturally limit the maximum depth of signal propagation through these random networks. Our main practical result is to show that random networks may be trained precisely when information can travel through them. Thus, the depth scales that we identify provide bounds on how deep a network may be trained for a specific choice of hyperparameters. As a corollary to this, we argue that in networks at the edge of chaos, one of these depth scales diverges. Thus arbitrarily deep networks may be trained only sufficiently close to criticality. We show that the presence of dropout destroys the order-to-chaos critical point and therefore strongly limits the maximum trainable depth for random networks. Finally, we develop a mean field theory for backpropagation and we show that the ordered and chaotic phases correspond to regions of vanishing and exploding gradient respectively.

Submitted 4 April, 2017; v1 submitted 3 November, 2016; originally announced November 2016.

arXiv:1610.03401 [pdf, other]

doi 10.1073/pnas.1703927114

Disconnecting structure and dynamics in glassy thin films

Authors: Daniel M. Sussman, Samuel S. Schoenholz, Ekin D. Cubuk, Andrea J. Liu

Abstract: Nanometrically thin glassy films depart strikingly from the behavior of their bulk counterparts. We investigate whether the dynamical differences between bulk and thin film glasses can be understood by differences in local microscopic structure. We employ machine-learning methods that have previously identified strong correlations between local structure and particle rearrangement dynamics in bulk systems. We show that these methods completely fail to detect key aspects of thin-film glassy dynamics. Furthermore, we show that no combination of local structural features drawn from a very general set of two- and multi-point functions is able to distinguish between particles at the center of the film and those in intermediate layers where the dynamics are strongly perturbed.

Submitted 11 October, 2016; originally announced October 2016.

Comments: 8 pages, 7 figures

arXiv:1607.06969 [pdf, other]

doi 10.1073/pnas.1610204114

The Relationship Between Local Structure and Relaxation in Out-of-Equilibrium Glassy Systems

Authors: Samuel S. Schoenholz, Ekin D. Cubuk, Efthimios Kaxiras, Andrea J. Liu

Abstract: The dynamical glass transition is typically taken to be the temperature at which a glassy liquid is no longer able to equilibrate on experimental timescales. Consequently, the physical properties of these systems just above or below the dynamical glass transition, such as viscosity, can change by many orders of magnitude over long periods of time following external perturbation. During this progress towards equilibrium, glassy systems exhibit a history dependence that has complicated their study. In previous work, we bridged the gap between structure and dynamics in glassy liquids above their dynamical glass transition temperatures by introducing a scalar field called "softness", a quantity obtained using machine learning methods. Softness is designed to capture the hidden patterns in relative particle positions that correlate strongly with dynamical rearrangements of particle positions. Here we show that the out-of-equilibrium behavior of a model glassforming system can be understood in terms of softness. To do this we first demonstrate that the evolution of behavior following a temperature quench is a primarily structural phenomenon: the structure changes considerably, but the relationship between structure and dynamics remains invariant. We then show that the history-dependent relaxation time can be robustly computed from structure as quantified by softness. Together, these results motivate the use of softness to characterize the history dependence of glasses.

Submitted 23 July, 2016; originally announced July 2016.

arXiv:1510.02129 [pdf, other]

doi 10.1007/JHEP06(2016)145

Nonlinear Sigma Models with Compact Hyperbolic Target Spaces

Authors: Steven Gubser, Zain H. Saleem, Samuel S. Schoenholz, Bogdan Stoica, James Stokes

Abstract: We explore the phase structure of nonlinear sigma models with target spaces corresponding to compact quotients of hyperbolic space, focusing on the case of a hyperbolic genus-2 Riemann surface. The continuum theory of these models can be approximated by a lattice spin system which we simulate using Monte Carlo methods. The target space possesses interesting geometric and topological properties which are reflected in novel features of the sigma model. In particular, we observe a topological phase transition at a critical temperature, above which vortices proliferate, reminiscent of the Kosterlitz-Thouless phase transition in the $O(2)$ model. Unlike in the $O(2)$ case, there are many different types of vortices, suggesting a possible analogy to the Hagedorn treatment of statistical mechanics of a proliferating number of hadron species. Below the critical temperature the spins cluster around six special points in the target space known as Weierstrass points. The diversity of compact hyperbolic manifolds suggests that our model is only the simplest example of a broad class of statistical mechanical models whose main features can be understood essentially in geometric terms.

Submitted 7 October, 2015; originally announced October 2015.

Comments: 14 pages, 6 Figures

Report number: CALT-TH 2015-019, PUPT-2487

arXiv:1506.07772 [pdf, other]

A structural approach to relaxation in glassy liquids

Authors: Samuel S. Schoenholz, Ekin D. Cubuk, Daniel M. Sussman, Efthimios Kaxiras, Andrea J. Liu

Abstract: When a liquid freezes, a change in the local atomic structure marks the transition to the crystal. When a liquid is cooled to form a glass, however, no noticeable structural change marks the glass transition. Indeed, characteristic features of glassy dynamics that appear below an onset temperature, T_0, are qualitatively captured by mean field theory, which assumes uniform local structure at all temperatures. Even studies of more realistic systems have found only weak correlations between structure and dynamics. This raises the question: is structure important to glassy dynamics in three dimensions? Here, we answer this question affirmatively by using machine learning methods to identify a new field, that we call softness, which characterizes local structure and is strongly correlated with rearrangement dynamics. We find that the onset of glassy dynamics at T_0 is marked by the onset of correlations between softness (i.e. structure) and dynamics. Moreover, we use softness to construct a simple model of slow glassy relaxation that is in excellent agreement with our simulation results, showing that a theory of the evolution of softness in time would constitute a theory of glassy dynamics.

Submitted 22 November, 2015; v1 submitted 25 June, 2015; originally announced June 2015.

arXiv:1503.07364 [pdf, other]

doi 10.1103/PhysRevE.92.022307

Strain fluctuations and elastic moduli in disordered solids

Authors: Daniel M. Sussman, Samuel S. Schoenholz, Ye Xu, Tim Still, A. G. Yodh, Andrea J. Liu

Abstract: Recently there has been a surge in interest in using video-microscopy techniques to infer the local mechanical properties of disordered solids. One common approach is to minimize the difference between particle vibrational displacements in a local coarse-graining volume and the displacements that would result from a best-fit affine deformation. Effective moduli are then inferred under the assumption that the components of this best-fit affine deformation tensor have a Boltzmann distribution. In this paper, we combine theoretical arguments with experimental and simulation data to demonstrate that the above does not reveal information about the true elastic moduli of jammed packings and colloidal glasses.

Submitted 24 August, 2015; v1 submitted 25 March, 2015; originally announced March 2015.

Comments: 12 pages, 8 figures

Journal ref: Phys. Rev. E 92, 022307 (2015)

arXiv:1409.6820 [pdf, other]

doi 10.1103/PhysRevLett.114.108001

Identifying structural flow defects in disordered solids using machine learning methods

Authors: Ekin D. Cubuk, Samuel S. Schoenholz, Jennifer M. Rieser, Brad D. Malone, Joerg Rottler, Douglas J. Durian, Efthimios Kaxiras, Andrea J. Liu

Abstract: We use machine learning methods on local structure to identify flow defects - or regions susceptible to rearrangement - in jammed and glassy systems. We apply this method successfully to two disparate systems: a two dimensional experimental realization of a granular pillar under compression, and a Lennard-Jones glass in both two and three dimensions above and below its glass transition temperature. We also identify characteristics of flow defects that differentiate them from the rest of the sample. Our results show it is possible to discern subtle structural features responsible for heterogeneous dynamics observed across a broad range of disordered materials.

Submitted 24 September, 2014; originally announced September 2014.

Comments: 4 pages, 4 figures

Journal ref: Phys. Rev. Lett. 114, 108001 (2015)

arXiv:1404.1403 [pdf, other]

Understanding plastic deformation in thermal glasses from single-soft-spot dynamics

Authors: Samuel S. Schoenholz, Andrea J. Liu, Robert A. Riggleman, Joerg Rottler

Abstract: By considering the low-frequency vibrational modes of amorphous solids, Manning and Liu [Phys. Rev. Lett. 107, 108302 (2011)] showed that a population of "soft spots" can be identified that are intimately related to plasticity at zero temperature under quasistatic shear. In this work we track individual soft spots with time in a two-dimensional sheared thermal Lennard-Jones glass at temperatures ranging from deep in the glassy regime to above the glass transition temperature. We show that the lifetimes of individual soft spots are correlated with the timescale for structural relaxation. We additionally calculate the number of rearrangements required to destroy soft spots, and show that most soft spots can survive many rearrangements. Finally, we show that soft spots are robust predictors of rearrangements at temperatures well into the super-cooled regime. Altogether, these results pave the way for mesoscopic theories of plasticity of amorphous solids based on dynamical behavior of individual soft spots.

Submitted 4 April, 2014; originally announced April 2014.

Comments: 9 pages, 6 figures

arXiv:1403.0922 [pdf, other]

doi 10.1103/PhysRevE.89.042304

Predicting plasticity with soft vibrational modes: from dislocations to glasses

Authors: Joerg Rottler, Samuel S. Schoenholz, Andrea J. Liu

Abstract: We show that quasi-localized low-frequency modes in the vibrational spectrum can be used to construct soft spots, or regions vulnerable to rearrangement, which serve as a universal tool for the identification of flow defects in solids. We show that soft spots not only encode spatial information, via their location, but also directional information, via directors for particles within each soft spot. Single crystals with isolated dislocations exhibit low-frequency phonon modes that localize at the core, and their polarization pattern predicts the motion of atoms during elementary dislocation glide in exquisite detail. Even in polycrystals and disordered solids, we find that the directors associated with particles in soft spots are highly correlated with the direction of particle displacements in rearrangements.

Submitted 4 March, 2014; originally announced March 2014.

arXiv:1301.6982 [pdf, other]

doi 10.1039/C3SM51096D

Stability of jammed packings II: the transverse length scale

Authors: Samuel S. Schoenholz, Carl P. Goodrich, Oleg Kogan, Andrea J. Liu, Sidney R. Nagel

Abstract: As a function of packing fraction at zero temperature and applied stress, an amorphous packing of spheres exhibits a jamming transition where the system is sensitive to boundary conditions even in the thermodynamic limit. Upon further compression, the system should become insensitive to boundary conditions provided it is sufficiently large. Here we explore the linear response to a large class of boundary perturbations in 2 and 3 dimensions. We consider each finite packing with periodic boundary conditions as the basis of an infinite square or cubic lattice and study properties of vibrational modes at arbitrary wave vector. We find that the stability of such modes can be understood in terms of a competition between plane waves and the anomalous vibrational modes associated with the jamming transition; infinitesimal boundary perturbations become irrelevant for systems that are larger than a length scale that characterizes the transverse excitations. This previously identified length diverges at the jamming transition.

Submitted 22 November, 2013; v1 submitted 29 January, 2013; originally announced January 2013.

Comments: 8 pages, 5 figures

Journal ref: Soft Matter, 2013, 9, 11000-11006

Showing 1–45 of 45 results for author: Schoenholz, S S