Quantum Many-Body Physics Calculations with Large Language Models

Authors: Haining Pan, Nayantara Mudur, Will Taranto, Maria Tikhanovskaya, Subhashini Venugopalan, Yasaman Bahri, Michael P. Brenner, Eun-Ah Kim

Abstract: Large language models (LLMs) have demonstrated an unprecedented ability to perform complex tasks in multiple domains, including mathematical and scientific reasoning. We demonstrate that with carefully designed prompts, LLMs can accurately carry out key calculations in research papers in theoretical physics. We focus on a broadly used approximation method in quantum physics: the Hartree-Fock method, requiring an analytic multi-step calculation deriving approximate Hamiltonian and corresponding self-consistency equations. To carry out the calculations using LLMs, we design multi-step prompt templates that break down the analytic calculation into standardized steps with placeholders for problem-specific information. We evaluate GPT-4's performance in executing the calculation for 15 research papers from the past decade, demonstrating that, with correction of intermediate steps, it can correctly derive the final Hartree-Fock Hamiltonian in 13 cases and makes minor errors in 2 cases. Aggregating across all research papers, we find an average score of 87.5 (out of 100) on the execution of individual calculation steps. Overall, the requisite skill for doing these calculations is at the graduate level in quantum condensed matter theory. We further use LLMs to mitigate the two primary bottlenecks in this evaluation process: (i) extracting information from papers to fill in templates and (ii) automatic scoring of the calculation steps, demonstrating good results in both cases.
The strong performance is the first step for developing algorithms that automatically explore theoretical hypotheses at an unprecedented scale.

Submitted 5 March, 2024; originally announced March 2024.

Comments: 9 pages, 4 figures. Supplemental material in the source file

arXiv:2309.01592 [pdf, other]

Les Houches Lectures on Deep Learning at Large & Infinite Width

Authors: Yasaman Bahri, Boris Hanin, Antonin Brossollet, Vittorio Erba, Christian Keup, Rosalba Pacelli, James B. Simon

Abstract: These lectures, presented at the 2022 Les Houches Summer School on Statistical Physics and Machine Learning, focus on the infinite-width limit and large-width regime of deep neural networks. Topics covered include various statistical and dynamical properties of these networks. In particular, the lecturers discuss properties of random deep neural networks; connections between trained deep neural networks, linear models, kernels, and Gaussian processes that arise in the infinite-width limit; and perturbative and non-perturbative treatments of large but finite-width networks, at initialization and after training.

Submitted 12 February, 2024; v1 submitted 4 September, 2023; originally announced September 2023.

Comments: These are notes from lectures delivered by Yasaman Bahri and Boris Hanin at the 2022 Les Houches Summer School on Statistical Physics and Machine Learning, and a first version of them was transcribed by Antonin Brossollet, Vittorio Erba, Christian Keup, Rosalba Pacelli, James B. Simon

arXiv:2208.07896 [pdf, ps, other]

Normal form for transverse instability of gZK equation for the line soliton with nearly critical speed

Authors: Yakine Bahri, Hichem Hajaiej

Abstract: In this paper, we study the transverse instability of the generalized Zakharov-Kuznetsov equation for the line soliton with critical speed. We derive and justify a normal form reduction for a bifurcation problem of the stationary nonlinear KdV equation on the product space $\mathbb{R} \times \mathbb{T}$.

Submitted 16 August, 2022; originally announced August 2022.

Comments: arXiv admin note: text overlap with arXiv:1706.00064 by other authors

arXiv:2206.04615 [pdf, other]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Authors: Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R. Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, Agnieszka Kluska, Aitor Lewkowycz, Akshat Agarwal, Alethea Power, Alex Ray, Alex Warstadt, Alexander W. Kocurek, Ali Safaya, Ali Tazarv, Alice Xiang, Alicia Parrish, Allen Nie, Aman Hussain, Amanda Askell, Amanda Dsouza, et al. (426 additional authors not shown)

Abstract: Language models demonstrate both quantitative improvement and new qualitative capabilities with increasing scale. Despite their potentially transformative impact, these new capabilities are as yet poorly characterized. In order to inform future research, prepare for disruptive new model capabilities, and ameliorate socially harmful effects, it is vital that we understand the present and near-future capabilities and limitations of language models. To address this challenge, we introduce the Beyond the Imitation Game benchmark (BIG-bench). BIG-bench currently consists of 204 tasks, contributed by 450 authors across 132 institutions. Task topics are diverse, drawing problems from linguistics, childhood development, math, common-sense reasoning, biology, physics, social bias, software development, and beyond. BIG-bench focuses on tasks that are believed to be beyond the capabilities of current language models. We evaluate the behavior of OpenAI's GPT models, Google-internal dense transformer architectures, and Switch-style sparse transformers on BIG-bench, across model sizes spanning millions to hundreds of billions of parameters. In addition, a team of human expert raters performed all tasks in order to provide a strong baseline.
Findings include: model performance and calibration both improve with scale, but are poor in absolute terms (and when compared with rater performance); performance is remarkably similar across model classes, though with benefits from sparsity; tasks that improve gradually and predictably commonly involve a large knowledge or memorization component, whereas tasks that exhibit "breakthrough" behavior at a critical scale often involve multiple steps or components, or brittle metrics; social bias typically increases with scale in settings with ambiguous context, but this can be improved with prompting.

Submitted 12 June, 2023; v1 submitted 9 June, 2022; originally announced June 2022.

Comments: 27 pages, 17 figures + references and appendices, repo: https://github.com/google/BIG-bench

Journal ref: Transactions on Machine Learning Research, May/2022, https://openreview.net/forum?id=uyTL5Bvosj

arXiv:2201.06764 [pdf, ps, other]

Infinitely many positive solutions of a Gross-Pitaevskii equation in the presence of a harmonic potential and combined nonlinearities

Authors: Yakine Bahri, Hichem Hajaiej

Abstract: The main goal of this paper is to address an important conjecture in the field of differential equations in the presence of a harmonic potential. While in the subcritical case, the uniqueness of the positive solution has been addressed by Hirose and Ohta in 2007, the problem has remained open for years in the supercritical case. In Hadj Selem et al., the authors obtained interesting numerical computations suggesting that for some bifurcating parameter $λ$, the equation has many positive solutions that vanish at infinity. In this paper, we provide a proof of this claim by constructing a countable number of solutions that bifurcate from the unique singular solutions with $λ$ close to the first eigenvalue $λ_1$ of the harmonic operator $-Δ + |x|^2$. Our method hinges on a matching argument, and applies to the supercritical case, and to the supercritical case in the presence of a subcritical, critical or supercritical perturbation.

Submitted 6 March, 2022; v1 submitted 18 January, 2022; originally announced January 2022.

arXiv:2106.15831 [pdf, other]

The Evolution of Out-of-Distribution Robustness Throughout Fine-Tuning

Authors: Anders Andreassen, Yasaman Bahri, Behnam Neyshabur, Rebecca Roelofs

Abstract: Although machine learning models typically experience a drop in performance on out-of-distribution data, accuracies on in- versus out-of-distribution data are widely observed to follow a single linear trend when evaluated across a testbed of models. Models that are more accurate on the out-of-distribution data relative to this baseline exhibit "effective robustness" and are exceedingly rare. Identifying such models, and understanding their properties, is key to improving out-of-distribution performance. We conduct a thorough empirical investigation of effective robustness during fine-tuning and surprisingly find that models pre-trained on larger datasets exhibit effective robustness during training that vanishes at convergence. We study how properties of the data influence effective robustness, and we show that it increases with the larger size, more diversity, and higher example difficulty of the dataset. We also find that models that display effective robustness are able to correctly classify 10% of the examples that no other current testbed model gets correct. Finally, we discuss several strategies for scaling effective robustness to the high-accuracy regime to improve the out-of-distribution accuracy of state-of-the-art models.

Submitted 30 June, 2021; originally announced June 2021.

Comments: 27 pages, 25 figures

arXiv:2103.05887 [pdf, ps, other]

Pitchfork bifurcation at line solitons for nonlinear Schrödinger equations on the product space $\mathbb{R} \times \mathbb{T}$

Authors: Takafumi Akahori, Yakine Bahri, Slim Ibrahim, Hiroaki Kikuchi

Abstract: In this paper, we study the bifurcation problem from a line soliton for a stationary nonlinear Schrödinger equation on the product space $\mathbb{R} \times \mathbb{T}$. We extend earlier results to a larger class of the nonlinearity in the equation. The salient point of our analysis relies on a lower bound of solution to the "auxiliary equation" and then on the application of the Crandall-Rabinowitz argument.

Submitted 18 January, 2022; v1 submitted 10 March, 2021; originally announced March 2021.

arXiv:2102.06701 [pdf, other]

doi 10.1073/pnas.2311878121

Explaining Neural Scaling Laws

Authors: Yasaman Bahri, Ethan Dyer, Jared Kaplan, Jaehoon Lee, Utkarsh Sharma

Abstract: The population loss of trained deep neural networks often follows precise power-law scaling relations with either the size of the training dataset or the number of parameters in the network. We propose a theory that explains the origins of and connects these scaling laws. We identify variance-limited and resolution-limited scaling behavior for both dataset and model size, for a total of four scaling regimes. The variance-limited scaling follows simply from the existence of a well-behaved infinite data or infinite width limit, while the resolution-limited regime can be explained by positing that models are effectively resolving a smooth data manifold. In the large width limit, this can be equivalently obtained from the spectrum of certain kernels, and we present evidence that large width and large dataset resolution-limited scaling exponents are related by a duality. We exhibit all four scaling regimes in the controlled setting of large random feature and pretrained models and test the predictions empirically on a range of standard architectures and datasets. We also observe several empirical relationships between datasets and scaling exponents under modifications of task and architecture aspect ratio. Our work provides a taxonomy for classifying different scaling regimes, underscores that there can be different mechanisms driving improvements in loss, and lends insight into the microscopic origins of and relationships between scaling exponents.

Submitted 28 April, 2024; v1 submitted 12 February, 2021; originally announced February 2021.

Comments: 11 pages, 3 figures + Supplement (expanded). This version to appear in PNAS

Journal ref: PNAS 121 (27) e2311878121 (2024)

arXiv:2101.01314 [pdf, ps, other]

Transverse stability of line soliton and characterization of ground state for wave guide Schrödinger equations

Authors: Yakine Bahri, Slim Ibrahim, Hiroaki Kikuchi

Abstract: In this paper, we study the transverse stability of the line Schrödinger soliton under a full wave guide Schrödinger flow on a cylindrical domain $\mathbb R\times\mathbb T$. When the nonlinearity is of power type $|ψ|^{p-1}ψ$ with $p>1$, we show that there exists a critical frequency $ω_{p} >0$ such that the line standing wave is stable for $0<ω< ω_{p}$ and unstable for $ω> ω_{p}$. Furthermore, we characterize the ground state of the wave guide Schrödinger equation. More precisely, we prove that there exists $ω_{*} \in (0, ω_{p}]$ such that the ground states coincide with the line standing waves for $ω\in (0, ω_{*}]$ and are different from the line standing waves for $ω\in (ω_{*}, \infty)$.

Submitted 8 January, 2021; v1 submitted 4 January, 2021; originally announced January 2021.

Comments: To appear in JDDE

arXiv:2006.10541 [pdf, other]

Exact posterior distributions of wide Bayesian neural networks

Authors: Jiri Hron, Yasaman Bahri, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein

Abstract: Recent work has shown that the prior over functions induced by a deep Bayesian neural network (BNN) behaves as a Gaussian process (GP) as the width of all layers becomes large. However, many BNN applications are concerned with the BNN function space posterior. While some empirical evidence of the posterior convergence was provided in the original works of Neal (1996) and Matthews et al. (2018), it is limited to small datasets or architectures due to the notorious difficulty of obtaining and verifying exactness of BNN posterior approximations. We provide the missing theoretical proof that the exact BNN posterior converges (weakly) to the one induced by the GP limit of the prior. For empirical validation, we show how to generate exact samples from a finite BNN on a small dataset via rejection sampling.

Submitted 26 November, 2020; v1 submitted 18 June, 2020; originally announced June 2020.

arXiv:2006.10540 [pdf, other]

Infinite attention: NNGP and NTK for deep attention networks

Authors: Jiri Hron, Yasaman Bahri, Jascha Sohl-Dickstein, Roman Novak

Abstract: There is a growing amount of literature on the relationship between wide neural networks (NNs) and Gaussian processes (GPs), identifying an equivalence between the two for a variety of NN architectures. This equivalence enables, for instance, accurate approximation of the behaviour of wide Bayesian NNs without MCMC or variational approximations, or characterisation of the distribution of randomly initialised wide NNs optimised by gradient descent without ever running an optimiser. We provide a rigorous extension of these results to NNs involving attention layers, showing that unlike single-head attention, which induces non-Gaussian behaviour, multi-head attention architectures behave as GPs as the number of heads tends to infinity. We further discuss the effects of positional encodings and layer normalisation, and propose modifications of the attention mechanism which lead to improved results for both finite and infinitely wide NNs. We evaluate attention kernels empirically, leading to a moderate improvement upon the previous state-of-the-art on CIFAR-10 for GPs without trainable kernels and advanced data preprocessing. Finally, we introduce new features to the Neural Tangents library (Novak et al., 2020) allowing applications of NNGP/NTK models, with and without attention, to variable-length sequences, with an example on the IMDb reviews dataset.

Submitted 18 June, 2020; originally announced June 2020.

Comments: ICML 2020

arXiv:2003.02218 [pdf, other]

The large learning rate phase of deep learning: the catapult mechanism

Authors: Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, Guy Gur-Ari

Abstract: The choice of initial learning rate can have a profound effect on the performance of deep networks. We present a class of neural networks with solvable training dynamics, and confirm their predictions empirically in practical deep learning settings. The networks exhibit sharply distinct behaviors at small and large learning rates. The two regimes are separated by a phase transition. In the small learning rate phase, training can be understood using the existing theory of infinitely wide neural networks. At large learning rates the model captures qualitatively distinct phenomena, including the convergence of gradient descent dynamics to flatter minima. One key prediction of our model is a narrow range of large, stable learning rates. We find good agreement between our model's predictions and training dynamics in realistic deep learning settings. Furthermore, we find that the optimal performance in such settings is often found in the large learning rate phase. We believe our results shed light on characteristics of models trained at different learning rates. In particular, they fill a gap between existing wide neural network theory, and the nonlinear, large learning rate, training dynamics relevant to practice.

Submitted 4 March, 2020; originally announced March 2020.

Comments: 25 pages, 19 figures

arXiv:1911.11457 [pdf, ps, other]

Self-similar blow-up profiles for slightly supercritical nonlinear Schrödinger equations

Authors: Yakine Bahri, Yvan Martel, Pierre Raphaël

Abstract: We construct radially symmetric self-similar blow-up profiles for the mass supercritical nonlinear Schrödinger equation $i\partial_t u + Δu + |u|^{p-1}u=0$ on $\mathbf{R}^d$, close to the mass critical case and for any space dimension $d\ge 1$. These profiles bifurcate from the ground state solitary wave. The argument relies on the classical matched asymptotics method suggested in [Sulem, C.; Sulem, P.-L., The nonlinear Schrödinger equation. Self-focusing and wave collapse. Applied Mathematical Sciences, 139. Springer-Verlag, New York, 1999] which needs to be applied in a degenerate case due to the presence of exponentially small terms in the bifurcation equation related to the log-log blow-up law observed in the mass critical case.

Submitted 26 November, 2019; originally announced November 2019.

arXiv:1902.06720 [pdf, other]

doi 10.1088/1742-5468/abc62b

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

Authors: Jaehoon Lee, Lechao Xiao, Samuel S. Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, Jeffrey Pennington

Abstract: A longstanding goal in deep learning research has been to precisely characterize training and generalization. However, the often complex loss landscapes of neural networks have made a theory of learning dynamics elusive. In this work, we show that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters. Furthermore, mirroring the correspondence between wide Bayesian neural networks and Gaussian processes, gradient-based training of wide neural networks with a squared loss produces test set predictions drawn from a Gaussian process with a particular compositional kernel. While these theoretical results are only exact in the infinite width limit, we nevertheless find excellent empirical agreement between the predictions of the original network and those of the linearized version even for finite practically-sized networks. This agreement is robust across different architectures, optimization methods, and loss functions.

Submitted 8 December, 2019; v1 submitted 18 February, 2019; originally announced February 2019.

Comments: 12+16 pages; open-source code available at https://github.com/google/neural-tangents; accepted to NeurIPS 2019

arXiv:1810.05148 [pdf, other]

Bayesian Deep Convolutional Networks with Many Channels are Gaussian Processes

Authors: Roman Novak, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein

Abstract: There is a previously identified equivalence between wide fully connected neural networks (FCNs) and Gaussian processes (GPs). This equivalence enables, for instance, test set predictions that would have resulted from a fully Bayesian, infinitely wide trained FCN to be computed without ever instantiating the FCN, but by instead evaluating the corresponding GP. In this work, we derive an analogous equivalence for multi-layer convolutional neural networks (CNNs) both with and without pooling layers, and achieve state of the art results on CIFAR10 for GPs without trainable kernels. We also introduce a Monte Carlo method to estimate the GP corresponding to a given neural network architecture, even in cases where the analytic form has too many terms to be computationally feasible. Surprisingly, in the absence of pooling layers, the GPs corresponding to CNNs with and without weight sharing are identical. As a consequence, translation equivariance, beneficial in finite channel CNNs trained with stochastic gradient descent (SGD), is guaranteed to play no role in the Bayesian treatment of the infinite channel limit - a qualitative difference between the two regimes that is not present in the FCN case. We confirm experimentally that, while in some scenarios the performance of SGD-trained finite CNNs approaches that of the corresponding GPs as the channel count increases, with careful tuning SGD-trained CNNs can significantly outperform their corresponding GPs, suggesting advantages from SGD training compared to fully Bayesian parameter estimation.

Submitted 21 August, 2020; v1 submitted 11 October, 2018; originally announced October 2018.

Comments: Published as a conference paper at ICLR 2019

arXiv:1810.01385 [pdf, ps, other]

Remarks on solitary waves and Cauchy problem for a Half-wave-Schrödinger equations

Authors: Yakine Bahri, Slim Ibrahim, Hiroaki Kikuchi

Abstract: In this paper, we study the solitary wave and the Cauchy problem for Half-wave-Schrödinger equations in the plane. First, we show the existence and orbital stability of the ground states. Secondly, we prove that traveling waves exist and converge to zero as the velocity tends to $1$. Finally, we solve the Cauchy problem for initial data in $L^{2}_{x}H^{s}_{y}(\mathbb{R}^{2})$, with $s>\frac{1}{2}$.

Submitted 2 October, 2018; originally announced October 2018.

arXiv:1806.05393 [pdf, other]

Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

Authors: Lechao Xiao, Yasaman Bahri, Jascha Sohl-Dickstein, Samuel S. Schoenholz, Jeffrey Pennington

Abstract: In recent years, state-of-the-art methods in computer vision have utilized increasingly deep convolutional neural network architectures (CNNs), with some of the most successful models employing hundreds or even thousands of layers. A variety of pathologies such as vanishing/exploding gradients make training such deep networks challenging. While residual connections and batch normalization do enable training at these depths, it has remained unclear whether such specialized architecture designs are truly necessary to train deep CNNs. In this work, we demonstrate that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme. We derive this initialization scheme theoretically by developing a mean field theory for signal propagation and by characterizing the conditions for dynamical isometry, the equilibration of singular values of the input-output Jacobian matrix. These conditions require that the convolution operator be an orthogonal transformation in the sense that it is norm-preserving. We present an algorithm for generating such random initial orthogonal convolution kernels and demonstrate empirically that they enable efficient training of extremely deep architectures.

Submitted 10 July, 2018; v1 submitted 14 June, 2018; originally announced June 2018.

Comments: ICML 2018 Conference Proceedings

arXiv:1802.08760 [pdf, other]

Sensitivity and Generalization in Neural Networks: an Empirical Study

Authors: Roman Novak, Yasaman Bahri, Daniel A. Abolafia, Jeffrey Pennington, Jascha Sohl-Dickstein

Abstract: In practice it is often found that large over-parameterized neural networks generalize better than their smaller counterparts, an observation that appears to conflict with classical notions of function complexity, which typically favor smaller models. In this work, we investigate this tension between complexity and generalization through an extensive empirical exploration of two natural metrics of complexity related to sensitivity to input perturbations. Our experiments survey thousands of models with various fully-connected architectures, optimizers, and other hyper-parameters, as well as four different image classification datasets. We find that trained neural networks are more robust to input perturbations in the vicinity of the training data manifold, as measured by the norm of the input-output Jacobian of the network, and that it correlates well with generalization. We further establish that factors associated with poor generalization $-$ such as full-batch training or using random labels $-$ correspond to lower robustness, while factors associated with good generalization $-$ such as data augmentation and ReLU non-linearities $-$ give rise to more robust functions. Finally, we demonstrate how the input-output Jacobian norm can be predictive of generalization at the level of individual test points.

Submitted 18 June, 2018; v1 submitted 23 February, 2018; originally announced February 2018.

Comments: Published as a conference paper at ICLR 2018

arXiv:1711.00165 [pdf, other]

Deep Neural Networks as Gaussian Processes

Authors: Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, Jascha Sohl-Dickstein

Abstract: It has long been known that a single-layer fully-connected neural network with an i.i.d. prior over its parameters is equivalent to a Gaussian process (GP), in the limit of infinite network width. This correspondence enables exact Bayesian inference for infinite width neural networks on regression tasks by means of evaluating the corresponding GP. Recently, kernel functions which mimic multi-layer random neural networks have been developed, but only outside of a Bayesian framework. As such, previous work has not identified that these kernels can be used as covariance functions for GPs and allow fully Bayesian prediction with a deep neural network. In this work, we derive the exact equivalence between infinitely wide deep networks and GPs. We further develop a computationally efficient pipeline to compute the covariance function for these GPs. We then use the resulting GPs to perform Bayesian inference for wide deep neural networks on MNIST and CIFAR-10. We observe that trained neural network accuracy approaches that of the corresponding GP with increasing layer width, and that the GP uncertainty is strongly correlated with trained network prediction error. We further find that test performance increases as finite-width trained networks are made wider and more similar to a GP, and thus that GP predictions typically outperform those of finite-width networks. Finally we connect the performance of these GPs to the recent theory of signal propagation in random neural networks.

Submitted 2 March, 2018; v1 submitted 31 October, 2017; originally announced November 2017.

Comments: Published version in ICLR 2018. 10 pages + appendix

arXiv:1604.03715 [pdf, ps, other]

On the asymptotic stability in the energy space for multi-solitons of the Landau-Lifshitz equation

Authors: Yakine Bahri

Abstract: We establish the asymptotic stability of multi-solitons for the one-dimensional Landau-Lifshitz equation with an easy-plane anisotropy. The solitons have non-zero speed, are ordered according to their speeds and have sufficiently separated initial positions. We provide the asymptotic stability around solitons and between solitons. More precisely, we show that for an initial datum close to a sum of $N$ dark solitons, the corresponding solution converges weakly to one of the solitons in the sum, when it is translated to the centre of this soliton, and converges weakly to zero when it is translated between solitons.

Submitted 13 April, 2016; originally announced April 2016.

arXiv:1512.00441 [pdf, ps, other]

doi 10.2140/apde.2016.9.645

Asymptotic stability in the energy space for dark solitons of the Landau-Lifshitz equation

Authors: Yakine Bahri

Abstract: We prove the asymptotic stability in the energy space of non-zero speed solitons for the one-dimensional Landau-Lifshitz equation with an easy-plane anisotropy. More precisely, we show that any solution corresponding to an initial datum close to a soliton with non-zero speed, is weakly convergent in the energy space as time goes to infinity, to a soliton with a possible different non-zero speed, up to the invariances of the equation. Our analysis relies on the ideas developed by Martel and Merle for the generalized Korteweg-de Vries equations. We use the Madelung transform to study the problem in the hydrodynamical framework. In this framework, we rely on the orbital stability of the solitons and the weak continuity of the flow in order to construct a limit profile. We next derive a monotonicity formula for the momentum, which gives the localization of the limit profile. Its smoothness and exponential decay then follow from a smoothing result for the localized solutions of the Schrödinger equations. Finally, we prove a Liouville type theorem, which shows that only the solitons enjoy these properties in their neighbourhoods.

Submitted 1 December, 2015; originally announced December 2015.

Comments: arXiv admin note: substantial text overlap with arXiv:1212.5027 by other authors

Journal ref: Anal. PDE 9 (2016) 645-697

arXiv:1410.1320 [pdf, other]

doi 10.1103/PhysRevB.93.205158

Phonon analogue of topological nodal semimetals

Authors: Hoi Chun Po, Yasaman Bahri, Ashvin Vishwanath

Abstract: Topological band structures in electronic systems like topological insulators and semimetals give rise to highly unusual physical properties. Analogous topological effects have also been discussed in bosonic systems, but the novel phenomena typically occur only when the system is excited by finite-frequency probes. A mapping recently proposed by Kane and Lubensky [Nat. Phys. 10, 39 (2014)], however, establishes a closer correspondence. It relates the zero-frequency excitations of mechanical systems to topological zero modes of fermions that appear at the edges of an otherwise gapped system. Here we generalize the mapping to systems with an intrinsically gapless bulk. In particular, we construct mechanical counterparts of topological semimetals. The resulting gapless bulk modes are physically distinct from the usual acoustic Goldstone phonons, and appear even in the absence of continuous translation invariance. Moreover, the zero-frequency phonon modes feature adjustable momenta and are topologically protected as long as the lattice coordination is unchanged. Such protected soft modes with tunable wavevector may be useful in designing mechanical structures with fault-tolerant properties.

Submitted 3 March, 2017; v1 submitted 6 October, 2014; originally announced October 2014.

Comments: 5 pages, 3 figures and supplementary materials; v2: 6+1 pages, 5+1 figures. Close to published version

Journal ref: Phys. Rev. B 93, 205158 (2016)

arXiv:1408.6826 [pdf, other]

doi 10.1103/PhysRevB.92.035131

Stable non-Fermi liquid phase of itinerant spin-orbit coupled ferromagnets

Authors: Yasaman Bahri, Andrew C. Potter

Abstract: Direct coupling between gapless bosons and a Fermi surface results in the destruction of Landau quasiparticles and a breakdown of Fermi liquid theory. Such a non-Fermi liquid phase arises in spin-orbit coupled ferromagnets with spontaneously broken continuous symmetries due to strong coupling between rotational Goldstone modes and itinerant electrons. These systems provide an experimentally accessible context for studying non-Fermi liquid physics. Possible examples include low-density Rashba coupled electron gases, which have a natural tendency towards spontaneous ferromagnetism, or topological insulator surface states with proximity-induced ferromagnetism. Crucially, unlike the related case of a spontaneous nematic distortion of the Fermi surface, for which the non-Fermi liquid regime is expected to be masked by a superconducting dome, we show that the non-Fermi liquid phase in spin-orbit coupled ferromagnets is stable.

Submitted 8 September, 2014; v1 submitted 28 August, 2014; originally announced August 2014.

Comments: 14 pages; typos fixed and transport/disorder sections revised

Journal ref: Phys. Rev. B 92, 035131 (2015)

arXiv:1307.4092 [pdf, other]

Localization and topology protected quantum coherence at the edge of 'hot' matter

Authors: Yasaman Bahri, Ronen Vosk, Ehud Altman, Ashvin Vishwanath

Abstract: Topological phases are often characterized by special edge states confined near the boundaries by an energy gap in the bulk. On raising temperature, these edge states are lost in a clean system due to mobile thermal excitations. Recently however, it has been established that disorder can localize an isolated many body system, potentially allowing for a sharply defined topological phase even in a highly excited state. Here we show this to be the case for the topological phase of a one dimensional magnet with quenched disorder, which features spin one-half excitations at the edges. The time evolution of a simple, highly excited, initial state is used to reveal quantum coherent edge spins. In particular, we demonstrate, using theoretical arguments and numerical simulation, the coherent revival of an edge spin over a time scale that grows exponentially bigger with system size. This is in sharp contrast to the general expectation that quantum bits strongly coupled to a 'hot' many body system will rapidly lose coherence.

Submitted 4 October, 2013; v1 submitted 15 July, 2013; originally announced July 2013.

Comments: Typos corrected and appendix E added

arXiv:1303.2600 [pdf, ps, other]

doi 10.1103/PhysRevB.89.155135

Detecting Majorana fermions in quasi-one-dimensional topological phases using nonlocal order parameters

Authors: Yasaman Bahri, Ashvin Vishwanath

Abstract: Topological phases which host Majorana fermions can not be identified via local order parameters. We give simple nonlocal order parameters to distinguish quasi-one-dimensional (1D) topological superconductors of spinless fermions, for any interacting model in the absence of time reversal symmetry. These string or "brane" order parameters are natural for measurements in cold atom systems using quantum gas microscopy. We propose them as a way to identify symmetry-protected topological phases of Majorana fermions in cold atom experiments via bulk rather than edge degrees of freedom. Subsequently, we study two-dimensional (2D) topological superconductors via the quasi-1D limit of coupling $N$ identical chains on the cylinder. We classify the symmetric, interacting topological phases protected by the additional $\mathbb{Z}_N$ translation symmetry. The phases include quasi-1D analogs of (i) the $p+ip$ chiral topological superconductor, which can be distinguished up to the 2D Chern number mod 2, and (ii) the 2D weak topological superconductor. We devise general rules for constructing nonlocal order parameters which distinguish the phases. These rules encode the signature of the fermionic topological phase in the symmetry properties of the terminating operators of the nonlocal string or brane. The nonlocal order parameters for some of these phases simply involve a product of the string order parameters for the individual chains. Finally, we give a physical picture of one of the topological phases as a condensate of certain defects, which motivates the form of the nonlocal order parameter and is reminiscent of higher dimensional constructions of topological phases.

Submitted 25 July, 2014; v1 submitted 11 March, 2013; originally announced March 2013.

Comments: Final version; 14 pages, 4 figures

Journal ref: Phys. Rev. B 89, 155135 (2014)

Showing 1–25 of 25 results for author: Bahri, Y