Search | arXiv e-print repository

The Eigenlearning Framework: A Conservation Law Perspective on Kernel Regression and Wide Neural Networks

Authors: James B. Simon, Madeline Dickens, Dhruva Karkada, Michael R. DeWeese

Abstract: We derive simple closed-form estimates for the test risk and other generalization metrics of kernel ridge regression (KRR). Relative to prior work, our derivations are greatly simplified and our final expressions are more readily interpreted. These improvements are enabled by our identification of a sharp conservation law which limits the ability of KRR to learn any orthonormal basis of functions.… ▽ More We derive simple closed-form estimates for the test risk and other generalization metrics of kernel ridge regression (KRR). Relative to prior work, our derivations are greatly simplified and our final expressions are more readily interpreted. These improvements are enabled by our identification of a sharp conservation law which limits the ability of KRR to learn any orthonormal basis of functions. Test risk and other objects of interest are expressed transparently in terms of our conserved quantity evaluated in the kernel eigenbasis. We use our improved framework to: i) provide a theoretical explanation for the "deep bootstrap" of Nakkiran et al (2020), ii) generalize a previous result regarding the hardness of the classic parity problem, iii) fashion a theoretical tool for the study of adversarial robustness, and iv) draw a tight analogy between KRR and a well-studied system in statistical physics. △ Less

Submitted 26 October, 2023; v1 submitted 8 October, 2021; originally announced October 2021.

Comments: 12 pages (main text) + 25 pages (refs + appendices). A previous version of this manuscript was entitled "Neural Tangent Kernel Eigenvalues Accurately Predict Generalization."

arXiv:2106.03186 [pdf, other]

Reverse Engineering the Neural Tangent Kernel

Authors: James B. Simon, Sajant Anand, Michael R. DeWeese

Abstract: The development of methods to guide the design of neural networks is an important open challenge for deep learning theory. As a paradigm for principled neural architecture design, we propose the translation of high-performing kernels, which are better-understood and amenable to first-principles design, into equivalent network architectures, which have superior efficiency, flexibility, and feature… ▽ More The development of methods to guide the design of neural networks is an important open challenge for deep learning theory. As a paradigm for principled neural architecture design, we propose the translation of high-performing kernels, which are better-understood and amenable to first-principles design, into equivalent network architectures, which have superior efficiency, flexibility, and feature learning. To this end, we constructively prove that, with just an appropriate choice of activation function, any positive-semidefinite dot-product kernel can be realized as either the NNGP or neural tangent kernel of a fully-connected neural network with only one hidden layer. We verify our construction numerically and demonstrate its utility as a design tool for finite fully-connected networks in several experiments. △ Less

Submitted 13 August, 2022; v1 submitted 6 June, 2021; originally announced June 2021.

Comments: 15 pages, 5 figures

arXiv:2007.09240 [pdf, other]

A new method for parameter estimation in probabilistic models: Minimum probability flow

Authors: Jascha Sohl-Dickstein, Peter Battaglino, Michael R. DeWeese

Abstract: Fitting probabilistic models to data is often difficult, due to the general intractability of the partition function. We propose a new parameter fitting method, Minimum Probability Flow (MPF), which is applicable to any parametric model. We demonstrate parameter estimation using MPF in two cases: a continuous state space model, and an Ising spin glass. In the latter case it outperforms current tec… ▽ More Fitting probabilistic models to data is often difficult, due to the general intractability of the partition function. We propose a new parameter fitting method, Minimum Probability Flow (MPF), which is applicable to any parametric model. We demonstrate parameter estimation using MPF in two cases: a continuous state space model, and an Ising spin glass. In the latter case it outperforms current techniques by at least an order of magnitude in convergence time with lower error in the recovered coupling parameters. △ Less

Submitted 17 July, 2020; originally announced July 2020.

Comments: Originally published 2011. Uploaded to arXiv 2020. arXiv admin note: text overlap with arXiv:0906.4779, arXiv:1205.4295

arXiv:2003.10397 [pdf, other]

Critical Point-Finding Methods Reveal Gradient-Flat Regions of Deep Network Losses

Authors: Charles G. Frye, James Simon, Neha S. Wadia, Andrew Ligeralde, Michael R. DeWeese, Kristofer E. Bouchard

Abstract: Despite the fact that the loss functions of deep neural networks are highly non-convex, gradient-based optimization algorithms converge to approximately the same performance from many random initial points. One thread of work has focused on explaining this phenomenon by characterizing the local curvature near critical points of the loss function, where the gradients are near zero, and demonstratin… ▽ More Despite the fact that the loss functions of deep neural networks are highly non-convex, gradient-based optimization algorithms converge to approximately the same performance from many random initial points. One thread of work has focused on explaining this phenomenon by characterizing the local curvature near critical points of the loss function, where the gradients are near zero, and demonstrating that neural network losses enjoy a no-bad-local-minima property and an abundance of saddle points. We report here that the methods used to find these putative critical points suffer from a bad local minima problem of their own: they often converge to or pass through regions where the gradient norm has a stationary point. We call these gradient-flat regions, since they arise when the gradient is approximately in the kernel of the Hessian, such that the loss is locally approximately linear, or flat, in the direction of the gradient. We describe how the presence of these regions necessitates care in both interpreting past results that claimed to find critical points of neural network losses and in designing second-order methods for optimizing neural networks. △ Less

Submitted 23 March, 2020; originally announced March 2020.

Comments: 18 pages, 5 figures

arXiv:2001.01681 [pdf, other]

doi 10.1364/OE.27.014009

Design of optical neural networks with component imprecisions

Authors: Michael Y. -S. Fang, Sasikanth Manipatruni, Casimir Wierzynski, Amir Khosrowshahi, Michael R. DeWeese

Abstract: For the benefit of designing scalable, fault resistant optical neural networks (ONNs), we investigate the effects architectural designs have on the ONNs' robustness to imprecise components. We train two ONNs -- one with a more tunable design (GridNet) and one with better fault tolerance (FFTNet) -- to classify handwritten digits. When simulated without any imperfections, GridNet yields a better ac… ▽ More For the benefit of designing scalable, fault resistant optical neural networks (ONNs), we investigate the effects architectural designs have on the ONNs' robustness to imprecise components. We train two ONNs -- one with a more tunable design (GridNet) and one with better fault tolerance (FFTNet) -- to classify handwritten digits. When simulated without any imperfections, GridNet yields a better accuracy (~98%) than FFTNet (~95%). However, under a small amount of error in their photonic components, the more fault tolerant FFTNet overtakes GridNet. We further provide thorough quantitative and qualitative analyses of ONNs' sensitivity to varying levels and types of imprecisions. Our results offer guidelines for the principled design of fault-tolerant ONNs as well as a foundation for further research. △ Less

Submitted 13 December, 2019; originally announced January 2020.

Journal ref: Optics express 27.10 (2019): 14009-14029

arXiv:1901.10603 [pdf, ps, other]

Numerically Recovering the Critical Points of a Deep Linear Autoencoder

Authors: Charles G. Frye, Neha S. Wadia, Michael R. DeWeese, Kristofer E. Bouchard

Abstract: Numerically locating the critical points of non-convex surfaces is a long-standing problem central to many fields. Recently, the loss surfaces of deep neural networks have been explored to gain insight into outstanding questions in optimization, generalization, and network architecture design. However, the degree to which recently-proposed methods for numerically recovering critical points actuall… ▽ More Numerically locating the critical points of non-convex surfaces is a long-standing problem central to many fields. Recently, the loss surfaces of deep neural networks have been explored to gain insight into outstanding questions in optimization, generalization, and network architecture design. However, the degree to which recently-proposed methods for numerically recovering critical points actually do so has not been thoroughly evaluated. In this paper, we examine this issue in a case for which the ground truth is known: the deep linear autoencoder. We investigate two sub-problems associated with numerical critical point identification: first, because of large parameter counts, it is infeasible to find all of the critical points for contemporary neural networks, necessitating sampling approaches whose characteristics are poorly understood; second, the numerical tolerance for accurately identifying a critical point is unknown, and conservative tolerances are difficult to satisfy. We first identify connections between recently-proposed methods and well-understood methods in other fields, including chemical physics, economics, and algebraic geometry. We find that several methods work well at recovering certain information about loss surfaces, but fail to take an unbiased sample of critical points. Furthermore, numerical tolerance must be very strict to ensure that numerically-identified critical points have similar properties to true analytical critical points. We also identify a recently-published Newton method for optimization that outperforms previous methods as a critical point-finding algorithm. We expect our results will guide future attempts to numerically study critical points in large nonlinear neural networks. △ Less

Submitted 29 January, 2019; originally announced January 2019.

arXiv:1504.04756 [pdf, other]

Time Resolution Dependence of Information Measures for Spiking Neurons: Atoms, Scaling, and Universality

Authors: Sarah E. Marzen, Michael R. DeWeese, James P. Crutchfield

Abstract: The mutual information between stimulus and spike-train response is commonly used to monitor neural coding efficiency, but neuronal computation broadly conceived requires more refined and targeted information measures of input-output joint processes. A first step towards that larger goal is to develop information measures for individual output processes, including information generation (entropy r… ▽ More The mutual information between stimulus and spike-train response is commonly used to monitor neural coding efficiency, but neuronal computation broadly conceived requires more refined and targeted information measures of input-output joint processes. A first step towards that larger goal is to develop information measures for individual output processes, including information generation (entropy rate), stored information (statistical complexity), predictable information (excess entropy), and active information accumulation (bound information rate). We calculate these for spike trains generated by a variety of noise-driven integrate-and-fire neurons as a function of time resolution and for alternating renewal processes. We show that their time-resolution dependence reveals coarse-grained structural properties of interspike interval statistics; e.g., $τ$-entropy rates that diverge less quickly than the firing rate indicate interspike interval correlations. We also find evidence that the excess entropy and regularized statistical complexity of different types of integrate-and-fire neurons are universal in the continuous-time limit in the sense that they do not depend on mechanism details. This suggests a surprising simplicity in the spike trains generated by these model neurons. Interestingly, neurons with gamma-distributed ISIs and neurons whose spike trains are alternating renewal processes do not fall into the same universality class. These results lead to two conclusions. First, the dependence of information measures on time resolution reveals mechanistic details about spike train generation. Second, information measures can be used as model selection tools for analyzing spike train processes. △ Less

Submitted 18 April, 2015; originally announced April 2015.

Comments: 20 pages, 6 figures; http://csc.ucdavis.edu/~cmg/compmech/pubs/trdctim.htm

arXiv:0906.4779 [pdf, other]

Minimum Probability Flow Learning

Authors: Jascha Sohl-Dickstein, Peter Battaglino, Michael R. DeWeese

Abstract: Fitting probabilistic models to data is often difficult, due to the general intractability of the partition function and its derivatives. Here we propose a new parameter estimation technique that does not require computing an intractable normalization factor or sampling from the equilibrium distribution of the model. This is achieved by establishing dynamics that would transform the observed data… ▽ More Fitting probabilistic models to data is often difficult, due to the general intractability of the partition function and its derivatives. Here we propose a new parameter estimation technique that does not require computing an intractable normalization factor or sampling from the equilibrium distribution of the model. This is achieved by establishing dynamics that would transform the observed data distribution into the model distribution, and then setting as the objective the minimization of the KL divergence between the data distribution and the distribution produced by running the dynamics for an infinitesimal time. Score matching, minimum velocity learning, and certain forms of contrastive divergence are shown to be special cases of this learning technique. We demonstrate parameter estimation in Ising models, deep belief networks and an independent component analysis model of natural scenes. In the Ising model case, current state of the art techniques are outperformed by at least an order of magnitude in learning time, with lower error in recovered coupling parameters. △ Less

Submitted 24 September, 2011; v1 submitted 25 June, 2009; originally announced June 2009.

Comments: Updated to match ICML conference proceedings

Showing 1–8 of 8 results for author: DeWeese, M R