
Showing 1–6 of 6 results for author: Yaida, S

  1. arXiv:2304.02034

    cs.LG cs.CL hep-th stat.ML

    Effective Theory of Transformers at Initialization

    Authors: Emily Dinan, Sho Yaida, Susan Zhang

    Abstract: We perform an effective-theory analysis of forward-backward signal propagation in wide and deep Transformers, i.e., residual neural networks with multi-head self-attention blocks and multilayer perceptron blocks. This analysis suggests particular width scalings of initialization and training hyperparameters for these models. We then take up such suggestions, training Vision and Language Transforme…

    Submitted 4 April, 2023; originally announced April 2023.

    Comments: 64 pages, 5 figures

  2. arXiv:2210.04909

    cs.LG hep-th stat.ML

    Meta-Principled Family of Hyperparameter Scaling Strategies

    Authors: Sho Yaida

    Abstract: In this note, we first derive a one-parameter family of hyperparameter scaling strategies that interpolates between the neural-tangent scaling and mean-field/maximal-update scaling. We then calculate the scalings of dynamical observables -- network outputs, neural tangent kernels, and differentials of neural tangent kernels -- for wide and deep neural networks. These calculations in turn reveal a…

    Submitted 18 October, 2022; v1 submitted 10 October, 2022; originally announced October 2022.

    Comments: 24 pages; v2: an addendum added

  3. arXiv:2106.10165

    cs.LG cs.AI hep-th stat.ML

    The Principles of Deep Learning Theory

    Authors: Daniel A. Roberts, Sho Yaida, Boris Hanin

    Abstract: This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are…

    Submitted 24 August, 2021; v1 submitted 18 June, 2021; originally announced June 2021.

    Comments: 471 pages, to be published by Cambridge University Press; v2: hyperlinks fixed, index added

    Report number: MIT-CTP/5306

    Journal ref: Cambridge University Press (2022)

  4. arXiv:1910.00019

    stat.ML cond-mat.dis-nn cs.LG hep-th

    Non-Gaussian processes and neural networks at finite widths

    Authors: Sho Yaida

    Abstract: Gaussian processes are ubiquitous in nature and engineering. A case in point is a class of neural networks in the infinite-width limit, whose priors correspond to Gaussian processes. Here we perturbatively extend this correspondence to finite-width neural networks, yielding non-Gaussian processes as priors. The methodology developed herein allows us to track the flow of preactivation distributions…

    Submitted 26 August, 2020; v1 submitted 30 September, 2019; originally announced October 2019.

    Comments: 33 pages, 3 figures; v2: final version accepted at MSML 2020, with some clarification on the connection to renormalization-group flow

  5. arXiv:1908.02729

    stat.ML cs.LG

    Robust Learning with Jacobian Regularization

    Authors: Judy Hoffman, Daniel A. Roberts, Sho Yaida

    Abstract: Design of reliable systems must guarantee stability against input perturbations. In machine learning, such guarantee entails preventing overfitting and ensuring robustness of models against corruption of input data. In order to maximize stability, we analyze and develop a computationally efficient implementation of Jacobian regularization that increases classification margins of neural networks. T…

    Submitted 7 August, 2019; originally announced August 2019.

    Comments: 21 pages, 10 figures

  6. arXiv:1810.00004

    stat.ML cs.LG

    Fluctuation-dissipation relations for stochastic gradient descent

    Authors: Sho Yaida

    Abstract: The notion of the stationary equilibrium ensemble has played a central role in statistical mechanics. In machine learning as well, training serves as generalized equilibration that drives the probability distribution of model parameters toward stationarity. Here, we derive stationary fluctuation-dissipation relations that link measurable quantities and hyperparameters in the stochastic gradient de…

    Submitted 21 December, 2018; v1 submitted 28 September, 2018; originally announced October 2018.

    Comments: 15 pages, 6 figures; v2: final version accepted at ICLR 2019, with derivations/assumptions clarified and Adam/AMSGrad experiments added