
Showing 1–11 of 11 results for author: Nichani, E

  1. arXiv:2402.14735  [pdf, other]

    cs.LG cs.IT stat.ML

    How Transformers Learn Causal Structure with Gradient Descent

    Authors: Eshaan Nichani, Alex Damian, Jason D. Lee

    Abstract: The incredible success of transformers on sequence modeling tasks can be largely attributed to the self-attention mechanism, which allows information to be transferred between different parts of a sequence. Self-attention allows transformers to encode causal structure which makes them particularly suitable for sequence modeling. However, the process by which transformers learn such causal structur…

    Submitted 22 February, 2024; originally announced February 2024.

  2. arXiv:2311.13774  [pdf, other]

    cs.LG stat.ML

    Learning Hierarchical Polynomials with Three-Layer Neural Networks

    Authors: Zihao Wang, Eshaan Nichani, Jason D. Lee

    Abstract: We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form $h = g \circ p$ where $p : \mathbb{R}^d \rightarrow \mathbb{R}$ is a degree $k$ polynomial and $g: \mathbb{R} \rightarrow \mathbb{R}$ is a degree $q$ polynomial. This function class generalizes the single-index mod…

    Submitted 22 November, 2023; originally announced November 2023.

    Comments: 57 pages

  3. arXiv:2305.17333  [pdf, other]

    cs.LG cs.CL

    Fine-Tuning Language Models with Just Forward Passes

    Authors: Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, Sanjeev Arora

    Abstract: Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zeroth-order opti…

    Submitted 11 January, 2024; v1 submitted 26 May, 2023; originally announced May 2023.

    Comments: Accepted by NeurIPS 2023 (oral). Code available at https://github.com/princeton-nlp/MeZO

  4. arXiv:2305.10633  [pdf, other]

    cs.LG cs.IT stat.ML

    Smoothing the Landscape Boosts the Signal for SGD: Optimal Sample Complexity for Learning Single Index Models

    Authors: Alex Damian, Eshaan Nichani, Rong Ge, Jason D. Lee

    Abstract: We focus on the task of learning a single index model $σ(w^\star \cdot x)$ with respect to the isotropic Gaussian distribution in $d$ dimensions. Prior work has shown that the sample complexity of learning $w^\star$ is governed by the information exponent $k^\star$ of the link function $σ$, which is defined as the index of the first nonzero Hermite coefficient of $σ$. Ben Arous et al. (2021) showe…

    Submitted 17 May, 2023; originally announced May 2023.

  5. arXiv:2305.06986  [pdf, other]

    cs.LG stat.ML

    Provable Guarantees for Nonlinear Feature Learning in Three-Layer Neural Networks

    Authors: Eshaan Nichani, Alex Damian, Jason D. Lee

    Abstract: One of the central questions in the theory of deep learning is to understand how neural networks learn hierarchical features. The ability of deep networks to extract salient features is crucial to both their outstanding generalization ability and the modern deep learning paradigm of pretraining and finetuning. However, this feature learning process remains poorly understood from a theoretical per…

    Submitted 31 October, 2023; v1 submitted 11 May, 2023; originally announced May 2023.

    Comments: v2: NeurIPS 2023 camera ready

  6. arXiv:2209.15594  [pdf, other]

    cs.LG cs.IT math.OC stat.ML

    Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability

    Authors: Alex Damian, Eshaan Nichani, Jason D. Lee

    Abstract: Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness $S(θ)$, is bounded by $2/η$, training is "stable" and the training loss decreases monotonically. Recent works, however, have observed that this assumption does not hold when training modern neural networks with full batch or large batch gradient descent. Most recently, Cohen e…

    Submitted 10 April, 2023; v1 submitted 30 September, 2022; originally announced September 2022.

    Comments: ICLR 2023, first two authors contributed equally

  7. arXiv:2208.13153  [pdf, ps, other]

    math.PR math.ST

    Metastable Mixing of Markov Chains: Efficiently Sampling Low Temperature Exponential Random Graphs

    Authors: Guy Bresler, Dheeraj Nagaraj, Eshaan Nichani

    Abstract: In this paper we consider the problem of sampling from the low-temperature exponential random graph model (ERGM). The usual approach is via Markov chain Monte Carlo, but Bhamidi et al. showed that any local Markov chain suffers from an exponentially large mixing time due to metastable states. We instead consider metastable mixing, a notion of approximate mixing relative to the stationary distribut…

    Submitted 4 October, 2022; v1 submitted 28 August, 2022; originally announced August 2022.

    Comments: No figures. We don't do that around here

  8. arXiv:2207.01237  [pdf, other]

    stat.ME

    Causal Structure Discovery between Clusters of Nodes Induced by Latent Factors

    Authors: Chandler Squires, Annie Yun, Eshaan Nichani, Raj Agrawal, Caroline Uhler

    Abstract: We consider the problem of learning the structure of a causal directed acyclic graph (DAG) model in the presence of latent variables. We define latent factor causal models (LFCMs) as a restriction on causal DAG models with latent variables, which are composed of clusters of observed variables that share the same latent parent and connections between these clusters given by edges pointing from the…

    Submitted 5 July, 2022; v1 submitted 4 July, 2022; originally announced July 2022.

    Comments: Causal Learning and Reasoning (CLeaR) 2022

  9. arXiv:2206.03688  [pdf, other]

    cs.LG stat.ML

    Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials

    Authors: Eshaan Nichani, Yu Bai, Jason D. Lee

    Abstract: A recent goal in the theory of deep learning is to identify how neural networks can escape the "lazy training," or Neural Tangent Kernel (NTK) regime, where the network is coupled with its first order Taylor expansion at initialization. While the NTK is minimax optimal for learning dense polynomials (Ghorbani et al., 2021), it cannot learn features, and hence has poor sample complexity for learning…

    Submitted 26 November, 2022; v1 submitted 8 June, 2022; originally announced June 2022.

    Comments: v2: NeurIPS 2022 camera ready version

  10. arXiv:2010.09610  [pdf, other]

    cs.LG stat.ML

    Increasing Depth Leads to U-Shaped Test Risk in Over-parameterized Convolutional Networks

    Authors: Eshaan Nichani, Adityanarayanan Radhakrishnan, Caroline Uhler

    Abstract: Recent works have demonstrated that increasing model capacity through width in over-parameterized neural networks leads to a decrease in test risk. For neural networks, however, model capacity can also be increased through depth, yet understanding the impact of increasing depth on test risk remains an open question. In this work, we demonstrate that the test risk of over-parameterized convolutiona…

    Submitted 4 June, 2021; v1 submitted 19 October, 2020; originally announced October 2020.

    Comments: 27 pages, 23 figures

  11. arXiv:2003.06340  [pdf, other]

    cs.LG stat.ML

    On Alignment in Deep Linear Neural Networks

    Authors: Adityanarayanan Radhakrishnan, Eshaan Nichani, Daniel Bernstein, Caroline Uhler

    Abstract: We study the properties of alignment, a form of implicit regularization, in linear neural networks under gradient descent. We define alignment for fully connected networks with multidimensional outputs and show that it is a natural extension of alignment in networks with 1-dimensional outputs as defined by Ji and Telgarsky, 2018. While in fully connected networks, there always exists a global mini…

    Submitted 16 June, 2020; v1 submitted 13 March, 2020; originally announced March 2020.