-
How Transformers Learn Causal Structure with Gradient Descent
Authors:
Eshaan Nichani,
Alex Damian,
Jason D. Lee
Abstract:
The incredible success of transformers on sequence modeling tasks can be largely attributed to the self-attention mechanism, which allows information to be transferred between different parts of a sequence. Self-attention allows transformers to encode causal structure which makes them particularly suitable for sequence modeling. However, the process by which transformers learn such causal structur…
▽ More
The incredible success of transformers on sequence modeling tasks can be largely attributed to the self-attention mechanism, which allows information to be transferred between different parts of a sequence. Self-attention allows transformers to encode causal structure which makes them particularly suitable for sequence modeling. However, the process by which transformers learn such causal structure via gradient-based training algorithms remains poorly understood. To better understand this process, we introduce an in-context learning task that requires learning latent causal structure. We prove that gradient descent on a simplified two-layer transformer learns to solve this task by encoding the latent causal graph in the first attention layer. The key insight of our proof is that the gradient of the attention matrix encodes the mutual information between tokens. As a consequence of the data processing inequality, the largest entries of this gradient correspond to edges in the latent causal graph. As a special case, when the sequences are generated from in-context Markov chains, we prove that transformers learn an induction head (Olsson et al., 2022). We confirm our theoretical findings by showing that transformers trained on our in-context learning task are able to recover a wide variety of causal structures.
△ Less
Submitted 22 February, 2024;
originally announced February 2024.
-
Learning Hierarchical Polynomials with Three-Layer Neural Networks
Authors:
Zihao Wang,
Eshaan Nichani,
Jason D. Lee
Abstract:
We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form $h = g \circ p$ where $p : \mathbb{R}^d \rightarrow \mathbb{R}$ is a degree $k$ polynomial and $g: \mathbb{R} \rightarrow \mathbb{R}$ is a degree $q$ polynomial. This function class generalizes the single-index mod…
▽ More
We study the problem of learning hierarchical polynomials over the standard Gaussian distribution with three-layer neural networks. We specifically consider target functions of the form $h = g \circ p$ where $p : \mathbb{R}^d \rightarrow \mathbb{R}$ is a degree $k$ polynomial and $g: \mathbb{R} \rightarrow \mathbb{R}$ is a degree $q$ polynomial. This function class generalizes the single-index model, which corresponds to $k=1$, and is a natural class of functions possessing an underlying hierarchical structure. Our main result shows that for a large subclass of degree $k$ polynomials $p$, a three-layer neural network trained via layerwise gradient descent on the square loss learns the target $h$ up to vanishing test error in $\widetilde{\mathcal{O}}(d^k)$ samples and polynomial time. This is a strict improvement over kernel methods, which require $\widetilde Θ(d^{kq})$ samples, as well as existing guarantees for two-layer networks, which require the target function to be low-rank. Our result also generalizes prior works on three-layer neural networks, which were restricted to the case of $p$ being a quadratic. When $p$ is indeed a quadratic, we achieve the information-theoretically optimal sample complexity $\widetilde{\mathcal{O}}(d^2)$, which is an improvement over prior work~\citep{nichani2023provable} requiring a sample size of $\widetildeΘ(d^4)$. Our proof proceeds by showing that during the initial stage of training the network performs feature learning to recover the feature $p$ with $\widetilde{\mathcal{O}}(d^k)$ samples. This work demonstrates the ability of three-layer neural networks to learn complex features and as a result, learn a broad class of hierarchical functions.
△ Less
Submitted 22 November, 2023;
originally announced November 2023.
-
Fine-Tuning Language Models with Just Forward Passes
Authors:
Sadhika Malladi,
Tianyu Gao,
Eshaan Nichani,
Alex Damian,
Jason D. Lee,
Danqi Chen,
Sanjeev Arora
Abstract:
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zerothorder opti…
▽ More
Fine-tuning language models (LMs) has yielded success on diverse downstream tasks, but as LMs grow in size, backpropagation requires a prohibitively large amount of memory. Zeroth-order (ZO) methods can in principle estimate gradients using only two forward passes but are theorized to be catastrophically slow for optimizing large models. In this work, we propose a memory-efficient zerothorder optimizer (MeZO), adapting the classical ZO-SGD method to operate in-place, thereby fine-tuning LMs with the same memory footprint as inference. For example, with a single A100 80GB GPU, MeZO can train a 30-billion parameter model, whereas fine-tuning with backpropagation can train only a 2.7B LM with the same budget. We conduct comprehensive experiments across model types (masked and autoregressive LMs), model scales (up to 66B), and downstream tasks (classification, multiple-choice, and generation). Our results demonstrate that (1) MeZO significantly outperforms in-context learning and linear probing; (2) MeZO achieves comparable performance to fine-tuning with backpropagation across multiple tasks, with up to 12x memory reduction and up to 2x GPU-hour reduction in our implementation; (3) MeZO is compatible with both full-parameter and parameter-efficient tuning techniques such as LoRA and prefix tuning; (4) MeZO can effectively optimize non-differentiable objectives (e.g., maximizing accuracy or F1). We support our empirical findings with theoretical insights, highlighting how adequate pre-training and task prompts enable MeZO to fine-tune huge models, despite classical ZO analyses suggesting otherwise.
△ Less
Submitted 11 January, 2024; v1 submitted 26 May, 2023;
originally announced May 2023.
-
Smoothing the Landscape Boosts the Signal for SGD: Optimal Sample Complexity for Learning Single Index Models
Authors:
Alex Damian,
Eshaan Nichani,
Rong Ge,
Jason D. Lee
Abstract:
We focus on the task of learning a single index model $σ(w^\star \cdot x)$ with respect to the isotropic Gaussian distribution in $d$ dimensions. Prior work has shown that the sample complexity of learning $w^\star$ is governed by the information exponent $k^\star$ of the link function $σ$, which is defined as the index of the first nonzero Hermite coefficient of $σ$. Ben Arous et al. (2021) showe…
▽ More
We focus on the task of learning a single index model $σ(w^\star \cdot x)$ with respect to the isotropic Gaussian distribution in $d$ dimensions. Prior work has shown that the sample complexity of learning $w^\star$ is governed by the information exponent $k^\star$ of the link function $σ$, which is defined as the index of the first nonzero Hermite coefficient of $σ$. Ben Arous et al. (2021) showed that $n \gtrsim d^{k^\star-1}$ samples suffice for learning $w^\star$ and that this is tight for online SGD. However, the CSQ lower bound for gradient based methods only shows that $n \gtrsim d^{k^\star/2}$ samples are necessary. In this work, we close the gap between the upper and lower bounds by showing that online SGD on a smoothed loss learns $w^\star$ with $n \gtrsim d^{k^\star/2}$ samples. We also draw connections to statistical analyses of tensor PCA and to the implicit regularization effects of minibatch SGD on empirical losses.
△ Less
Submitted 17 May, 2023;
originally announced May 2023.
-
Provable Guarantees for Nonlinear Feature Learning in Three-Layer Neural Networks
Authors:
Eshaan Nichani,
Alex Damian,
Jason D. Lee
Abstract:
One of the central questions in the theory of deep learning is to understand how neural networks learn hierarchical features. The ability of deep networks to extract salient features is crucial to both their outstanding generalization ability and the modern deep learning paradigm of pretraining and finetuneing. However, this feature learning process remains poorly understood from a theoretical per…
▽ More
One of the central questions in the theory of deep learning is to understand how neural networks learn hierarchical features. The ability of deep networks to extract salient features is crucial to both their outstanding generalization ability and the modern deep learning paradigm of pretraining and finetuneing. However, this feature learning process remains poorly understood from a theoretical perspective, with existing analyses largely restricted to two-layer networks. In this work we show that three-layer neural networks have provably richer feature learning capabilities than two-layer networks. We analyze the features learned by a three-layer network trained with layer-wise gradient descent, and present a general purpose theorem which upper bounds the sample complexity and width needed to achieve low test error when the target has specific hierarchical structure. We instantiate our framework in specific statistical learning settings -- single-index models and functions of quadratic features -- and show that in the latter setting three-layer networks obtain a sample complexity improvement over all existing guarantees for two-layer networks. Crucially, this sample complexity improvement relies on the ability of three-layer networks to efficiently learn nonlinear features. We then establish a concrete optimization-based depth separation by constructing a function which is efficiently learnable via gradient descent on a three-layer network, yet cannot be learned efficiently by a two-layer network. Our work makes progress towards understanding the provable benefit of three-layer neural networks over two-layer networks in the feature learning regime.
△ Less
Submitted 31 October, 2023; v1 submitted 11 May, 2023;
originally announced May 2023.
-
Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability
Authors:
Alex Damian,
Eshaan Nichani,
Jason D. Lee
Abstract:
Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness $S(θ)$, is bounded by $2/η$, training is "stable" and the training loss decreases monotonically. Recent works, however, have observed that this assumption does not hold when training modern neural networks with full batch or large batch gradient descent. Most recently, Cohen e…
▽ More
Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness $S(θ)$, is bounded by $2/η$, training is "stable" and the training loss decreases monotonically. Recent works, however, have observed that this assumption does not hold when training modern neural networks with full batch or large batch gradient descent. Most recently, Cohen et al. (2021) observed two important phenomena. The first, dubbed progressive sharpening, is that the sharpness steadily increases throughout training until it reaches the instability cutoff $2/η$. The second, dubbed edge of stability, is that the sharpness hovers at $2/η$ for the remainder of training while the loss continues decreasing, albeit non-monotonically. We demonstrate that, far from being chaotic, the dynamics of gradient descent at the edge of stability can be captured by a cubic Taylor expansion: as the iterates diverge in direction of the top eigenvector of the Hessian due to instability, the cubic term in the local Taylor expansion of the loss function causes the curvature to decrease until stability is restored. This property, which we call self-stabilization, is a general property of gradient descent and explains its behavior at the edge of stability. A key consequence of self-stabilization is that gradient descent at the edge of stability implicitly follows projected gradient descent (PGD) under the constraint $S(θ) \le 2/η$. Our analysis provides precise predictions for the loss, sharpness, and deviation from the PGD trajectory throughout training, which we verify both empirically in a number of standard settings and theoretically under mild conditions. Our analysis uncovers the mechanism for gradient descent's implicit bias towards stability.
△ Less
Submitted 10 April, 2023; v1 submitted 30 September, 2022;
originally announced September 2022.
-
Metastable Mixing of Markov Chains: Efficiently Sampling Low Temperature Exponential Random Graphs
Authors:
Guy Bresler,
Dheeraj Nagaraj,
Eshaan Nichani
Abstract:
In this paper we consider the problem of sampling from the low-temperature exponential random graph model (ERGM). The usual approach is via Markov chain Monte Carlo, but Bhamidi et al. showed that any local Markov chain suffers from an exponentially large mixing time due to metastable states. We instead consider metastable mixing, a notion of approximate mixing relative to the stationary distribut…
▽ More
In this paper we consider the problem of sampling from the low-temperature exponential random graph model (ERGM). The usual approach is via Markov chain Monte Carlo, but Bhamidi et al. showed that any local Markov chain suffers from an exponentially large mixing time due to metastable states. We instead consider metastable mixing, a notion of approximate mixing relative to the stationary distribution, for which it turns out to suffice to mix only within a collection of metastable states. We show that the Glauber dynamics for the ERGM at any temperature -- except at a lower-dimensional critical set of parameters -- when initialized at $G(n,p)$ for the right choice of $p$ has a metastable mixing time of $O(n^2\log n)$ to within total variation distance $\exp(-Ω(n))$.
△ Less
Submitted 4 October, 2022; v1 submitted 28 August, 2022;
originally announced August 2022.
-
Causal Structure Discovery between Clusters of Nodes Induced by Latent Factors
Authors:
Chandler Squires,
Annie Yun,
Eshaan Nichani,
Raj Agrawal,
Caroline Uhler
Abstract:
We consider the problem of learning the structure of a causal directed acyclic graph (DAG) model in the presence of latent variables. We define latent factor causal models (LFCMs) as a restriction on causal DAG models with latent variables, which are composed of clusters of observed variables that share the same latent parent and connections between these clusters given by edges pointing from the…
▽ More
We consider the problem of learning the structure of a causal directed acyclic graph (DAG) model in the presence of latent variables. We define latent factor causal models (LFCMs) as a restriction on causal DAG models with latent variables, which are composed of clusters of observed variables that share the same latent parent and connections between these clusters given by edges pointing from the observed variables to latent variables. LFCMs are motivated by gene regulatory networks, where regulatory edges, corresponding to transcription factors, connect spatially clustered genes. We show identifiability results on this model and design a consistent three-stage algorithm that discovers clusters of observed nodes, a partial ordering over clusters, and finally, the entire structure over both observed and latent nodes. We evaluate our method in a synthetic setting, demonstrating its ability to almost perfectly recover the ground truth clustering even at relatively low sample sizes, as well as the ability to recover a significant number of the edges from observed variables to latent factors. Finally, we apply our method in a semi-synthetic setting to protein mass spectrometry data with a known ground truth network, and achieve almost perfect recovery of the ground truth variable clusters.
△ Less
Submitted 5 July, 2022; v1 submitted 4 July, 2022;
originally announced July 2022.
-
Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials
Authors:
Eshaan Nichani,
Yu Bai,
Jason D. Lee
Abstract:
A recent goal in the theory of deep learning is to identify how neural networks can escape the "lazy training," or Neural Tangent Kernel (NTK) regime, where the network is coupled with its first order Taylor expansion at initialization. While the NTK is minimax optimal for learning dense polynomials (Ghorbani et al, 2021), it cannot learn features, and hence has poor sample complexity for learning…
▽ More
A recent goal in the theory of deep learning is to identify how neural networks can escape the "lazy training," or Neural Tangent Kernel (NTK) regime, where the network is coupled with its first order Taylor expansion at initialization. While the NTK is minimax optimal for learning dense polynomials (Ghorbani et al, 2021), it cannot learn features, and hence has poor sample complexity for learning many classes of functions including sparse polynomials. Recent works have thus aimed to identify settings where gradient based algorithms provably generalize better than the NTK. One such example is the "QuadNTK" approach of Bai and Lee (2020), which analyzes the second-order term in the Taylor expansion. Bai and Lee (2020) show that the second-order term can learn sparse polynomials efficiently; however, it sacrifices the ability to learn general dense polynomials.
In this paper, we analyze how gradient descent on a two-layer neural network can escape the NTK regime by utilizing a spectral characterization of the NTK (Montanari and Zhong, 2020) and building on the QuadNTK approach. We first expand upon the spectral analysis to identify "good" directions in parameter space in which we can move without harming generalization. Next, we show that a wide two-layer neural network can jointly use the NTK and QuadNTK to fit target functions consisting of a dense low-degree term and a sparse high-degree term -- something neither the NTK nor the QuadNTK can do on their own. Finally, we construct a regularizer which encourages our parameter vector to move in the "good" directions, and show that gradient descent on the regularized loss will converge to a global minimizer, which also has low test error. This yields an end to end convergence and generalization guarantee with provable sample complexity improvement over both the NTK and QuadNTK on their own.
△ Less
Submitted 26 November, 2022; v1 submitted 8 June, 2022;
originally announced June 2022.
-
Increasing Depth Leads to U-Shaped Test Risk in Over-parameterized Convolutional Networks
Authors:
Eshaan Nichani,
Adityanarayanan Radhakrishnan,
Caroline Uhler
Abstract:
Recent works have demonstrated that increasing model capacity through width in over-parameterized neural networks leads to a decrease in test risk. For neural networks, however, model capacity can also be increased through depth, yet understanding the impact of increasing depth on test risk remains an open question. In this work, we demonstrate that the test risk of over-parameterized convolutiona…
▽ More
Recent works have demonstrated that increasing model capacity through width in over-parameterized neural networks leads to a decrease in test risk. For neural networks, however, model capacity can also be increased through depth, yet understanding the impact of increasing depth on test risk remains an open question. In this work, we demonstrate that the test risk of over-parameterized convolutional networks is a U-shaped curve (i.e. monotonically decreasing, then increasing) with increasing depth. We first provide empirical evidence for this phenomenon via image classification experiments using both ResNets and the convolutional neural tangent kernel (CNTK). We then present a novel linear regression framework for characterizing the impact of depth on test risk, and show that increasing depth leads to a U-shaped test risk for the linear CNTK. In particular, we prove that the linear CNTK corresponds to a depth-dependent linear transformation on the original space and characterize properties of this transformation. We then analyze over-parameterized linear regression under arbitrary linear transformations and, in simplified settings, provably identify the depths which minimize each of the bias and variance terms of the test risk.
△ Less
Submitted 4 June, 2021; v1 submitted 19 October, 2020;
originally announced October 2020.
-
On Alignment in Deep Linear Neural Networks
Authors:
Adityanarayanan Radhakrishnan,
Eshaan Nichani,
Daniel Bernstein,
Caroline Uhler
Abstract:
We study the properties of alignment, a form of implicit regularization, in linear neural networks under gradient descent. We define alignment for fully connected networks with multidimensional outputs and show that it is a natural extension of alignment in networks with 1-dimensional outputs as defined by Ji and Telgarsky, 2018. While in fully connected networks, there always exists a global mini…
▽ More
We study the properties of alignment, a form of implicit regularization, in linear neural networks under gradient descent. We define alignment for fully connected networks with multidimensional outputs and show that it is a natural extension of alignment in networks with 1-dimensional outputs as defined by Ji and Telgarsky, 2018. While in fully connected networks, there always exists a global minimum corresponding to an aligned solution, we analyze alignment as it relates to the training process. Namely, we characterize when alignment is an invariant of training under gradient descent by providing necessary and sufficient conditions for this invariant to hold. In such settings, the dynamics of gradient descent simplify, thereby allowing us to provide an explicit learning rate under which the network converges linearly to a global minimum. We then analyze networks with layer constraints such as convolutional networks. In this setting, we prove that gradient descent is equivalent to projected gradient descent, and that alignment is impossible with sufficiently large datasets.
△ Less
Submitted 16 June, 2020; v1 submitted 13 March, 2020;
originally announced March 2020.