-
In-context Learning and Gradient Descent Revisited
Authors:
Gilad Deutch,
Nadav Magar,
Tomer Bar Natan,
Guy Dar
Abstract:
In-context learning (ICL) has shown impressive results in few-shot learning tasks, yet its underlying mechanism is still not fully understood. A recent line of work suggests that ICL performs gradient descent (GD)-based optimization implicitly. While appealing, much of the research focuses on simplified settings, where the parameters of a shallow model are optimized. In this work, we revisit evidence for ICL-GD correspondence on realistic NLP tasks and models. We find gaps in evaluation, both in terms of problematic metrics and insufficient baselines. We show that surprisingly, even untrained models achieve comparable ICL-GD similarity scores despite not exhibiting ICL. Next, we explore a major discrepancy in the flow of information throughout the model between ICL and GD, which we term Layer Causality. We propose a simple GD-based optimization procedure that respects layer causality, and show it improves similarity scores significantly.
Submitted 31 March, 2024; v1 submitted 13 November, 2023;
originally announced November 2023.
-
Analyzing Transformers in Embedding Space
Authors:
Guy Dar,
Mor Geva,
Ankit Gupta,
Jonathan Berant
Abstract:
Understanding Transformer-based models has attracted significant attention, as they lie at the heart of recent technological advances across machine learning. While most interpretability methods rely on running models over inputs, recent work has shown that a zero-pass approach, where parameters are interpreted directly without a forward/backward pass, is feasible for some Transformer parameters, and for two-layer attention networks. In this work, we present a theoretical analysis where all parameters of a trained Transformer are interpreted by projecting them into the embedding space, that is, the space of vocabulary items they operate on. We derive a simple theoretical framework to support our arguments and provide ample evidence for its validity. First, we present an empirical analysis showing that parameters of both pretrained and fine-tuned models can be interpreted in embedding space. Second, we present two applications of our framework: (a) aligning the parameters of different models that share a vocabulary, and (b) constructing a classifier without training by ``translating'' the parameters of a fine-tuned classifier to parameters of a different model that was only pretrained. Overall, our findings open the door to interpretation methods that, at least in part, abstract away from model specifics and operate in the embedding space only.
Submitted 24 December, 2023; v1 submitted 6 September, 2022;
originally announced September 2022.
-
LM-Debugger: An Interactive Tool for Inspection and Intervention in Transformer-Based Language Models
Authors:
Mor Geva,
Avi Caciularu,
Guy Dar,
Paul Roit,
Shoval Sadde,
Micah Shlain,
Bar Tamir,
Yoav Goldberg
Abstract:
The opaque nature and unexplained behavior of transformer-based language models (LMs) have spurred a wide interest in interpreting their predictions. However, current interpretation methods mostly focus on probing models from outside, executing behavioral tests, and analyzing salient input features, while the internal prediction construction process is largely not understood. In this work, we introduce LM-Debugger, an interactive debugger tool for transformer-based LMs, which provides a fine-grained interpretation of the model's internal prediction process, as well as a powerful framework for intervening in LM behavior. For its backbone, LM-Debugger relies on a recent method that interprets the inner token representations and their updates by the feed-forward layers in the vocabulary space. We demonstrate the utility of LM-Debugger for single-prediction debugging, by inspecting the internal disambiguation process done by GPT2. Moreover, we show how easily LM-Debugger allows one to shift model behavior in a direction of the user's choice, by identifying a few vectors in the network and inducing effective interventions to the prediction process. We release LM-Debugger as an open-source tool and a demo over GPT2 models.
Submitted 12 October, 2022; v1 submitted 26 April, 2022;
originally announced April 2022.
-
Memory-efficient Transformers via Top-$k$ Attention
Authors:
Ankit Gupta,
Guy Dar,
Shaya Goodman,
David Ciprut,
Jonathan Berant
Abstract:
Following the success of dot-product attention in Transformers, numerous approximations have been recently proposed to address its quadratic complexity with respect to the input length. While these variants are memory and compute efficient, it is not possible to directly use them with popular pre-trained language models trained using vanilla attention, without an expensive corrective pre-training stage. In this work, we propose a simple yet highly accurate approximation for vanilla attention. We process the queries in chunks, and for each query, compute the top-$k$ scores with respect to the keys. Our approach offers several advantages: (a) its memory usage is linear in the input size, similar to linear attention variants, such as Performer and RFA, (b) it is a drop-in replacement for vanilla attention that does not require any corrective pre-training, and (c) it can also lead to significant memory savings in the feed-forward layers after casting them into the familiar query-key-value framework. We evaluate the quality of top-$k$ approximation for multi-head attention layers on the Long Range Arena Benchmark, and for feed-forward layers of T5 and UnifiedQA on multiple QA datasets. We show our approach leads to accuracy that is nearly identical to vanilla attention in multiple setups including training from scratch, fine-tuning, and zero-shot inference.
Submitted 12 June, 2021;
originally announced June 2021.
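The chunked top-$k$ idea described in this abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `topk_attention`, the `chunk_size` parameter, the single-head layout, and the 1/sqrt(d) score scaling are all assumptions made for illustration. Queries are processed a chunk at a time, and for each query only the k largest scores against the keys enter the softmax.

```python
import numpy as np

def topk_attention(Q, K, V, k, chunk_size=2):
    """Illustrative single-head top-k attention (hypothetical helper).

    Q: (n_q, d) queries, K: (n_k, d) keys, V: (n_k, d_v) values.
    Requires 1 <= k <= n_k.
    """
    n_q, d = Q.shape
    out = np.empty((n_q, V.shape[1]), dtype=float)
    for start in range(0, n_q, chunk_size):
        q = Q[start:start + chunk_size]                 # (c, d) chunk of queries
        scores = q @ K.T / np.sqrt(d)                   # (c, n_k); scaling is an assumption
        # Indices of the k largest scores per query (order within the k is arbitrary).
        idx = np.argpartition(scores, -k, axis=1)[:, -k:]
        top = np.take_along_axis(scores, idx, axis=1)   # (c, k)
        # Numerically stable softmax restricted to the selected scores.
        top = top - top.max(axis=1, keepdims=True)
        w = np.exp(top)
        w /= w.sum(axis=1, keepdims=True)
        # Weighted sum of the k selected value rows for each query.
        out[start:start + chunk_size] = np.einsum("ck,ckd->cd", w, V[idx])
    return out
```

Only one chunk of the score matrix exists at a time, which is what keeps the peak memory of the score computation proportional to `chunk_size * n_k` rather than `n_q * n_k` in this sketch. Setting `k` equal to the number of keys recovers ordinary softmax attention.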
-
Comment on "On two-dimensional magnetohydrodynamic turbulence" [Phys. Plasmas, 8, 3282 (2001)]
Authors:
Mahendra K. Verma,
Gaurav Dar,
V. Eswaran
Abstract:
Biskamp and Schwarz [Phys. Plasmas, 8, 3282 (2001)] have reported that the energy spectrum of two-dimensional magnetohydrodynamic turbulence is proportional to $k^{-3/2}$, which is a prediction of Iroshnikov-Kraichnan phenomenology. In this comment we report some earlier results which conclusively show that for two-dimensional magnetohydrodynamic turbulence, Kolmogorov-like phenomenology (spectral index 5/3) is a better model than Iroshnikov-Kraichnan phenomenology; these results are based on energy flux analysis.
Submitted 14 April, 2002;
originally announced April 2002.
-
Energy transfer in two-dimensional magnetohydrodynamic turbulence: formalism and numerical results
Authors:
Gaurav Dar,
Mahendra K. Verma,
V. Eswaran
Abstract:
The basic entity of nonlinear interaction in Navier-Stokes and the Magnetohydrodynamic (MHD) equations is a wavenumber triad ({\bf k,p,q}) satisfying ${\bf k+p+q=0}$. The expression for the combined energy transfer from two of these wavenumbers to the third wavenumber is known. In this paper we introduce the idea of an effective energy transfer between a pair of modes by the mediation of the third mode, and find an expression for it. Then we apply this formalism to compute the energy transfer in the quasi-steady-state of two-dimensional MHD turbulence with large-scale kinetic forcing. The computation of energy fluxes and the energy transfer between different wavenumber shells is done using the data generated by the pseudo-spectral direct numerical simulation. The picture of energy flux that emerges is quite complex---there is a forward cascade of magnetic energy, an inverse cascade of kinetic energy, a flux of energy from the kinetic to the magnetic field, and a reverse flux which transfers the energy back to the kinetic from the magnetic. The energy transfer between different wavenumber shells is also complex---local and nonlocal transfers often possess opposing features, i.e., energy transfer between some wavenumber shells occurs from kinetic to magnetic, and between other wavenumber shells this transfer is reversed. The net transfer of energy is from kinetic to magnetic. The results obtained from the studies of flux and shell-to-shell energy transfer are consistent with each other.
Submitted 3 September, 2001;
originally announced September 2001.
-
A new approach to study energy transfer in turbulence
Authors:
Gaurav Dar,
Mahendra K. Verma,
V. Eswaran
Abstract:
The unit of nonlinear interaction in Navier-Stokes and the Magnetohydrodynamic (MHD) equations is a wavenumber triad ({\bf k,p,q}) satisfying ${\bf k+p+q=0}$. The expression for the combined energy transfer from two of these wavenumbers to the third wavenumber is known. In this paper we introduce the idea of an effective energy transfer between a pair of modes through the mediation of the third mode and then find an expression for it. In fluid turbulence, energy transfer takes place between a pair of velocity modes, whereas in MHD turbulence energy transfer takes place between (1) a pair of velocity modes, (2) a pair of magnetic modes, and (3) a velocity and a magnetic mode in a triad. In this paper we have obtained the expression for each of these transfers. We also show how the effective mode-to-mode energy transfer rate can be utilised to study energy cascades and shell-to-shell energy transfer rates.
Submitted 7 June, 2000;
originally announced June 2000.
-
Energy transfer in two-dimensional magnetohydrodynamic turbulence
Authors:
Gaurav Dar,
Mahendra K. Verma,
V. Eswaran
Abstract:
In an earlier paper (physics/0006012) we had developed a method for computing the effective energy transfer between any two Fourier modes in fluid or magnetohydrodynamic (MHD) flows. This method is applied to a pseudo-spectral, direct numerical simulation (DNS) study of energy transfer in the quasi-steady state of 2-D MHD turbulence with large scale kinetic forcing. Two aspects of energy transfer are studied: the energy fluxes, and the energy transfer between different wavenumber regions ({\it shells}). The picture of energy fluxes that emerges is quite complex - there is a forward cascade of magnetic energy, an inverse cascade of kinetic energy, a flux of energy from the kinetic to the magnetic field, and a reverse flux which transfers the energy back to the kinetic from the magnetic. The energy transfer between different wave number shells is also complex - local and nonlocal transfers often possess opposing features, i.e., energy transfer between some wave number shells occurs from kinetic to magnetic, and between other wave number shells this transfer is reversed. The net transfer of energy is from kinetic to magnetic. The results obtained from the flux studies and the shell-to-shell energy transfer studies are consistent with each other.
Submitted 14 July, 2000; v1 submitted 28 October, 1998;
originally announced October 1998.
-
Initial Condition Sensitivity of Global Quantities in Magnetohydrodynamic Turbulence
Authors:
Gaurav Dar,
Mahendra K. Verma,
V. Eswaran
Abstract:
In this paper we study the effect of subtle changes in initial conditions on the evolution of global quantities in two-dimensional Magnetohydrodynamic (MHD) turbulence. We find that a change in the initial phases of complex Fourier modes of the Elsässer variables, while keeping the initial values of total energy, cross helicity and Alfvén ratio unchanged, has a significant effect on the evolution of cross helicity. On the contrary, the total energy and Alfvén ratio are insensitive to the initial phases. Our simulations are based on direct numerical simulation using the pseudo-spectral method.
Submitted 16 March, 1998;
originally announced March 1998.
-
Probing Physics of Magnetohydrodynamic Turbulence Using Direct Numerical Simulation
Authors:
Mahendra K. Verma,
Gaurav Dar
Abstract:
The energy spectrum and the nonlinear cascade rates of MHD turbulence are not clearly understood. We have addressed this problem using direct numerical simulation and analytical calculations. Our numerical simulations indicate that Kolmogorov-like phenomenology with a $k^{-5/3}$ energy spectrum, rather than Kraichnan's $k^{-3/2}$, appears to be applicable in MHD turbulence. Here, we also construct a self-consistent renormalization group procedure in which the mean magnetic field gets renormalized, which in turn yields a $k^{-5/3}$ energy spectrum. The numerical simulations also show that the fluid energy is transferred to magnetic energy. This result could shed light on the generation of magnetic fields, as in the dynamo mechanism.
Submitted 21 March, 1998; v1 submitted 16 March, 1998;
originally announced March 1998.