Search | arXiv e-print repository

Mechanics of Next Token Prediction with Self-Attention

Authors: Yingcong Li, Yixiao Huang, M. Emrullah Ildiz, Ankit Singh Rawat, Samet Oymak

Abstract: Transformer-based language models are trained on large datasets to predict the next token given an input sequence. Despite this simple training objective, they have led to revolutionary advances in natural language processing. Underlying this success is the self-attention mechanism. In this work, we ask: What does a single self-attention layer learn from next-token prediction? We show that training self-attention with gradient descent learns an automaton which generates the next token in two distinct steps: (1) Hard retrieval: Given an input sequence, self-attention precisely selects the high-priority input tokens associated with the last input token. (2) Soft composition: It then creates a convex combination of the high-priority tokens from which the next token can be sampled. Under suitable conditions, we rigorously characterize these mechanics through a directed graph over tokens extracted from the training data. We prove that gradient descent implicitly discovers the strongly-connected components (SCC) of this graph and self-attention learns to retrieve the tokens that belong to the highest-priority SCC available in the context window. Our theory relies on decomposing the model weights into a directional component and a finite component that correspond to hard retrieval and soft composition steps respectively. This also formalizes a related implicit bias formula conjectured in [Tarzanagh et al. 2023]. We hope that these findings shed light on how self-attention processes sequential data and pave the path toward demystifying more complex architectures.

Submitted 12 March, 2024; originally announced March 2024.

Comments: Accepted to AISTATS 2024
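
The directed token graph and its strongly-connected components mentioned in the abstract above can be illustrated with a small sketch. This is not the authors' construction: it assumes, for illustration only, that each training sample contributes an edge from the next token to every other token in its context, and it ignores any conditioning on the last input token. The function names (`build_token_graph`, `strongly_connected_components`) are hypothetical.

```python
from collections import defaultdict


def build_token_graph(samples):
    """Build a directed graph from (context_tokens, next_token) pairs.

    Assumption (not taken from the paper): the next token gets an edge
    to every other token that appears in its context.
    """
    graph = defaultdict(set)
    for context, next_token in samples:
        graph[next_token]  # make sure the node exists even without edges
        for token in context:
            graph[token]
            if token != next_token:
                graph[next_token].add(token)
    return dict(graph)


def reachable(graph, start):
    """Return the set of nodes reachable from start (including start)."""
    seen, stack = {start}, [start]
    while stack:
        node = stack.pop()
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                stack.append(neighbour)
    return seen


def strongly_connected_components(graph):
    """Group nodes that are mutually reachable (fine for small graphs)."""
    reach = {node: reachable(graph, node) for node in graph}
    components, assigned = [], set()
    for node in sorted(graph):
        if node in assigned:
            continue
        component = frozenset(m for m in reach[node] if node in reach[m])
        components.append(component)
        assigned |= component
    return components


samples = [(["a", "b", "c"], "a"), (["a", "b"], "b"), (["b", "c"], "b")]
graph = build_token_graph(samples)
components = strongly_connected_components(graph)
```

In this toy data, "a" and "b" each appear as the next token in a context containing the other, so they end up in one component, while "c" is never a next token and forms its own component. The mutual-reachability approach is quadratic and is used here only for readability; Tarjan's or Kosaraju's algorithm would be the usual choice for larger graphs.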

arXiv:2402.13512 [pdf, other]

From Self-Attention to Markov Models: Unveiling the Dynamics of Generative Transformers

Authors: M. Emrullah Ildiz, Yixiao Huang, Yingcong Li, Ankit Singh Rawat, Samet Oymak

Abstract: Modern language models rely on the transformer architecture and attention mechanism to perform language understanding and text generation. In this work, we study learning a 1-layer self-attention model from a set of prompts and associated output data sampled from the model. We first establish a precise mapping between the self-attention mechanism and Markov models: Inputting a prompt to the model samples the output token according to a context-conditioned Markov chain (CCMC) which weights the transition matrix of a base Markov chain. Additionally, incorporating positional encoding results in position-dependent scaling of the transition probabilities. Building on this formalism, we develop identifiability/coverage conditions for the prompt distribution that guarantee consistent estimation and establish sample complexity guarantees under IID samples. Finally, we study the problem of learning from a single output trajectory generated from an initial prompt. We characterize an intriguing winner-takes-all phenomenon where the generative process implemented by self-attention collapses into sampling a limited subset of tokens due to its non-mixing nature. This provides a mathematical explanation for the tendency of modern LLMs to generate repetitive text. In summary, the equivalence to CCMC provides a simple but powerful framework to study self-attention and its properties.

Submitted 20 February, 2024; originally announced February 2024.

Comments: 30 pages
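
The idea of a context-conditioned Markov chain from the abstract above can be sketched as follows. The abstract does not state the exact weighting, so this sketch assumes that the base transition row of the last token is reweighted by how often each token occurs in the prompt and then renormalized. That weighting rule, the function names, and the example matrix are all illustrative assumptions, not the paper's definitions.

```python
import random
from collections import Counter


def ccmc_next_token_distribution(base_transition, prompt):
    """Next-token distribution under an assumed context weighting.

    base_transition[i][j] is the base chain's probability of moving from
    token i to token j. Assumption: each entry of the last token's row is
    multiplied by the count of token j in the prompt, then renormalized.
    """
    last = prompt[-1]
    counts = Counter(prompt)
    weights = [
        base_transition[last][j] * counts.get(j, 0)
        for j in range(len(base_transition))
    ]
    total = sum(weights)
    if total == 0:
        raise ValueError("no token in the prompt has positive base probability")
    return [w / total for w in weights]


def generate(base_transition, prompt, steps, seed=0):
    """Extend the prompt by sampling repeatedly; each new token joins the context."""
    rng = random.Random(seed)
    sequence = list(prompt)
    for _ in range(steps):
        probs = ccmc_next_token_distribution(base_transition, sequence)
        next_token = rng.choices(range(len(probs)), weights=probs)[0]
        sequence.append(next_token)
    return sequence


base = [
    [0.5, 0.3, 0.2],
    [0.2, 0.5, 0.3],
    [0.3, 0.2, 0.5],
]
probs = ccmc_next_token_distribution(base, [0, 1, 1])
trajectory = generate(base, [0, 1, 1], steps=20)
```

Under this assumed weighting, a token that never appears in the context (token 2 here) has probability zero at every step, and tokens that are sampled more often gain weight in later steps. This gives some intuition for how a single trajectory could concentrate on a subset of tokens, though it is not a reproduction of the paper's analysis.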

arXiv:2301.07067 [pdf, other]

Transformers as Algorithms: Generalization and Stability in In-context Learning

Authors: Yingcong Li, M. Emrullah Ildiz, Dimitris Papailiopoulos, Samet Oymak

Abstract: In-context learning (ICL) is a type of prompting where a transformer model operates on a sequence of (input, output) examples and performs inference on-the-fly. In this work, we formalize in-context learning as an algorithm learning problem where a transformer model implicitly constructs a hypothesis function at inference-time. We first explore the statistical aspects of this abstraction through the lens of multitask learning: We obtain generalization bounds for ICL when the input prompt is (1) a sequence of i.i.d. (input, label) pairs or (2) a trajectory arising from a dynamical system. The crux of our analysis is relating the excess risk to the stability of the algorithm implemented by the transformer. We characterize when transformer/attention architecture provably obeys the stability condition and also provide empirical verification. For generalization on unseen tasks, we identify an inductive bias phenomenon in which the transfer learning risk is governed by the task complexity and the number of MTL tasks in a highly predictable manner. Finally, we provide numerical evaluations that (1) demonstrate transformers can indeed implement near-optimal algorithms on classical regression problems with i.i.d. and dynamic data, (2) provide insights on stability, and (3) verify our theoretical predictions.

Submitted 6 February, 2023; v1 submitted 17 January, 2023; originally announced January 2023.

Comments: Revised version significantly improves the stability guarantees and provides new experiments

arXiv:2111.02309 [pdf, other]

Pull or Wait: How to Optimize Query Age of Information

Authors: M. Emrullah Ildiz, Orhan T. Yavascan, Elif Uysal, O. Tugberk Kartal

Abstract: We study a pull-based status update communication model where a source node submits update packets to a channel with random transmission delay, at times requested by a remote destination node. The objective is to minimize the average query-age-of-information (QAoI), defined as the average age-of-information (AoI) measured at query instants that occur at the destination side according to a stochastic arrival process. In reference to a push-based problem formulation defined in the literature where the source decides to "update or wait" at will, with the objective of minimizing the time average AoI at the destination, we name this problem the Pull-or-Wait (PoW) problem. We provide a comparison of the two formulations: (i) Under Poisson query arrivals, an optimal policy that minimizes the time average AoI also minimizes the average QAoI, and these minimum values are equal; and (ii) the optimal average QAoI under periodic query arrivals is always less than or equal to the optimal time average AoI. We identify the PoW problem in the case of a single query as a stochastic shortest path (SSP) problem with uncountable state and action spaces, which has not been solved in previous literature. We derive an optimal solution for this SSP problem and use it as a building block for the solution of the PoW problem under periodic query arrivals.

Submitted 4 November, 2021; v1 submitted 3 November, 2021; originally announced November 2021.
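
The QAoI definition in the abstract above (AoI averaged over query instants) can be made concrete with a minimal sketch. It assumes the standard reading of age-of-information as the time elapsed since the generation of the freshest update delivered so far. The function names and the choice to skip queries that arrive before any delivery are illustrative assumptions, not taken from the paper.

```python
def age_at(t, updates):
    """Age of information at time t.

    updates is a list of (generation_time, delivery_time) pairs.
    Returns None if nothing has been delivered by time t.
    """
    delivered = [gen for gen, deliv in updates if deliv <= t]
    if not delivered:
        return None
    return t - max(delivered)


def average_query_aoi(query_times, updates):
    """Average the age over query instants (queries before any delivery are skipped)."""
    ages = [age_at(q, updates) for q in query_times]
    ages = [a for a in ages if a is not None]
    if not ages:
        raise ValueError("no query arrives after a delivery")
    return sum(ages) / len(ages)


# Two updates: generated at t=0 and t=3, delivered at t=1 and t=4.
updates = [(0, 1), (3, 4)]
qaoi = average_query_aoi([2, 5], updates)
```

In this example the query at t=2 sees the update generated at t=0 (age 2), and the query at t=5 sees the update generated at t=3 (age 2), so the average over query instants is 2.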

arXiv:1805.10704 [pdf]

Synergistic Reconstruction and Synthesis via Generative Adversarial Networks for Accelerated Multi-Contrast MRI

Authors: Salman Ul Hassan Dar, Mahmut Yurt, Mohammad Shahdloo, Muhammed Emrullah Ildız, Tolga Çukur

Abstract: Multi-contrast MRI acquisitions of an anatomy enrich the magnitude of information available for diagnosis. Yet, excessive scan times associated with additional contrasts may be a limiting factor. Two mainstream approaches for enhanced scan efficiency are reconstruction of undersampled acquisitions and synthesis of missing acquisitions. In reconstruction, performance decreases towards higher acceleration factors with diminished sampling density particularly at high-spatial-frequencies. In synthesis, the absence of data samples from the target contrast can lead to artefactual sensitivity or insensitivity to image features. Here we propose a new approach for synergistic reconstruction-synthesis of multi-contrast MRI based on conditional generative adversarial networks. The proposed method preserves high-frequency details of the target contrast by relying on the shared high-frequency information available from the source contrast, and prevents feature leakage or loss by relying on the undersampled acquisitions of the target contrast. Demonstrations on brain MRI datasets from healthy subjects and patients indicate the superior performance of the proposed method compared to previous state-of-the-art. The proposed method can help improve the quality and scan efficiency of multi-contrast MRI exams.

Submitted 27 May, 2018; originally announced May 2018.

Showing 1–5 of 5 results for author: Ildız, M E