-
FuseMax: Leveraging Extended Einsums to Optimize Attention Accelerator Design
Authors:
Nandeeka Nayak,
Xinrui Wu,
Toluwanimi O. Odemuyiwa,
Michael Pellauer,
Joel S. Emer,
Christopher W. Fletcher
Abstract:
Attention for transformers is a critical workload that has recently received significant "attention" as a target for custom acceleration. Yet, while prior work succeeds in reducing attention's memory-bandwidth requirements, it creates load imbalance between attention operators (resulting in severe compute under-utilization) and requires on-chip memory that scales with sequence length (which is expected to grow over time).
This paper ameliorates these issues, enabling attention with nearly 100% compute utilization, no off-chip memory traffic bottlenecks, and on-chip buffer size requirements that are independent of sequence length. The main conceptual contribution is to use a recently proposed abstraction -- the cascade of Einsums -- to describe, formalize and taxonomize the space of attention algorithms that appear in the literature. In particular, we show how Einsum cascades can be used to infer non-trivial lower bounds on the number of passes a kernel must take through its input data, which has implications for either required on-chip buffer capacity or memory traffic. We show how this notion can be used to meaningfully divide the space of attention algorithms into several categories and use these categories to inform our design process.
Based on the above characterization, we propose FuseMax -- a novel mapping of attention onto a spatial array-style architecture. On attention, in an iso-area comparison, FuseMax achieves an average $6.7\times$ speedup over the prior state-of-the-art FLAT while using $79\%$ of the energy. Similarly, on the full end-to-end transformer inference, FuseMax achieves an average $5.3\times$ speedup over FLAT using $83\%$ of the energy.
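As an illustration of what a "cascade of Einsums" for attention can look like, the following is a minimal sketch of the textbook, numerically stable softmax attention written as a sequence of Einsum-like steps in plain Python. It is not the FuseMax mapping or the paper's notation; the function name `attention_cascade` and the tensor names `QK`, `GM`, `SN`, `SD`, and `A` are assumptions chosen for this example. In this naive form, each row of `QK` is read more than once (for the maximum, for the exponentials and their sum, and for the division), which is the kind of multi-pass behavior the abstract refers to.

```python
import math


def attention_cascade(Q, K, V):
    """Naive softmax attention as a cascade of Einsum-like steps.

    Q is P x E, K is M x E, V is M x F (nested lists). Returns P x F.
    All intermediate names are hypothetical, chosen for illustration.
    """
    P, E = len(Q), len(Q[0])
    M, F = len(K), len(V[0])

    # Step 1: QK[p, m] = sum over e of Q[p, e] * K[m, e]
    QK = [[sum(Q[p][e] * K[m][e] for e in range(E)) for m in range(M)]
          for p in range(P)]

    # Step 2: GM[p] = max over m of QK[p, m]   (first pass over each row)
    GM = [max(QK[p]) for p in range(P)]

    # Step 3: SN[p, m] = exp(QK[p, m] - GM[p]) (second pass over each row)
    SN = [[math.exp(QK[p][m] - GM[p]) for m in range(M)] for p in range(P)]

    # Step 4: SD[p] = sum over m of SN[p, m]
    SD = [sum(SN[p]) for p in range(P)]

    # Step 5: A[p, m] = SN[p, m] / SD[p]       (third pass over each row)
    A = [[SN[p][m] / SD[p] for m in range(M)] for p in range(P)]

    # Step 6: AV[p, f] = sum over m of A[p, m] * V[m, f]
    return [[sum(A[p][m] * V[m][f] for m in range(M)) for f in range(F)]
            for p in range(P)]


# Usage: one query that scores both keys equally, so the output
# is the plain average of the two value rows.
out = attention_cascade([[0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[2.0], [4.0]])
```

Writing each step as a separate reduction or elementwise map makes the data dependencies between steps explicit, which is what allows reasoning about how many times an intermediate such as `QK` must be traversed.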
Submitted 25 June, 2024; v1 submitted 15 June, 2024;
originally announced June 2024.
-
The EDGE Language: Extended General Einsums for Graph Algorithms
Authors:
Toluwanimi O. Odemuyiwa,
Joel S. Emer,
John D. Owens
Abstract:
In this work, we propose a unified abstraction for graph algorithms: the Extended General Einsums language, or EDGE. The EDGE language expresses graph algorithms in the language of tensor algebra, providing a rigorous, succinct, and expressive mathematical framework. EDGE leverages two ideas: (1) the well-known foundations provided by the graph-matrix duality, where a graph is simply a 2D tensor, and (2) the power and expressivity of Einsum notation in the tensor algebra world. In this work, we describe our design goals for EDGE and walk through the extensions we add to Einsums to support more complex operations common in graph algorithms. Additionally, we provide a few examples of how to express graph algorithms in our proposed notation. We hope that a single, mathematical notation for graph algorithms will (1) allow researchers to more easily compare different algorithms and different implementations of a graph algorithm; (2) enable developers to factor complexity by separating the concerns of what to compute (described with the extended Einsum notation) from the lower level details of how to compute; and (3) enable the discovery of different algorithmic variants of a problem through algebraic manipulations and transformations on a given EDGE expression.
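To make the graph-matrix duality concrete, here is a minimal sketch, in plain Python rather than EDGE notation, of breadth-first search over a graph stored as a 2D adjacency tensor. Each frontier expansion is an Einsum-like contraction in which multiplication is replaced by logical AND and summation by logical OR. The function name `bfs_levels` and the dense nested-list representation are assumptions made for this example and do not come from the paper.

```python
def bfs_levels(G, source):
    """Breadth-first levels on a graph given as a dense adjacency matrix.

    G[s][d] is truthy when there is an edge s -> d. Returns a list whose
    entry v is the BFS depth of vertex v, or -1 if v is unreachable.
    """
    n = len(G)
    level = [-1] * n
    frontier = [v == source for v in range(n)]
    depth = 0
    while any(frontier):
        # Record the depth of every vertex in the current frontier.
        for v in range(n):
            if frontier[v]:
                level[v] = depth
        # Einsum-like step over the (OR, AND) pair:
        #   next[d] = OR over s of (frontier[s] AND G[s][d]),
        # masked so that already-visited vertices are not revisited.
        frontier = [
            level[d] == -1 and any(frontier[s] and G[s][d] for s in range(n))
            for d in range(n)
        ]
        depth += 1
    return level


# Usage: edges 0->1, 0->2, 1->3.
G = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
levels = bfs_levels(G, 0)
```

Swapping the (OR, AND) pair for (min, +) over edge weights would turn the same contraction into a relaxation step for shortest paths, which hints at why a single tensor-algebra notation can cover several graph algorithms.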
Submitted 17 April, 2024;
originally announced April 2024.
-
TeAAL: A Declarative Framework for Modeling Sparse Tensor Accelerators
Authors:
Nandeeka Nayak,
Toluwanimi O. Odemuyiwa,
Shubham Ugare,
Christopher W. Fletcher,
Michael Pellauer,
Joel S. Emer
Abstract:
Over the past few years, the explosion in sparse tensor algebra workloads has led to a corresponding rise in domain-specific accelerators to service them. Due to the irregularity present in sparse tensors, these accelerators employ a wide variety of novel solutions to achieve good performance. At the same time, prior work on design-flexible sparse accelerator modeling does not express this full range of design features, making it difficult to understand the impact of each design choice and compare or extend the state-of-the-art.
To address this, we propose TeAAL: a language and simulator generator for the concise and precise specification and evaluation of sparse tensor algebra accelerators. We use TeAAL to represent and evaluate four disparate state-of-the-art accelerators -- ExTensor, Gamma, OuterSPACE, and SIGMA -- and verify that it reproduces their performance with high accuracy. Finally, we demonstrate the potential of TeAAL as a tool for designing new accelerators by showing how it can be used to speed up vertex-centric programming accelerators -- achieving $1.9\times$ on BFS and $1.2\times$ on SSSP over GraphDynS.
Submitted 11 June, 2024; v1 submitted 16 April, 2023;
originally announced April 2023.