Search | arXiv e-print repository

The Continuous Tensor Abstraction: Where Indices are Real

Authors: Jaeyeon Won, Willow Ahrens, Joel S. Emer, Saman Amarasinghe

Abstract: This paper introduces the continuous tensor abstraction, allowing indices to take real-number values (e.g., A[3.14]), and provides a continuous loop construct that iterates over the infinitely large set of real numbers. This paper expands the existing tensor abstraction to include continuous tensors that exhibit a piecewise-constant property, enabling the transformation of an infinite amount of co… ▽ More This paper introduces the continuous tensor abstraction, allowing indices to take real-number values (e.g., A[3.14]), and provides a continuous loop construct that iterates over the infinitely large set of real numbers. This paper expands the existing tensor abstraction to include continuous tensors that exhibit a piecewise-constant property, enabling the transformation of an infinite amount of computation into a finite amount. Additionally, we present a new tensor format abstraction for storing continuous tensors and a code generation technique that automatically generates kernels for the continuous tensor abstraction. Our approach introduces a novel method for loop-level reasoning in domains like computational geometry and computer graphics, traditionally unexplored in tensor programming models. Our approach demonstrates comparable performance to hand-optimized kernels in leading libraries across diverse applications. Compared to hand-implemented libraries on a CPU, our compiler-based implementation achieves an average speedup of 9.20x on 2D radius search with ~100x fewer lines of code (LoC), 1.22x on genomic interval overlap** queries (with ~26x LoC saving), and 1.69x on trilinear interpolation in Neural Radiance Field (with ~9x LoC saving). △ Less

Submitted 1 July, 2024; originally announced July 2024.

arXiv:2406.10491 [pdf, other]

FuseMax: Leveraging Extended Einsums to Optimize Attention Accelerator Design

Authors: Nandeeka Nayak, Xinrui Wu, Toluwanimi O. Odemuyiwa, Michael Pellauer, Joel S. Emer, Christopher W. Fletcher

Abstract: Attention for transformers is a critical workload that has recently received significant "attention" as a target for custom acceleration. Yet, while prior work succeeds in reducing attention's memory-bandwidth requirements, it creates load imbalance between attention operators (resulting in severe compute under-utilization) and requires on-chip memory that scales with sequence length (which is exp… ▽ More Attention for transformers is a critical workload that has recently received significant "attention" as a target for custom acceleration. Yet, while prior work succeeds in reducing attention's memory-bandwidth requirements, it creates load imbalance between attention operators (resulting in severe compute under-utilization) and requires on-chip memory that scales with sequence length (which is expected to grow over time). This paper ameliorates these issues, enabling attention with nearly 100% compute utilization, no off-chip memory traffic bottlenecks, and on-chip buffer size requirements that are independent of sequence length. The main conceptual contribution is to use a recently proposed abstraction -- the cascade of Einsums -- to describe, formalize and taxonomize the space of attention algorithms that appear in the literature. In particular, we show how Einsum cascades can be used to infer non-trivial lower bounds on the number of passes a kernel must take through its input data, which has implications for either required on-chip buffer capacity or memory traffic. We show how this notion can be used to meaningfully divide the space of attention algorithms into several categories and use these categories to inform our design process. Based on the above characterization, we propose FuseMax -- a novel map** of attention onto a spatial array-style architecture. On attention, in an iso-area comparison, FuseMax achieves an average $6.7\times$ speedup over the prior state-of-the-art FLAT while using $79\%$ of the energy. Similarly, on the full end-to-end transformer inference, FuseMax achieves an average $5.3\times$ speedup over FLAT using $83\%$ of the energy. △ Less

Submitted 25 June, 2024; v1 submitted 15 June, 2024; originally announced June 2024.

Comments: 15 pages, 10 figures

arXiv:2405.07266 [pdf, other]

Architecture-Level Modeling of Photonic Deep Neural Network Accelerators

Authors: Tanner Andrulis, Gohar Irfan Chaudhry, Vinith M. Suriyakumar, Joel S. Emer, Vivienne Sze

Abstract: Photonics is a promising technology to accelerate Deep Neural Networks as it can use optical interconnects to reduce data movement energy and it enables low-energy, high-throughput optical-analog computations. To realize these benefits in a full system (accelerator + DRAM), designers must ensure that the benefits of using the electrical, optical, analog, and digital domains exceed the costs of c… ▽ More Photonics is a promising technology to accelerate Deep Neural Networks as it can use optical interconnects to reduce data movement energy and it enables low-energy, high-throughput optical-analog computations. To realize these benefits in a full system (accelerator + DRAM), designers must ensure that the benefits of using the electrical, optical, analog, and digital domains exceed the costs of converting data between domains. Designers must also consider system-level energy costs such as data fetch from DRAM. Converting data and accessing DRAM can consume significant energy, so to evaluate and explore the photonic system space, there is a need for a tool that can model these full-system considerations. In this work, we show that similarities between Compute-in-Memory (CiM) and photonics let us use CiM system modeling tools to accurately model photonics systems. Bringing modeling tools to photonics enables evaluation of photonic research in a full-system context, rapid design space exploration, co-design, and comparison between systems. Using our open-source model, we show that cross-domain conversion and DRAM can consume a significant portion of photonic system energy. We then demonstrate optimizations that reduce conversions and DRAM accesses to improve photonic system energy efficiency by up to 3x. △ Less

Submitted 14 May, 2024; v1 submitted 12 May, 2024; originally announced May 2024.

Comments: Published at ISPASS 2024

arXiv:2405.07259 [pdf, other]

CiMLoop: A Flexible, Accurate, and Fast Compute-In-Memory Modeling Tool

Authors: Tanner Andrulis, Joel S. Emer, Vivienne Sze

Abstract: Compute-In-Memory (CiM) is a promising solution to accelerate Deep Neural Networks (DNNs) as it can avoid energy-intensive DNN weight movement and use memory arrays to perform low-energy, high-density computations. These benefits have inspired research across the CiM stack, but CiM research often focuses on only one level of the stack (i.e., devices, circuits, architecture, workload, or map**) o… ▽ More Compute-In-Memory (CiM) is a promising solution to accelerate Deep Neural Networks (DNNs) as it can avoid energy-intensive DNN weight movement and use memory arrays to perform low-energy, high-density computations. These benefits have inspired research across the CiM stack, but CiM research often focuses on only one level of the stack (i.e., devices, circuits, architecture, workload, or map**) or only one design point (e.g., one fabricated chip). There is a need for a full-stack modeling tool to evaluate design decisions in the context of full systems (e.g., see how a circuit impacts system energy) and to perform rapid early-stage exploration of the CiM co-design space. To address this need, we propose CiMLoop: an open-source tool to model diverse CiM systems and explore decisions across the CiM stack. CiMLoop introduces (1) a flexible specification that lets users describe, model, and map workloads to both circuits and architecture, (2) an accurate energy model that captures the interaction between DNN operand values, hardware data representations, and analog/digital values propagated by circuits, and (3) a fast statistical model that can explore the design space orders-of-magnitude more quickly than other high-accuracy models. Using CiMLoop, researchers can evaluate design choices at different levels of the CiM stack, co-design across all levels, fairly compare different implementations, and rapidly explore the design space. △ Less

Submitted 29 May, 2024; v1 submitted 12 May, 2024; originally announced May 2024.

Comments: Available at https://github.com/mit-emze/cimloop. Published in ISPASS 2024

arXiv:2404.11591 [pdf, other]

The EDGE Language: Extended General Einsums for Graph Algorithms

Authors: Toluwanimi O. Odemuyiwa, Joel S. Emer, John D. Owens

Abstract: In this work, we propose a unified abstraction for graph algorithms: the Extended General Einsums language, or EDGE. The EDGE language expresses graph algorithms in the language of tensor algebra, providing a rigorous, succinct, and expressive mathematical framework. EDGE leverages two ideas: (1) the well-known foundations provided by the graph-matrix duality, where a graph is simply a 2D tensor,… ▽ More In this work, we propose a unified abstraction for graph algorithms: the Extended General Einsums language, or EDGE. The EDGE language expresses graph algorithms in the language of tensor algebra, providing a rigorous, succinct, and expressive mathematical framework. EDGE leverages two ideas: (1) the well-known foundations provided by the graph-matrix duality, where a graph is simply a 2D tensor, and (2) the power and expressivity of Einsum notation in the tensor algebra world. In this work, we describe our design goals for EDGE and walk through the extensions we add to Einsums to support more complex operations common in graph algorithms. Additionally, we provide a few examples of how to express graph algorithms in our proposed notation. We hope that a single, mathematical notation for graph algorithms will (1) allow researchers to more easily compare different algorithms and different implementations of a graph algorithm; (2) enable developers to factor complexity by separating the concerns of what to compute (described with the extended Einsum notation) from the lower level details of how to compute; and (3) enable the discovery of different algorithmic variants of a problem through algebraic manipulations and transformations on a given EDGE expression. △ Less

Submitted 17 April, 2024; originally announced April 2024.

Comments: 79 pages, 14 figures

arXiv:2404.06553 [pdf, other]

Modeling Analog-Digital-Converter Energy and Area for Compute-In-Memory Accelerator Design

Authors: Tanner Andrulis, Ruicong Chen, Hae-Seung Lee, Joel S. Emer, Vivienne Sze

Abstract: Analog Compute-in-Memory (CiM) accelerators use analog-digital converters (ADCs) to read the analog values that they compute. ADCs can consume significant energy and area, so architecture-level ADC decisions such as ADC resolution or number of ADCs can significantly impact overall CiM accelerator energy and area. Therefore, modeling how architecture-level decisions affect ADC energy and area is cr… ▽ More Analog Compute-in-Memory (CiM) accelerators use analog-digital converters (ADCs) to read the analog values that they compute. ADCs can consume significant energy and area, so architecture-level ADC decisions such as ADC resolution or number of ADCs can significantly impact overall CiM accelerator energy and area. Therefore, modeling how architecture-level decisions affect ADC energy and area is critical for performing architecture-level design space exploration of CiM accelerators. This work presents an open-source architecture-level model to estimate ADC energy and area. To enable fast design space exploration, the model uses only architecture-level attributes while abstracting circuit-level details. Our model enables researchers to quickly and easily model key architecture-level tradeoffs in accelerators that use ADCs. △ Less

Submitted 14 May, 2024; v1 submitted 9 April, 2024; originally announced April 2024.

arXiv:2310.00192 [pdf, other]

doi 10.1145/3613424.3623793

Tailors: Accelerating Sparse Tensor Algebra by Overbooking Buffer Capacity

Authors: Zi Yu Xue, Yannan Nellie Wu, Joel S. Emer, Vivienne Sze

Abstract: Sparse tensor algebra is a challenging class of workloads to accelerate due to low arithmetic intensity and varying sparsity patterns. Prior sparse tensor algebra accelerators have explored tiling sparse data to increase exploitable data reuse and improve throughput, but typically allocate tile size in a given buffer for the worst-case data occupancy. This severely limits the utilization of availa… ▽ More Sparse tensor algebra is a challenging class of workloads to accelerate due to low arithmetic intensity and varying sparsity patterns. Prior sparse tensor algebra accelerators have explored tiling sparse data to increase exploitable data reuse and improve throughput, but typically allocate tile size in a given buffer for the worst-case data occupancy. This severely limits the utilization of available memory resources and reduces data reuse. Other accelerators employ complex tiling during preprocessing or at runtime to determine the exact tile size based on its occupancy. This paper proposes a speculative tensor tiling approach, called overbooking, to improve buffer utilization by taking advantage of the distribution of nonzero elements in sparse tensors to construct larger tiles with greater data reuse. To ensure correctness, we propose a low-overhead hardware mechanism, Tailors, that can tolerate data overflow by design while ensuring reasonable data reuse. We demonstrate that Tailors can be easily integrated into the memory hierarchy of an existing sparse tensor algebra accelerator. To ensure high buffer utilization with minimal tiling overhead, we introduce a statistical approach, Swiftiles, to pick a tile size so that tiles usually fit within the buffer's capacity, but can potentially overflow, i.e., it overbooks the buffers. Across a suite of 22 sparse tensor algebra workloads, we show that our proposed overbooking strategy introduces an average speedup of $52.7\times$ and $2.3\times$ and an average energy reduction of $22.5\times$ and $2.5\times$ over ExTensor without and with optimized tiling, respectively. △ Less

Submitted 26 June, 2024; v1 submitted 29 September, 2023; originally announced October 2023.

Comments: 17 pages, 13 figures, in MICRO 2023

Journal ref: 56th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO '23), 2023

arXiv:2309.04119 [pdf, other]

Penetrating Shields: A Systematic Analysis of Memory Corruption Mitigations in the Spectre Era

Authors: Weon Taek Na, Joel S. Emer, Mengjia Yan

Abstract: This paper provides the first systematic analysis of a synergistic threat model encompassing memory corruption vulnerabilities and microarchitectural side-channel vulnerabilities. We study speculative shield bypass attacks that leverage speculative execution attacks to leak secrets that are critical to the security of memory corruption mitigations (i.e., the shields), and then use the leaked secre… ▽ More This paper provides the first systematic analysis of a synergistic threat model encompassing memory corruption vulnerabilities and microarchitectural side-channel vulnerabilities. We study speculative shield bypass attacks that leverage speculative execution attacks to leak secrets that are critical to the security of memory corruption mitigations (i.e., the shields), and then use the leaked secrets to bypass the mitigation mechanisms and successfully conduct memory corruption exploits, such as control-flow hijacking. We start by systematizing a taxonomy of the state-of-the-art memory corruption mitigations focusing on hardware-software co-design solutions. The taxonomy helps us to identify 10 likely vulnerable defense schemes out of 20 schemes that we analyze. Next, we develop a graph-based model to analyze the 10 likely vulnerable defenses and reason about possible countermeasures. Finally, we present three proof-of-concept attacks targeting an already-deployed mitigation mechanism and two state-of-the-art academic proposals. △ Less

Submitted 8 September, 2023; originally announced September 2023.

Comments: 14 pages

ACM Class: K.6.5; D.4.6; C.2.0

arXiv:2305.12718 [pdf, other]

doi 10.1145/3613424.3623786

HighLight: Efficient and Flexible DNN Acceleration with Hierarchical Structured Sparsity

Authors: Yannan Nellie Wu, Po-An Tsai, Saurav Muralidharan, Angshuman Parashar, Vivienne Sze, Joel S. Emer

Abstract: Due to complex interactions among various deep neural network (DNN) optimization techniques, modern DNNs can have weights and activations that are dense or sparse with diverse sparsity degrees. To offer a good trade-off between accuracy and hardware performance, an ideal DNN accelerator should have high flexibility to efficiently translate DNN sparsity into reductions in energy and/or latency with… ▽ More Due to complex interactions among various deep neural network (DNN) optimization techniques, modern DNNs can have weights and activations that are dense or sparse with diverse sparsity degrees. To offer a good trade-off between accuracy and hardware performance, an ideal DNN accelerator should have high flexibility to efficiently translate DNN sparsity into reductions in energy and/or latency without incurring significant complexity overhead. This paper introduces hierarchical structured sparsity (HSS), with the key insight that we can systematically represent diverse sparsity degrees by having them hierarchically composed from multiple simple sparsity patterns. As a result, HSS simplifies the underlying hardware since it only needs to support simple sparsity patterns; this significantly reduces the sparsity acceleration overhead, which improves efficiency. Motivated by such opportunities, we propose a simultaneously efficient and flexible accelerator, named HighLight, to accelerate DNNs that have diverse sparsity degrees (including dense). Due to the flexibility of HSS, different HSS patterns can be introduced to DNNs to meet different applications' accuracy requirements. Compared to existing works, HighLight achieves a geomean of up to 6.4x better energy-delay product (EDP) across workloads with diverse sparsity degrees, and always sits on the EDP-accuracy Pareto frontier for representative DNNs △ Less

Submitted 1 October, 2023; v1 submitted 22 May, 2023; originally announced May 2023.

Comments: Accepted to MICRO23

arXiv:2304.07935 [pdf, other]

doi 10.1145/3579371.3589062

RAELLA: Reforming the Arithmetic for Efficient, Low-Resolution, and Low-Loss Analog PIM: No Retraining Required!

Authors: Tanner Andrulis, Joel S. Emer, Vivienne Sze

Abstract: Processing-In-Memory (PIM) accelerators have the potential to efficiently run Deep Neural Network (DNN) inference by reducing costly data movement and by using resistive RAM (ReRAM) for efficient analog compute. Unfortunately, overall PIM accelerator efficiency is limited by energy-intensive analog-to-digital converters (ADCs). Furthermore, existing accelerators that reduce ADC cost do so by chang… ▽ More Processing-In-Memory (PIM) accelerators have the potential to efficiently run Deep Neural Network (DNN) inference by reducing costly data movement and by using resistive RAM (ReRAM) for efficient analog compute. Unfortunately, overall PIM accelerator efficiency is limited by energy-intensive analog-to-digital converters (ADCs). Furthermore, existing accelerators that reduce ADC cost do so by changing DNN weights or by using low-resolution ADCs that reduce output fidelity. These strategies harm DNN accuracy and/or require costly DNN retraining to compensate. To address these issues, we propose the RAELLA architecture. RAELLA adapts the architecture to each DNN; it lowers the resolution of computed analog values by encoding weights to produce near-zero analog values, adaptively slicing weights for each DNN layer, and dynamically slicing inputs through speculation and recovery. Low-resolution analog values allow RAELLA to both use efficient low-resolution ADCs and maintain accuracy without retraining, all while computing with fewer ADC converts. Compared to other low-accuracy-loss PIM accelerators, RAELLA increases energy efficiency by up to 4.9$\times$ and throughput by up to 3.3$\times$. Compared to PIM accelerators that cause accuracy loss and retrain DNNs to recover, RAELLA achieves similar efficiency and throughput without expensive DNN retraining. △ Less

Submitted 16 April, 2023; originally announced April 2023.

Comments: 16 pages; 15 figures; Accepted at ISCA 2023 (the International Symposium on Computer Architecture)

ACM Class: C.1.3

arXiv:2304.07931 [pdf, other]

doi 10.1145/3613424.3623791

TeAAL: A Declarative Framework for Modeling Sparse Tensor Accelerators

Authors: Nandeeka Nayak, Toluwanimi O. Odemuyiwa, Shubham Ugare, Christopher W. Fletcher, Michael Pellauer, Joel S. Emer

Abstract: Over the past few years, the explosion in sparse tensor algebra workloads has led to a corresponding rise in domain-specific accelerators to service them. Due to the irregularity present in sparse tensors, these accelerators employ a wide variety of novel solutions to achieve good performance. At the same time, prior work on design-flexible sparse accelerator modeling does not express this full ra… ▽ More Over the past few years, the explosion in sparse tensor algebra workloads has led to a corresponding rise in domain-specific accelerators to service them. Due to the irregularity present in sparse tensors, these accelerators employ a wide variety of novel solutions to achieve good performance. At the same time, prior work on design-flexible sparse accelerator modeling does not express this full range of design features, making it difficult to understand the impact of each design choice and compare or extend the state-of-the-art. To address this, we propose TeAAL: a language and simulator generator for the concise and precise specification and evaluation of sparse tensor algebra accelerators. We use TeAAL to represent and evaluate four disparate state-of-the-art accelerators -- ExTensor, Gamma, OuterSPACE, and SIGMA -- and verify that it reproduces their performance with high accuracy. Finally, we demonstrate the potential of TeAAL as a tool for designing new accelerators by showing how it can be used to speed up vertex-centric programming accelerators -- achieving $1.9\times$ on BFS and $1.2\times$ on SSSP over GraphDynS. △ Less

Submitted 11 June, 2024; v1 submitted 16 April, 2023; originally announced April 2023.

Comments: 17 pages, 13 figures

arXiv:2205.05826 [pdf, other]

Sparseloop: An Analytical Approach To Sparse Tensor Accelerator Modeling

Authors: Yannan Nellie Wu, Po-An Tsai, Angshuman Parashar, Vivienne Sze, Joel S. Emer

Abstract: In recent years, many accelerators have been proposed to efficiently process sparse tensor algebra applications (e.g., sparse neural networks). However, these proposals are single points in a large and diverse design space. The lack of systematic description and modeling support for these sparse tensor accelerators impedes hardware designers from efficient and effective design space exploration. T… ▽ More In recent years, many accelerators have been proposed to efficiently process sparse tensor algebra applications (e.g., sparse neural networks). However, these proposals are single points in a large and diverse design space. The lack of systematic description and modeling support for these sparse tensor accelerators impedes hardware designers from efficient and effective design space exploration. This paper first presents a unified taxonomy to systematically describe the diverse sparse tensor accelerator design space. Based on the proposed taxonomy, it then introduces Sparseloop, the first fast, accurate, and flexible analytical modeling framework to enable early-stage evaluation and exploration of sparse tensor accelerators. Sparseloop comprehends a large set of architecture specifications, including various dataflows and sparse acceleration features (e.g., elimination of zero-based compute). Using these specifications, Sparseloop evaluates a design's processing speed and energy efficiency while accounting for data movement and compute incurred by the employed dataflow as well as the savings and overhead introduced by the sparse acceleration features using stochastic tensor density models. Across representative accelerators and workloads, Sparseloop achieves over 2000 times faster modeling speed than cycle-level simulations, maintains relative performance trends, and achieves 0.1% to 8% average error. With a case study, we demonstrate Sparseloop's ability to help reveal important insights for designing sparse tensor accelerators (e.g., it is important to co-design orthogonal design aspects). △ Less

Submitted 9 January, 2023; v1 submitted 11 May, 2022; originally announced May 2022.

Comments: Update website link, update UOP format description

Showing 1–12 of 12 results for author: Emer, J S