-
SatIn: Hardware for Boolean Satisfiability Inference
Authors:
Chenzhuo Zhu,
Alexander C. Rucker,
Yawen Wang,
William J. Dally
Abstract:
This paper describes SatIn, a hardware accelerator for determining boolean satisfiability (SAT) -- an important problem in many domains including verification, security analysis, and planning.
SatIn is based on a distributed associative array which performs short, atomic operations that can be composed into high level operations.
To overcome scaling limitations imposed by wire delay, we extended the algorithms used in software solvers to function efficiently on a distributed set of nodes communicating with message passing.
A cycle-level simulation on real benchmarks shows that SatIn achieves an average 72x speedup against Glucose, the winner of the 2016 SAT competition, with the potential to achieve a 113x speedup using two contexts.
To quantify SatIn's physical requirements, we placed and routed a single clause using the Synopsys 32nm educational development kit.
We were able to meet a 1ns cycle constraint with our target clause fitting in 4867um^2 and consuming 63.8uW of dynamic power; with a network, this corresponds to 100k clauses consuming 8.3W of dynamic power (not including leakage or global clock power) in a 500mm^2 32nm chip.
Submitted 5 March, 2023;
originally announced March 2023.
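The scaling figures quoted in the SatIn abstract can be sanity-checked with simple arithmetic. The sketch below multiplies the per-clause area and dynamic power by 100k clauses and compares the result with the stated chip totals. Attributing the remainder to the network is an assumption drawn from the phrase "with a network"; the abstract gives no breakdown, and all names in the code are illustrative.

```python
# Back-of-the-envelope check of the figures quoted in the SatIn abstract.
# Constants are taken from the abstract; variable names are illustrative.
CLAUSE_AREA_UM2 = 4867.0   # area of one placed-and-routed clause, in um^2
CLAUSE_POWER_UW = 63.8     # dynamic power of one clause, in uW
NUM_CLAUSES = 100_000      # clause count quoted for the full chip
CHIP_AREA_MM2 = 500.0      # quoted chip area, in mm^2
CHIP_POWER_W = 8.3         # quoted dynamic power including the network, in W


def clause_array_area_mm2(num_clauses: int = NUM_CLAUSES) -> float:
    """Total clause area in mm^2 (1 mm^2 = 1e6 um^2)."""
    return num_clauses * CLAUSE_AREA_UM2 / 1e6


def clause_array_power_w(num_clauses: int = NUM_CLAUSES) -> float:
    """Total clause dynamic power in W (1 W = 1e6 uW)."""
    return num_clauses * CLAUSE_POWER_UW / 1e6


area_mm2 = clause_array_area_mm2()
power_w = clause_array_power_w()

# Assumption: whatever is left over is the network and other overheads.
remaining_area_mm2 = CHIP_AREA_MM2 - area_mm2
remaining_power_w = CHIP_POWER_W - power_w

print(f"clause area:  {area_mm2:.1f} mm^2 of {CHIP_AREA_MM2:.0f} mm^2")
print(f"clause power: {power_w:.2f} W of {CHIP_POWER_W:.1f} W")
```

Under these numbers the clauses alone account for roughly 487 mm^2 and 6.4 W, which is consistent with the quoted 500 mm^2 and 8.3 W once a network is added.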
-
Revet: A Language and Compiler for Dataflow Threads
Authors:
Alexander Rucker,
Shiv Sundram,
Coleman Smith,
Matthew Vilim,
Raghu Prabhakar,
Fredrik Kjolstad,
Kunle Olukotun
Abstract:
Spatial dataflow architectures such as reconfigurable dataflow accelerators (RDA) can provide much higher performance and efficiency than CPUs and GPUs. In particular, vectorized reconfigurable dataflow accelerators (vRDA) in recent literature represent a design point that enhances the efficiency of dataflow architectures with vectorization. Today, vRDAs can be exploited using either hardcoded kernels or MapReduce languages like Spatial, which cannot vectorize data-dependent control flow. In contrast, CPUs and GPUs can be programmed using general-purpose threaded abstractions.
The ideal combination would be the generality of a threaded programming model coupled with the efficient execution model of a vRDA. We introduce Revet: a programming model, compiler, and execution model that lets threaded applications run efficiently on vRDAs. The Revet programming language uses threads to support a broader range of applications than Spatial's parallel patterns, and our MLIR-based compiler lowers this language to a generic dataflow backend that operates on streaming tensors. Finally, we show that mapping threads to dataflow outperforms GPUs, the current state of the art for threaded accelerators, by 3.8x.
Submitted 30 January, 2024; v1 submitted 13 February, 2023;
originally announced February 2023.
-
Stardust: Compiling Sparse Tensor Algebra to a Reconfigurable Dataflow Architecture
Authors:
Olivia Hsu,
Alexander Rucker,
Tian Zhao,
Kunle Olukotun,
Fredrik Kjolstad
Abstract:
We introduce Stardust, a compiler that compiles sparse tensor algebra to reconfigurable dataflow architectures (RDAs). Stardust introduces new user-provided data representation and scheduling language constructs for mapping to resource-constrained accelerated architectures. Stardust uses the information provided by these constructs to determine on-chip memory placement and to lower to the Capstan RDA through a parallel-patterns rewrite system that targets the Spatial programming model. The Stardust compiler is implemented as a new compilation path inside the TACO open-source system. Using cycle-accurate simulation, we demonstrate that Stardust can generate more Capstan tensor operations than its authors had implemented and that it results in 138x better performance than generated CPU kernels and 41x better performance than generated GPU kernels.
Submitted 6 November, 2022;
originally announced November 2022.
-
Capstan: A Vector RDA for Sparsity
Authors:
Alexander Rucker,
Matthew Vilim,
Tian Zhao,
Yaqi Zhang,
Raghu Prabhakar,
Kunle Olukotun
Abstract:
This paper proposes Capstan: a scalable, parallel-patterns-based, reconfigurable dataflow accelerator (RDA) for sparse and dense tensor applications. Instead of designing for one application, we start with common sparse data formats, each of which supports multiple applications. Using a declarative programming model, Capstan supports application-independent sparse iteration and memory primitives that can be mapped to vectorized, high-performance hardware. We optimize random-access sparse memories with configurable out-of-order execution to increase SRAM random-access throughput from 32% to 80%.
For a variety of sparse applications, Capstan with DDR4 memory is 18x faster than a multi-core CPU baseline, while Capstan with HBM2 memory is 16x faster than an Nvidia V100 GPU. For sparse applications that can be mapped to Plasticine, a recent dense RDA, Capstan is 7.6x to 365x faster and only 16% larger.
Submitted 22 September, 2021; v1 submitted 26 April, 2021;
originally announced April 2021.
-
Reward Shaping for Human Learning via Inverse Reinforcement Learning
Authors:
Mark A. Rucker,
Layne T. Watson,
Matthew S. Gerber,
Laura E. Barnes
Abstract:
Humans are spectacular reinforcement learners, constantly learning from and adjusting to experience and feedback. Unfortunately, this doesn't necessarily mean humans are fast learners. When tasks are challenging, learning can become unacceptably slow. Fortunately, humans do not have to learn tabula rasa, and learning speed can be greatly increased with learning aids. In this work we validate a new type of learning aid -- reward shaping for humans via inverse reinforcement learning (IRL). The goal of this aid is to increase the speed with which humans can learn good policies for specific tasks. Furthermore, this approach complements alternative machine learning techniques such as safety features that try to prevent individuals from making poor decisions. To achieve our results, we first extend a well-known IRL algorithm via kernel methods. Afterwards, we conduct two human subjects experiments using an online game where players have limited time to learn a good policy. We show with statistical significance that players who receive our learning aid are able to approach desired policies more quickly than the control group.
Submitted 15 December, 2022; v1 submitted 25 February, 2020;
originally announced February 2020.
-
Taurus: A Data Plane Architecture for Per-Packet ML
Authors:
Tushar Swamy,
Alexander Rucker,
Muhammad Shahbaz,
Ishan Gaur,
Kunle Olukotun
Abstract:
Emerging applications -- cloud computing, the internet of things, and augmented/virtual reality -- demand responsive, secure, and scalable datacenter networks. These networks currently implement simple, per-packet, data-plane heuristics (e.g., ECMP and sketches) under a slow, millisecond-latency control plane that runs data-driven performance and security policies. However, to meet applications' service-level objectives (SLOs) in a modern data center, networks must bridge the gap between line-rate, per-packet execution and complex decision making.
In this work, we present the design and implementation of Taurus, a data plane for line-rate inference. Taurus adds custom hardware based on a flexible, parallel-patterns (MapReduce) abstraction to programmable network devices, such as switches and NICs; this new hardware uses pipelined SIMD parallelism to enable per-packet MapReduce operations (e.g., inference). Our evaluation of a Taurus switch ASIC -- supporting several real-world models -- shows that Taurus operates orders of magnitude faster than a server-based control plane while increasing area by 3.8% and latency for line-rate ML models by up to 221 ns. Furthermore, our Taurus FPGA prototype achieves full model accuracy and detects two orders of magnitude more events than a state-of-the-art control-plane anomaly-detection system.
Submitted 19 January, 2022; v1 submitted 12 February, 2020;
originally announced February 2020.
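The Taurus abstract describes per-packet inference as a MapReduce operation. As a rough illustration of that abstraction only, the sketch below expresses one neuron of a small model as a map (elementwise multiply) followed by a reduce (sum), applied to a single packet's feature vector. The function name, the feature values, and the ReLU activation are all assumptions made for illustration; none of them comes from the paper, and this is ordinary software, not the pipelined SIMD hardware the abstract refers to.

```python
from functools import reduce
from operator import add, mul


def map_reduce_neuron(features, weights, bias=0.0):
    """Hypothetical single neuron written as a map step and a reduce step."""
    # Map: multiply each packet feature by its weight, independently per lane.
    products = list(map(mul, features, weights))
    # Reduce: combine the per-lane products into one value, starting from bias.
    total = reduce(add, products, bias)
    # Assumed activation (ReLU); the abstract does not name one.
    return max(0.0, total)


# Illustrative per-packet features and weights (made-up values).
packet_features = [1.0, 2.0]
neuron_weights = [0.5, -1.0]
score = map_reduce_neuron(packet_features, neuron_weights, bias=2.0)
print(score)
```

Each map lane is independent, which is what makes this pattern a natural fit for SIMD execution; the reduce step is the only point where lanes are combined.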