-
FPGA Technology Mapping Using Sketch-Guided Program Synthesis
Authors:
Gus Henry Smith,
Ben Kushigian,
Vishal Canumalla,
Andrew Cheung,
Steven Lyubomirsky,
Sorawee Porncharoenwase,
René Just,
Gilbert Louis Bernstein,
Zachary Tatlock
Abstract:
FPGA technology mapping is the process of implementing a hardware design expressed in high-level HDL (hardware design language) code using the low-level, architecture-specific primitives of the target FPGA. As FPGAs become increasingly heterogeneous, achieving high performance requires hardware synthesis tools that better support mapping to complex, highly configurable primitives like digital signal processors (DSPs). Current tools support DSP mapping via handwritten special-case mapping rules, which are laborious to write, error-prone, and often overlook mapping opportunities. We introduce Lakeroad, a principled approach to technology mapping via sketch-guided program synthesis. Lakeroad leverages two techniques -- architecture-independent sketch templates and semantics extraction from HDL -- to provide extensible technology mapping with stronger correctness guarantees and higher coverage of mapping opportunities than state-of-the-art tools. Across representative microbenchmarks, Lakeroad produces 2--3.5$\times$ the number of optimal mappings compared to proprietary state-of-the-art tools and 6--44$\times$ the number of optimal mappings compared to popular open-source tools, while also providing correctness guarantees not given by any other tool.
Submitted 29 January, 2024;
originally announced January 2024.
-
Relax: Composable Abstractions for End-to-End Dynamic Machine Learning
Authors:
Ruihang Lai,
Junru Shao,
Siyuan Feng,
Steven S. Lyubomirsky,
Bohan Hou,
Wuwei Lin,
Zihao Ye,
Hongyi Jin,
Yuchen Jin,
Jiawei Liu,
Lesheng Jin,
Yaxing Cai,
Ziheng Jiang,
Yong Wu,
Sunghyun Park,
Prakalp Srivastava,
Jared G. Roesch,
Todd C. Mowry,
Tianqi Chen
Abstract:
Dynamic shape computations have become critical in modern machine learning workloads, especially in emerging large language models. The success of these models has driven demand for deploying them to a diverse set of backend environments. In this paper, we present Relax, a compiler abstraction for optimizing end-to-end dynamic machine learning workloads. Relax introduces first-class symbolic shape annotations to track dynamic shape computations globally across the program. It also introduces a cross-level abstraction that encapsulates computational graphs, loop-level tensor programs, and library calls in a single representation to enable cross-level optimizations. We build an end-to-end compilation framework using the proposed approach to optimize dynamic shape models. Experimental results on large language models show that Relax delivers performance competitive with state-of-the-art hand-optimized systems across platforms and enables deployment of emerging dynamic models to a broader set of environments, including mobile phones, embedded devices, and web browsers.
Submitted 1 November, 2023;
originally announced November 2023.
-
Application-Level Validation of Accelerator Designs Using a Formal Software/Hardware Interface
Authors:
Bo-Yuan Huang,
Steven Lyubomirsky,
Yi Li,
Mike He,
Gus Henry Smith,
Thierry Tambe,
Akash Gaonkar,
Vishal Canumalla,
Andrew Cheung,
Gu-Yeon Wei,
Aarti Gupta,
Zachary Tatlock,
Sharad Malik
Abstract:
Ideally, accelerator development should be as easy as software development. Several recent design languages/tools are working toward this goal, but actually testing early designs on real applications end-to-end remains prohibitively difficult due to the costs of building specialized compiler and simulator support. We propose a new first-in-class, mostly automated methodology termed "3LA" to enable end-to-end testing of prototype accelerator designs on unmodified source applications. A key contribution of 3LA is the use of a formal software/hardware interface that specifies an accelerator's operations and their semantics. Specifically, we leverage the Instruction-Level Abstraction (ILA) formal specification for accelerators that has been successfully used thus far for accelerator implementation verification. We show how the ILA for accelerators serves as a software/hardware interface, similar to the Instruction Set Architecture (ISA) for processors, that can be used for automated development of compilers and instruction-level simulators. Another key contribution of this work is to show how ILA-based accelerator semantics enables extending recent work on equality saturation to auto-generate basic compiler support for prototype accelerators in a technique we term "flexible matching." By combining flexible matching with simulators auto-generated from ILA specifications, our approach enables end-to-end evaluation with modest engineering effort. We detail several case studies of 3LA, which uncovered an unknown flaw in a recently published accelerator and facilitated its fix.
Submitted 22 August, 2023; v1 submitted 28 February, 2022;
originally announced March 2022.
-
Pure Tensor Program Rewriting via Access Patterns (Representation Pearl)
Authors:
Gus Henry Smith,
Andrew Liu,
Steven Lyubomirsky,
Scott Davidson,
Joseph McMahan,
Michael Taylor,
Luis Ceze,
Zachary Tatlock
Abstract:
Tensor kernels in machine learning (ML) often correspond to pure mathematical expressions, making term rewriting an attractive strategy for optimization and mapping to specialized hardware accelerators. However, existing ML intermediate representations (IRs) tend to either be \textit{pure but high-level}, making low-level rewrites to hardware targets inexpressible, or \textit{low-level but impure}, hampering the use of term rewriting altogether. This paper introduces Glenside, a pure IR whose core abstraction -- the \textit{access pattern} -- enables low-level, layout-aware, hardware-centric program rewrites.
We demonstrate how term rewriting in Glenside can be used to map program fragments to hardware accelerator invocations and automatically discover classic data layout transformations like \texttt{im2col}. Glenside establishes a new foundation for exploring further term rewriting techniques in optimizing low-level tensor programs.
Submitted 19 May, 2021;
originally announced May 2021.
-
Dynamic Tensor Rematerialization
Authors:
Marisa Kirisame,
Steven Lyubomirsky,
Altan Haan,
Jennifer Brennan,
Mike He,
Jared Roesch,
Tianqi Chen,
Zachary Tatlock
Abstract:
Checkpointing enables the training of deep learning models under restricted memory budgets by freeing intermediate activations from memory and recomputing them on demand. Current checkpointing techniques statically plan these recomputations offline and assume static computation graphs. We demonstrate that a simple online algorithm can achieve comparable performance by introducing Dynamic Tensor Rematerialization (DTR), a greedy online algorithm for checkpointing that is extensible and general, is parameterized by eviction policy, and supports dynamic models. We prove that DTR can train an $N$-layer linear feedforward network on an $\Omega(\sqrt{N})$ memory budget with only $\mathcal{O}(N)$ tensor operations. DTR closely matches the performance of optimal static checkpointing in simulated experiments. We incorporate a DTR prototype into PyTorch merely by interposing on tensor allocations and operator calls and collecting lightweight metadata on tensors.
Submitted 18 March, 2021; v1 submitted 16 June, 2020;
originally announced June 2020.
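The greedy online scheme described in the Dynamic Tensor Rematerialization abstract (evict a tensor when memory is full, recompute it on demand) can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the authors' implementation: the names `simulate_chain` and `lru_policy` are hypothetical, the model is a plain chain of equal-sized tensors, the backward pass is simplified to a reverse-order sweep over the activations, and least-recently-used eviction is a stand-in for whatever policy the paper actually evaluates.

```python
from typing import Callable, Dict, List, Set

# A policy picks one victim from the evictable tensors (hypothetical interface).
Policy = Callable[[List[int], Dict[int, int]], int]


def lru_policy(candidates: List[int], last_access: Dict[int, int]) -> int:
    """Stand-in eviction policy: evict the least recently used tensor."""
    return min(candidates, key=lambda t: last_access[t])


def simulate_chain(n_layers: int, budget: int, policy: Policy) -> int:
    """Count tensor computations for a chain t_0 -> t_1 -> ... -> t_n.

    The forward pass touches t_1..t_n in order; the simplified backward
    pass touches t_n..t_1 in reverse. At most `budget` tensors are
    resident at once, and the input t_0 is never evicted.
    """
    if budget < 3:
        raise ValueError("need room for the input, a parent, and its result")

    resident: Set[int] = {0}
    last_access: Dict[int, int] = {0: 0}
    clock = 0
    ops = 0

    def access(i: int) -> None:
        nonlocal clock, ops
        # Walk back to the nearest resident ancestor (t_0 is always resident).
        start = i
        while start not in resident:
            start -= 1
        # Rematerialize forward from that ancestor up to t_i.
        for k in range(start + 1, i + 1):
            while len(resident) >= budget:
                # Never evict the input or the parent needed for this step.
                candidates = [t for t in resident if t not in (0, k - 1)]
                resident.remove(policy(candidates, last_access))
            resident.add(k)
            ops += 1
            clock += 1
            last_access[k] = clock
        clock += 1
        last_access[i] = clock

    for i in range(1, n_layers + 1):   # forward pass
        access(i)
    for i in range(n_layers, 0, -1):   # simplified backward sweep
        access(i)
    return ops


if __name__ == "__main__":
    print(simulate_chain(16, 17, lru_policy))  # budget large enough to avoid eviction
    print(simulate_chain(16, 4, lru_policy))   # tight budget forces recomputation
```

When the budget is at least `n_layers + 1`, nothing is evicted and the count equals `n_layers`. Under a tighter budget the count grows, and how fast it grows depends on the eviction policy, which is why the policy is passed in as a parameter here.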
-
Relay: A High-Level Compiler for Deep Learning
Authors:
Jared Roesch,
Steven Lyubomirsky,
Marisa Kirisame,
Logan Weber,
Josh Pollock,
Luis Vega,
Ziheng Jiang,
Tianqi Chen,
Thierry Moreau,
Zachary Tatlock
Abstract:
Frameworks for writing, compiling, and optimizing deep learning (DL) models have recently enabled progress in areas like computer vision and natural language processing. Extending these frameworks to accommodate the rapidly diversifying landscape of DL models and hardware platforms presents challenging tradeoffs between expressivity, composability, and portability. We present Relay, a new compiler framework for DL. Relay's functional, statically typed intermediate representation (IR) unifies and generalizes existing DL IRs to express state-of-the-art models. The introduction of Relay's expressive IR requires careful design of domain-specific optimizations, addressed via Relay's extension mechanisms. Using these extension mechanisms, Relay supports a unified compiler that can target a variety of hardware platforms. Our evaluation demonstrates Relay's competitive performance for a broad class of models and devices (CPUs, GPUs, and emerging accelerators). Relay's design demonstrates how a unified IR can provide expressivity, composability, and portability without compromising performance.
Submitted 24 August, 2019; v1 submitted 17 April, 2019;
originally announced April 2019.
-
Relay: A New IR for Machine Learning Frameworks
Authors:
Jared Roesch,
Steven Lyubomirsky,
Logan Weber,
Josh Pollock,
Marisa Kirisame,
Tianqi Chen,
Zachary Tatlock
Abstract:
Machine learning powers diverse services in industry including search, translation, recommendation systems, and security. The scale and importance of these models require that they be efficient, expressive, and portable across an array of heterogeneous hardware devices. These constraints are often at odds; in order to better accommodate them we propose a new high-level intermediate representation (IR) called Relay. Relay is being designed as a purely-functional, statically-typed language with the goal of balancing efficient compilation, expressiveness, and portability. We discuss the goals of Relay and highlight its important design constraints. Our prototype is part of the open source NNVM compiler framework, which powers Amazon's deep learning framework MxNet.
Submitted 25 September, 2018;
originally announced October 2018.