-
VaPr: Variable-Precision Tensors to Accelerate Robot Motion Planning
Authors:
Yu-Shun Hsiao,
Siva Kumar Sastry Hari,
Balakumar Sundaralingam,
Jason Yik,
Thierry Tambe,
Charbel Sakr,
Stephen W. Keckler,
Vijay Janapa Reddi
Abstract:
High-dimensional motion generation requires numerical precision for smooth, collision-free solutions. Typically, double-precision or single-precision floating-point (FP) formats are utilized. Using these for big tensors imposes a strain on the memory bandwidth provided by the devices and alters the memory footprint, hence limiting their applicability to low-power edge devices needed for mobile robots. The uniform application of reduced precision can be advantageous but severely degrades solutions. Using decreased precision data types for important tensors, we propose to accelerate motion generation by removing memory bottlenecks. We propose variable-precision (VaPr) search optimization to determine the appropriate precision for large tensors from a vast search space of approximately 4 million unique combinations for FP data types across the tensors. To obtain the efficiency gains, we exploit existing platform support for an out-of-the-box GPU speedup and evaluate prospective precision converter units for GPU types that are not currently supported. Our experimental results on 800 planning problems for the Franka Panda robot on the MotionBenchmaker dataset across 8 environments show that a 4-bit FP format is sufficient for the largest set of tensors in the motion generation stack. With the software-only solution, VaPr achieves 6.3% and 6.3% speedups on average for a significant portion of motion generation over the SOTA solution (CuRobo) on Jetson Orin and RTX2080 Ti GPU, respectively, and 9.9%, 17.7% speedups with the FP converter.
Submitted 11 October, 2023;
originally announced October 2023.
-
CAMEL: Co-Designing AI Models and Embedded DRAMs for Efficient On-Device Learning
Authors:
Sai Qian Zhang,
Thierry Tambe,
Nestor Cuevas,
Gu-Yeon Wei,
David Brooks
Abstract:
On-device learning allows AI models to adapt to user data, thereby enhancing service quality on edge platforms. However, training AI on resource-limited devices poses significant challenges due to the demanding computing workload and the substantial memory consumption and data access required by deep neural networks (DNNs). To address these issues, we propose utilizing embedded dynamic random-access memory (eDRAM) as the primary storage medium for transient training data. In comparison to static random-access memory (SRAM), eDRAM provides higher storage density and lower leakage power, resulting in reduced access cost and power leakage. Nevertheless, maintaining the integrity of the stored data requires periodic, power-hungry refresh operations that could degrade system performance.
To minimize the occurrence of expensive eDRAM refresh operations, it is beneficial to shorten the lifetime of stored data during the training process. To achieve this, we adopt the principles of algorithm and hardware co-design, introducing a family of reversible DNN architectures that effectively decrease data lifetime and storage costs throughout training. Additionally, we present a highly efficient on-device training engine named CAMEL, which leverages eDRAM as the primary on-chip memory. This engine enables efficient on-device training with significantly reduced memory usage and off-chip DRAM traffic while maintaining superior training accuracy. We evaluate our CAMEL system on multiple DNNs with different datasets, demonstrating a $2.5\times$ speedup of the training process and $2.8\times$ training energy savings compared with the other baseline hardware platforms.
Submitted 22 December, 2023; v1 submitted 4 May, 2023;
originally announced May 2023.
-
Application-Level Validation of Accelerator Designs Using a Formal Software/Hardware Interface
Authors:
Bo-Yuan Huang,
Steven Lyubomirsky,
Yi Li,
Mike He,
Gus Henry Smith,
Thierry Tambe,
Akash Gaonkar,
Vishal Canumalla,
Andrew Cheung,
Gu-Yeon Wei,
Aarti Gupta,
Zachary Tatlock,
Sharad Malik
Abstract:
Ideally, accelerator development should be as easy as software development. Several recent design languages/tools are working toward this goal, but actually testing early designs on real applications end-to-end remains prohibitively difficult due to the costs of building specialized compiler and simulator support. We propose a new first-in-class, mostly automated methodology termed "3LA" to enable end-to-end testing of prototype accelerator designs on unmodified source applications. A key contribution of 3LA is the use of a formal software/hardware interface that specifies an accelerator's operations and their semantics. Specifically, we leverage the Instruction-Level Abstraction (ILA) formal specification for accelerators that has been successfully used thus far for accelerator implementation verification. We show how the ILA for accelerators serves as a software/hardware interface, similar to the Instruction Set Architecture (ISA) for processors, that can be used for automated development of compilers and instruction-level simulators. Another key contribution of this work is to show how ILA-based accelerator semantics enables extending recent work on equality saturation to auto-generate basic compiler support for prototype accelerators in a technique we term "flexible matching." By combining flexible matching with simulators auto-generated from ILA specifications, our approach enables end-to-end evaluation with modest engineering effort. We detail several case studies of 3LA, which uncovered an unknown flaw in a recently published accelerator and facilitated its fix.
Submitted 22 August, 2023; v1 submitted 28 February, 2022;
originally announced March 2022.
-
AutoSoC: Automating Algorithm-SOC Co-design for Aerial Robots
Authors:
Srivatsan Krishnan,
Thierry Tambe,
Zishen Wan,
Vijay Janapa Reddi
Abstract:
Aerial autonomous machines (drones) have a plethora of promising applications and use cases. While the popularity of these autonomous machines continues to grow, there are many challenges, such as endurance and agility, that could hinder the practical deployment of these machines. The closed-loop control frequency must be high to achieve high agility. However, given the resource-constrained nature of the aerial robot, achieving high control loop frequency is hugely challenging and requires careful co-design of algorithm and onboard computer. Such an effort requires infrastructures that bridge various domains, namely robotics, machine learning, and system architecture design. To that end, we present AutoSoC, a framework for co-designing algorithms as well as hardware accelerator systems for end-to-end learning-based aerial autonomous machines. We demonstrate the efficacy of the framework by training an obstacle avoidance algorithm for aerial robots to navigate in a densely cluttered environment. For the best performing algorithm, our framework generates various accelerator design candidates with varying performance, area, and power consumption. The framework also runs the ASIC flow of place and route and generates a layout of the floorplanned accelerator, which can be used to tape out the final hardware chip.
Submitted 12 September, 2021;
originally announced September 2021.
-
Quantifying and Maximizing the Benefits of Back-End Noise Adaption on Attention-Based Speech Recognition Models
Authors:
Coleman Hooper,
Thierry Tambe,
Gu-Yeon Wei
Abstract:
This work analyzes how attention-based Bidirectional Long Short-Term Memory (BLSTM) models adapt to noise-augmented speech. We identify crucial components for noise adaptation in BLSTM models by freezing model components during fine-tuning. We first freeze larger model subnetworks and then pursue a fine-grained freezing approach in the encoder after identifying its importance for noise adaptation. The first encoder layer is shown to be crucial for noise adaptation, and the weights are shown to be more important than the other layers. Appreciable accuracy benefits are identified when fine-tuning on a target noisy environment from a model pretrained with noisy speech relative to fine-tuning from a model pretrained with only clean speech when tested on the target noisy environment. For this analysis, we produce our own dataset augmentation tool and it is open-sourced to encourage future efforts in exploring noise adaptation in ASR.
Submitted 23 September, 2021; v1 submitted 3 May, 2021;
originally announced May 2021.
-
EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference
Authors:
Thierry Tambe,
Coleman Hooper,
Lillian Pentecost,
Tianyu Jia,
En-Yu Yang,
Marco Donato,
Victor Sanh,
Paul N. Whatmough,
Alexander M. Rush,
David Brooks,
Gu-Yeon Wei
Abstract:
Transformer-based language models such as BERT provide significant accuracy improvement for a multitude of natural language processing (NLP) tasks. However, their hefty computational and memory demands make them challenging to deploy to resource-constrained edge platforms with strict latency requirements. We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization for multi-task NLP. EdgeBERT employs entropy-based early exit predication in order to perform dynamic voltage-frequency scaling (DVFS), at a sentence granularity, for minimal energy consumption while adhering to a prescribed target latency. Computation and memory footprint overheads are further alleviated by employing a calibrated combination of adaptive attention span, selective network pruning, and floating-point quantization. Furthermore, in order to maximize the synergistic benefits of these algorithms in always-on and intermediate edge computing settings, we specialize a 12nm scalable hardware accelerator system, integrating a fast-switching low-dropout voltage regulator (LDO), an all-digital phase-locked loop (ADPLL), as well as high-density embedded non-volatile memories (eNVMs) wherein the sparse floating-point bit encodings of the shared multi-task parameters are carefully stored. Altogether, latency-aware multi-task NLP inference acceleration on the EdgeBERT hardware system generates up to 7x, 2.5x, and 53x lower energy compared to the conventional inference without early stopping, the latency-unbounded early exit approach, and CUDA adaptations on an Nvidia Jetson Tegra X2 mobile GPU, respectively.
Submitted 5 September, 2021; v1 submitted 28 November, 2020;
originally announced November 2020.
-
AdaptivFloat: A Floating-point based Data Type for Resilient Deep Learning Inference
Authors:
Thierry Tambe,
En-Yu Yang,
Zishen Wan,
Yuntian Deng,
Vijay Janapa Reddi,
Alexander Rush,
David Brooks,
Gu-Yeon Wei
Abstract:
Conventional hardware-friendly quantization methods, such as fixed-point or integer, tend to perform poorly at very low word sizes as their shrinking dynamic ranges cannot adequately capture the wide data distributions commonly seen in sequence transduction models. We present AdaptivFloat, a floating-point inspired number representation format for deep learning that dynamically maximizes and optimally clips its available dynamic range, at a layer granularity, in order to create faithful encoding of neural network parameters. AdaptivFloat consistently produces higher inference accuracies compared to block floating-point, uniform, IEEE-like float or posit encodings at very low precision ($\leq$ 8-bit) across a diverse set of state-of-the-art neural network topologies. And notably, AdaptivFloat is seen surpassing baseline FP32 performance by up to +0.3 in BLEU score and -0.75 in word error rate at weight bit widths that are $\leq$ 8-bit. Experimental results on a deep neural network (DNN) hardware accelerator, exploiting AdaptivFloat logic in its computational datapath, demonstrate per-operation energy and area that is 0.9$\times$ and 1.14$\times$, respectively, that of equivalent bit width integer-based accelerator variants.
Submitted 11 February, 2020; v1 submitted 29 September, 2019;
originally announced September 2019.
-
MASR: A Modular Accelerator for Sparse RNNs
Authors:
Udit Gupta,
Brandon Reagen,
Lillian Pentecost,
Marco Donato,
Thierry Tambe,
Alexander M. Rush,
Gu-Yeon Wei,
David Brooks
Abstract:
Recurrent neural networks (RNNs) are becoming the de facto solution for speech recognition. RNNs exploit long-term temporal relationships in data by applying repeated, learned transformations. Unlike fully-connected (FC) layers with single vector matrix operations, RNN layers consist of hundreds of such operations chained over time. This poses challenges unique to RNNs that are not found in convolutional neural networks (CNNs) or FC models, namely large dynamic activation. In this paper we present MASR, a principled and modular architecture that accelerates bidirectional RNNs for on-chip ASR. MASR is designed to exploit sparsity in both dynamic activations and static weights. The architecture is enhanced by a series of dynamic activation optimizations that enable compact storage, ensure no energy is wasted computing null operations, and maintain high MAC utilization for highly parallel accelerator designs. In comparison to current state-of-the-art sparse neural network accelerators (e.g., EIE), MASR provides 2x area, 3x energy, and 1.6x performance benefits. The modular nature of MASR enables designs that efficiently scale from resource-constrained low-power IoT applications to large-scale, highly parallel datacenter deployments.
Submitted 23 August, 2019;
originally announced August 2019.
-
First-principles calculations of step formation energies and step interactions on TiN(001)
Authors:
C. V. Ciobanu,
D. T. Tambe,
V. B. Shenoy
Abstract:
We study the formation energies and repulsive interactions of monatomic steps on the TiN(001) surface, using density functional total-energy calculations. The calculated formation energy of [100] oriented steps agrees well with recently reported experimental values; these steps are shown to have a rumpled structure, with the Ti atoms undergoing larger displacements than the N atoms. For steps that are parallel to [110], our calculations predict a nitrogen (N) termination, as the corresponding formation energy is several hundred meV/Å smaller than that of Ti-terminated steps.
Submitted 24 October, 2004; v1 submitted 30 September, 2004;
originally announced October 2004.
-
Influence of step-edge barriers on the morphological relaxation of nanoscale ripples on crystal surfaces
Authors:
V. B. Shenoy,
A. Ramasubramaniam,
H. Ramanarayan,
D. T. Tambe,
W-L. Chan,
E. Chason
Abstract:
We show that the decay of sinusoidal ripples on crystal surfaces, where mass transport is limited by the attachment and detachment of atoms at the step-edges, is remarkably different from the decay behavior that has been reported until now. Unlike the decreasing or at most constant rate of amplitude decay of sinusoidal profiles observed in earlier work, we find that the decay rate increases with decreasing amplitude in this kinetic regime. The rate of shape invariant amplitude relaxation is shown to be inversely proportional to both the square of the wavelength and the current amplitude. We have also carried out numerical simulations of the relaxation of realistic sputter ripples.
Submitted 29 April, 2004;
originally announced April 2004.
-
On the energetic origin of self-limiting trenches formed around Ge/Si quantum dots
Authors:
D. T. Tambe,
V. B. Shenoy
Abstract:
At high growth temperatures, the misfit strain at the boundary of Ge quantum dots on Si(001) is relieved by formation of trenches around the base of the islands. The depth of the trenches has been observed to saturate at a level that depends on the base-width of the islands. Using finite element simulations, we show that the self-limiting nature of trench depth is due to a competition between the elastic relaxation energy gained by the formation of the trench and the surface energy cost for creating the trench. Our simulations predict a linear increase of the trench depth with the island radius, in quantitative agreement with the experimental observations of Drucker and coworkers.
Submitted 26 April, 2004;
originally announced April 2004.
-
Comparative study of dimer vacancies and dimer-vacancy lines on Si(001) and Ge(001)
Authors:
C. V. Ciobanu,
D. T. Tambe,
V. B. Shenoy
Abstract:
Although the clean Si(001) and Ge(001) surfaces are very similar, experiments to date have shown that dimer-vacancy (DV) defects self-organize into vacancy lines (VLs) on Si(001), but not on Ge(001). In this paper, we perform empirical-potential calculations aimed at understanding the differences between the vacancies on Si(001) and Ge(001). We identify three energetic parameters that characterize the DVs on the two surfaces: the formation energy of a single DV, the attraction between two DVs in adjacent dimer rows, and the strain sensitivity of the formation energy of DVs and VLs. At the empirical level of treatment of the atomic interactions (Tersoff potentials), all three parameters are favorable for the self-assembly of DVs on the Si(001) surface rather than on Ge(001). The most significant difference between the defects on Si(001) and on Ge(001) concerns the formation energy of single DVs, which is three times larger in the latter case. By calculating the strain-dependent formation energies of DVs and VLs, we propose that the experimental observation of self-assembly of vacancies on clean Ge(001) could be achieved by applying compressive strains of the order of 2%.
Submitted 18 March, 2004; v1 submitted 30 October, 2003;
originally announced October 2003.
-
Atomic-scale perspective on the origin of attractive step interactions on Si(113)
Authors:
C. V. Ciobanu,
D. T. Tambe,
V. B. Shenoy,
C. Z. Wang,
K. M. Ho
Abstract:
Recent experiments have shown that steps on Si(113) surfaces self-organize into bunches due to a competition between long-range repulsive and short-range attractive interactions. Using empirical and tight-binding interatomic potentials, we investigate the physical origin of the short-range attraction, and report the formation and interaction energies of steps. We find that the short-range attraction between steps is due to the annihilation of force monopoles at their edges as they combine to form bunches. Our results for the strengths of the attractive interactions are consistent with the values determined from experimental studies on kinetics of faceting.
Submitted 30 October, 2003; v1 submitted 24 April, 2003;
originally announced April 2003.