-
A Fresh Perspective on DNN Accelerators by Performing Holistic Analysis Across Paradigms
Authors:
Tom Glint,
Chandan Kumar Jha,
Manu Awasthi,
Joycee Mekie
Abstract:
Traditional computers with von Neumann architecture are unable to meet the latency and scalability challenges of Deep Neural Network (DNN) workloads. Various DNN accelerators based on Conventional compute Hardware Accelerator (CHA), Near-Data-Processing (NDP) and Processing-in-Memory (PIM) paradigms have been proposed to meet these challenges. Our goal in this work is to perform a rigorous comparison among the state-of-the-art accelerators from these DNN accelerator paradigms. For our analysis, we use unique layers from MobileNet, ResNet, BERT, and DLRM of the MLPerf Inference benchmark. The detailed models are based on hardware-realized state-of-the-art designs. We observe that for memory-intensive Fully Connected Layer (FCL) DNNs, the NDP-based accelerator is 10.6x faster than the state-of-the-art CHA and 39.9x faster than the PIM-based accelerator for inference. For compute-intensive image classification and object detection DNNs, the state-of-the-art CHA is ~10x faster than the NDP-based accelerator and ~2000x faster than the PIM-based accelerator for inference. PIM-based accelerators are suitable for DNN applications where energy is a constraint (~2.7x and ~21x lower energy for CNN and FCL applications, respectively, than conventional ASIC systems). Further, through a detailed sensitivity analysis of relevant components in CHA-, NDP-, and PIM-based accelerators, we identify architectural changes (such as increasing memory bandwidth and buffer reorganization) that can increase throughput (up to a linear increase) and lower energy (up to a linear decrease) for ML applications.
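The distinction the abstract draws between memory-intensive FCL workloads and compute-intensive convolutional workloads can be illustrated with a rough arithmetic-intensity estimate (multiply-accumulates per byte moved). The sketch below is not taken from the paper: the function names, the layer shapes, the 1-byte-per-value assumption, and the simplification that every weight and activation is moved exactly once are all illustrative assumptions.

```python
def fc_layer_intensity(in_features, out_features, batch=1, bytes_per_value=1):
    """Return (MACs, bytes moved, MACs per byte) for a fully connected layer.

    Assumes each weight and activation is transferred exactly once.
    """
    macs = batch * in_features * out_features
    weight_bytes = in_features * out_features * bytes_per_value
    act_bytes = batch * (in_features + out_features) * bytes_per_value
    total_bytes = weight_bytes + act_bytes
    return macs, total_bytes, macs / total_bytes


def conv_layer_intensity(height, width, c_in, c_out, kernel, batch=1,
                         bytes_per_value=1):
    """Return (MACs, bytes moved, MACs per byte) for a stride-1 convolution.

    Assumes 'same' padding, so input and output feature maps share
    the same spatial size (height x width).
    """
    macs = batch * height * width * c_in * c_out * kernel * kernel
    weight_bytes = kernel * kernel * c_in * c_out * bytes_per_value
    act_bytes = batch * height * width * (c_in + c_out) * bytes_per_value
    total_bytes = weight_bytes + act_bytes
    return macs, total_bytes, macs / total_bytes


if __name__ == "__main__":
    # Hypothetical layer shapes chosen only for illustration.
    _, _, fc_ratio = fc_layer_intensity(1024, 1024)
    _, _, conv_ratio = conv_layer_intensity(56, 56, 64, 64, 3)
    print(f"FC layer:   {fc_ratio:.2f} MACs/byte")
    print(f"Conv layer: {conv_ratio:.2f} MACs/byte")
```

Under these assumptions, a batch-1 fully connected layer reuses each weight only once and so performs roughly one MAC per byte. The convolution reuses each weight across every spatial position, which is one common explanation for why bandwidth-oriented designs tend to help FCL workloads more.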
Submitted 10 August, 2022;
originally announced August 2022.
-
Learning by Cheating: An End-to-End Zero Shot Framework for Autonomous Drone Navigation
Authors:
Praveen Venkatesh,
Viraj Shah,
Vrutik Shah,
Yash Kamble,
Joycee Mekie
Abstract:
This paper proposes a novel framework for autonomous drone navigation through a cluttered environment. Control policies are learnt in a low-level environment during training and are applied to a complex environment during inference. The controller learnt in the training environment is tricked into believing that the robot is still in the training environment when it is actually navigating in a more complex environment. The framework presented in this paper can be adapted to reuse simple policies in more complex tasks. We also show that the framework can be used as an interpretation tool for reinforcement learning algorithms.
Submitted 11 November, 2021;
originally announced November 2021.
-
HyGain: High Performance, Energy-Efficient Hybrid Gain Cell based Cache Hierarchy
Authors:
Sarabjeet Singh,
Neelam Surana,
Pranjali Jain,
Joycee Mekie,
Manu Awasthi
Abstract:
In this paper, we propose a 'full-stack' solution to designing high capacity and low latency on-chip cache hierarchies by starting at the circuit level of the hardware design stack. First, we propose a novel Gain Cell (GC) design using FDSOI. The GC has several desirable characteristics, including ~50% higher storage density and ~50% lower dynamic energy as compared to the traditional 6T SRAM, even after accounting for peripheral circuit overheads. We also exploit back-gate bias to increase retention time to 1.12 ms (~60x that of eDRAM) which, combined with optimizations like staggered refresh, makes it an ideal candidate to architect all levels of on-chip caches. We show that compared to 6T SRAM, for a given area budget, GC-based caches, on average, provide 29% and 36% increases in IPC for single- and multi-programmed workloads, respectively, on contemporary workloads including SPEC CPU2017. We also observe dynamic energy savings of 42% and 34% for single- and multi-programmed workloads, respectively.
We utilize the inherent properties of the proposed GC, including decoupled read and write bitlines, to devise optimizations to save precharge energy and architect GC caches with better energy and performance characteristics. Finally, in a quest to utilize the best of all worlds, we combine GC with STT-RAM to create hybrid hierarchies. We show that a hybrid hierarchy with GC caches at L1 and L2, and an LLC split between GC and STT-RAM, with asymmetric write optimization enabled, is able to provide a 54% benefit in energy-delay product (EDP) as compared to an all-SRAM design, and 13% as compared to an all-GC cache hierarchy, averaged across multi-programmed workloads.
Submitted 6 October, 2021; v1 submitted 4 October, 2021;
originally announced October 2021.
-
Analysing digital in-memory computing for advanced finFET node
Authors:
Veerendra S Devaraddi,
Joycee M. Mekie
Abstract:
Digital in-memory computing improves the energy efficiency and throughput of data-intensive processes, which otherwise incur memory thrashing and, as a result, repeated accesses to the same memory locations in a von Neumann architecture. Digital in-memory computing involves accessing multiple SRAM cells simultaneously, which may result in a bit flip if the accesses are not timed carefully. Therefore, we discuss the transient voltage characteristics of the bitlines during an SRAM compute operation. To improve the packing density and also avoid MOSFET down-scaling issues, we use a 7-nm predictive PDK based on a finFET node. The finFET process has discrete fins and a lower supply voltage, which makes the design of in-memory compute SRAM difficult. In this paper, we design a 6T SRAM cell in a 7-nm finFET node and compare its SNMs with those of a UMC 28nm node implementation. Further, we design and simulate the rest of the SRAM peripherals and the in-memory computation for an advanced finFET node.
Submitted 10 August, 2021; v1 submitted 2 August, 2021;
originally announced August 2021.
-
Zero Aware Configurable Data Encoding by Skipping Transfer for Error Resilient Applications
Authors:
Chandan Kumar Jha,
Shreyas Singh,
Riddhi Thakker,
Manu Awasthi,
Joycee Mekie
Abstract:
In this paper, we propose Zero Aware Configurable Data Encoding by Skipping Transfer (ZAC-DEST), a data encoding scheme to reduce the energy consumption of DRAM channels, specifically targeted towards approximate computing and error resilient applications. ZAC-DEST exploits the similarity between recent data transfers across channels and information about the error resilience behavior of applications to reduce on-die termination and switching energy by reducing the number of 1's transmitted over the channels. ZAC-DEST also provides a number of knobs for trading off the application's accuracy for energy savings, and vice versa, and can be applied to both training and inference.
We apply ZAC-DEST to five machine learning applications. On average, across all applications and configurations, we observed a reduction of 40% in termination energy and 37% in switching energy as compared to the state-of-the-art data encoding technique BD-Coder, with an average output quality loss of 10%. We show that if both training and testing are done assuming the presence of ZAC-DEST, the output quality of the applications can be improved by up to 9 times as compared to when ZAC-DEST is applied only during testing, leading to energy savings during training and inference with increased output quality.
Submitted 16 May, 2021;
originally announced May 2021.
-
Fixed-Posit: A Floating-Point Representation for Error-Resilient Applications
Authors:
Varun Gohil,
Sumit Walia,
Joycee Mekie,
Manu Awasthi
Abstract:
Today, almost all computer systems use IEEE-754 floating point to represent real numbers. Recently, posit was proposed as an alternative to IEEE-754 floating point as it has better accuracy and a larger dynamic range. The configurable nature of posit, with a varying number of regime and exponent bits, has acted as a deterrent to its adoption. To overcome this shortcoming, we propose the fixed-posit representation, where the number of regime and exponent bits is fixed, and present the design of a fixed-posit multiplier. We evaluate the fixed-posit multiplier on error-resilient applications from the AxBench and OpenBLAS benchmarks as well as on neural networks. The proposed fixed-posit multiplier achieves savings of 47%, 38.5%, and 22% in power, area, and delay, respectively, when compared to posit multipliers, and up to 70%, 66%, and 26% savings in power, area, and delay, respectively, when compared to a 32-bit IEEE-754 multiplier. These savings are accompanied by minimal output quality loss (1.2% average relative error) across OpenBLAS and AxBench workloads. Further, for neural networks like ResNet-18 on ImageNet, we observe a negligible accuracy loss (0.12%) when using the fixed-posit multiplier.
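To make the "varying number of regime bits" concrete, the following is a minimal sketch of decoding a standard posit, where the regime is a run-length-encoded field whose width depends on the value. It illustrates the baseline posit format only; it is not the fixed-posit encoding proposed in the paper, whose field widths are not given in the abstract. The function name `decode_posit` and the default parameters `n=8`, `es=1` are illustrative choices.

```python
def decode_posit(bits, n=8, es=1):
    """Decode an n-bit standard posit with es exponent bits into a float."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")  # NaR ("not a real") pattern: 100...0

    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & mask  # negative posits are stored in two's complement

    # Bits after the sign bit, most significant first.
    body = [(bits >> i) & 1 for i in range(n - 2, -1, -1)]

    # Regime: a run of identical bits, ended by the opposite bit or the word end.
    first = body[0]
    run = 1
    while run < len(body) and body[run] == first:
        run += 1
    k = run - 1 if first == 1 else -run

    rest = body[run + 1:]  # skip the regime terminator bit, if present

    # Exponent: up to es bits; bits cut off by a long regime read as zero.
    exp_bits = rest[:es]
    exp_bits = exp_bits + [0] * (es - len(exp_bits))
    exponent = 0
    for b in exp_bits:
        exponent = (exponent << 1) | b

    # Fraction: whatever bits remain, with an implicit leading 1.
    fraction = 0.0
    for i, b in enumerate(rest[es:], start=1):
        fraction += b * 2.0 ** -i

    scale = k * (1 << es) + exponent
    value = (1.0 + fraction) * 2.0 ** scale
    return -value if sign else value


if __name__ == "__main__":
    for pattern in (0x20, 0x40, 0x48, 0x50, 0x60, 0x7F):
        print(f"{pattern:#04x} -> {decode_posit(pattern)}")
```

Because the regime length changes from value to value, the positions of the exponent and fraction fields shift as well, so hardware has to locate them at run time. Fixing the regime and exponent widths, as the abstract describes, removes that data-dependent step.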
Submitted 10 April, 2021;
originally announced April 2021.
-
SERAD: Soft Error Resilient Asynchronous Design using a Bundled Data Protocol
Authors:
Sai Aparna Aketi,
Smriti Gupta,
Huimei Cheng,
Joycee Mekie,
Peter A. Beerel
Abstract:
The risk of soft errors due to radiation continues to be a significant challenge for engineers trying to build systems that can handle harsh environments. Building systems that are Radiation Hardened by Design (RHBD) is the preferred approach, but existing techniques are expensive in terms of performance, power, and/or area. This paper introduces a novel soft-error resilient asynchronous bundled-data design template, SERAD, which uses a combination of temporal and spatial redundancy to mitigate Single Event Transients (SETs) and Single Event Upsets (SEUs). SERAD uses Error Detecting Logic (EDL) to detect SETs at the inputs of sequential elements and correct them via re-sampling. Because SERAD only pays the delay penalty in the presence of an SET, which rarely occurs, its average performance is comparable to the baseline synchronous design. We tested the SERAD design using a combination of SPICE and Verilog simulations and evaluated its impact on area, frequency, and power on an open-core MIPS-like processor using an NCSU 45nm cell library. Our post-synthesis results show that the SERAD design consumes less than half of the area of Triple Modular Redundancy (TMR), exhibits significantly less performance degradation than Glitch Filtering (GF), and consumes no more total power than the baseline unhardened design.
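For context on the TMR baseline mentioned above, here is a minimal sketch of the bitwise majority vote that Triple Modular Redundancy conventionally applies to three redundant copies of a value. It illustrates the comparison point only, not the SERAD template itself; the function name `tmr_vote` and the sample values are hypothetical.

```python
def tmr_vote(a, b, c):
    """Return the bitwise majority of three redundant integer copies.

    Each output bit is 1 when at least two of the three inputs have a 1
    in that position, so an upset confined to one copy is masked.
    """
    return (a & b) | (b & c) | (a & c)


if __name__ == "__main__":
    golden = 0b1011_0010
    upset = golden ^ 0b0000_1000  # one copy with a single flipped bit
    print(bin(tmr_vote(golden, upset, golden)))
```

Triplicating the logic and adding a voter is what makes TMR costly in area, the overhead the abstract reports SERAD as reducing by more than half.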
Submitted 12 January, 2020;
originally announced January 2020.