Search | arXiv e-print repository

A Fresh Perspective on DNN Accelerators by Performing Holistic Analysis Across Paradigms

Authors: Tom Glint, Chandan Kumar Jha, Manu Awasthi, Joycee Mekie

Abstract: Traditional computers with von Neumann architecture are unable to meet the latency and scalability challenges of Deep Neural Network (DNN) workloads. Various DNN accelerators based on Conventional compute Hardware Accelerator (CHA), Near-Data-Processing (NDP) and Processing-in-Memory (PIM) paradigms have been proposed to meet these challenges. Our goal in this work is to perform a rigorous compari… ▽ More Traditional computers with von Neumann architecture are unable to meet the latency and scalability challenges of Deep Neural Network (DNN) workloads. Various DNN accelerators based on Conventional compute Hardware Accelerator (CHA), Near-Data-Processing (NDP) and Processing-in-Memory (PIM) paradigms have been proposed to meet these challenges. Our goal in this work is to perform a rigorous comparison among the state-of-the-art accelerators from DNN accelerator paradigms, we have used unique layers from MobileNet, ResNet, BERT, and DLRM of MLPerf Inference benchmark for our analysis. The detailed models are based on hardware-realized state-of-the art designs. We observe that for memory-intensive Fully Connected Layer (FCL) DNNs, NDP based accelerator is 10.6x faster than the state-of-the-art CHA and 39.9x faster than PIM based accelerator for inferencing. For compute-intensive image classification and object detection DNNs, the state-of-the-art CHA is ~10x faster than NDP and ~2000x faster than the PIM-based accelerator for inferencing. PIM-based accelerators are suitable for DNN applications where energy is a constraint (~2.7x and ~21x lower energy for CNN and FCL applications, respectively, than conventional ASIC systems). Further, we identify architectural changes (such as increasing memory bandwidth, buffer reorganization) that can increase throughput (up to linear increase) and lower energy (up to linear decrease) for ML applications with a detailed sensitivity analysis of relevant components in CHA, NDP and PIM based accelerators. △ Less

Submitted 10 August, 2022; originally announced August 2022.

arXiv:2110.01208 [pdf, other]

HyGain: High Performance, Energy-Efficient Hybrid Gain Cell based Cache Hierarchy

Authors: Sarabjeet Singh, Neelam Surana, Pranjali Jain, Joycee Mekie, Manu Awasthi

Abstract: In this paper, we propose a 'full-stack' solution to designing high capacity and low latency on-chip cache hierarchies by starting at the circuit level of the hardware design stack. First, we propose a novel Gain Cell (GC) design using FDSOI. The GC has several desirable characteristics, including ~50% higher storage density and ~50% lower dynamic energy as compared to the traditional 6T SRAM, eve… ▽ More In this paper, we propose a 'full-stack' solution to designing high capacity and low latency on-chip cache hierarchies by starting at the circuit level of the hardware design stack. First, we propose a novel Gain Cell (GC) design using FDSOI. The GC has several desirable characteristics, including ~50% higher storage density and ~50% lower dynamic energy as compared to the traditional 6T SRAM, even after accounting for peripheral circuit overheads. We also exploit back-gate bias to increase retention time to 1.12 ms (~60x of eDRAM) which, combined with optimizations like staggered refresh, makes it an ideal candidate to architect all levels of on-chip caches. We show that compared to 6T SRAM, for a given area budget, GC based caches, on average, provide 29% and 36% increase in IPC for single- and multi-programmed workloads, respectively on contemporary workloads including SPEC CPU2017. We also observe dynamic energy savings of 42% and 34% for single- and multi-programmed workloads, respectively. We utilize the inherent properties of the proposed GC, including decoupled read and write bitlines to devise optimizations to save precharge energy and architect GC caches with better energy and performance characteristics. Finally, in a quest to utilize the best of all worlds, we combine GC with STT-RAM to create hybrid hierarchies. We show that a hybrid hierarchy with GC caches at L1 and L2, and an LLC split between GC and STT-RAM, with asymmetric write optimization enabled, is able to provide a 54% benefit in energy-delay product (EDP) as compared to an all-SRAM design, and 13% as compared to an all-GC cache hierarchy, averaged across multi-programmed workloads. △ Less

Submitted 6 October, 2021; v1 submitted 4 October, 2021; originally announced October 2021.

arXiv:2105.07432 [pdf, other]

Zero Aware Configurable Data Encoding by Skip** Transfer for Error Resilient Applications

Authors: Chandan Kumar Jha, Shreyas Singh, Riddhi Thakker, Manu Awasthi, Joycee Mekie

Abstract: In this paper, we propose Zero Aware Configurable Data Encoding by Skip** Transfer (ZAC-DEST), a data encoding scheme to reduce the energy consumption of DRAM channels, specifically targeted towards approximate computing and error resilient applications. ZAC-DEST exploits the similarity between recent data transfers across channels and information about the error resilience behavior of applicati… ▽ More In this paper, we propose Zero Aware Configurable Data Encoding by Skip** Transfer (ZAC-DEST), a data encoding scheme to reduce the energy consumption of DRAM channels, specifically targeted towards approximate computing and error resilient applications. ZAC-DEST exploits the similarity between recent data transfers across channels and information about the error resilience behavior of applications to reduce on-die termination and switching energy by reducing the number of 1's transmitted over the channels. ZAC-DEST also provides a number of knobs for trading off the application's accuracy for energy savings, and vice versa, and can be applied to both training and inference. We apply ZAC-DEST to five machine learning applications. On average, across all applications and configurations, we observed a reduction of $40$% in termination energy and $37$% in switching energy as compared to the state of the art data encoding technique BD-Coder with an average output quality loss of $10$%. We show that if both training and testing are done assuming the presence of ZAC-DEST, the output quality of the applications can be improved upto 9 times as compared to when ZAC-DEST is only applied during testing leading to energy savings during training and inference with increased output quality. △ Less

Submitted 16 May, 2021; originally announced May 2021.

arXiv:2104.04763 [pdf, other]

Fixed-Posit: A Floating-Point Representation for Error-Resilient Applications

Authors: Varun Gohil, Sumit Walia, Joycee Mekie, Manu Awasthi

Abstract: Today, almost all computer systems use IEEE-754 floating point to represent real numbers. Recently, posit was proposed as an alternative to IEEE-754 floating point as it has better accuracy and a larger dynamic range. The configurable nature of posit, with varying number of regime and exponent bits, has acted as a deterrent to its adoption. To overcome this shortcoming, we propose fixed-posit repr… ▽ More Today, almost all computer systems use IEEE-754 floating point to represent real numbers. Recently, posit was proposed as an alternative to IEEE-754 floating point as it has better accuracy and a larger dynamic range. The configurable nature of posit, with varying number of regime and exponent bits, has acted as a deterrent to its adoption. To overcome this shortcoming, we propose fixed-posit representation where the number of regime and exponent bits are fixed, and present the design of a fixed-posit multiplier. We evaluate the fixed-posit multiplier on error-resilient applications of AxBench and OpenBLAS benchmarks as well as neural networks. The proposed fixed-posit multiplier has 47%, 38.5%, 22% savings for power, area and delay respectively when compared to posit multipliers and up to 70%, 66%, 26% savings in power, area and delay respectively when compared to 32-bit IEEE-754 multiplier. These savings are accompanied with minimal output quality loss (1.2% average relative error) across OpenBLAS and AxBench workloads. Further, for neural networks like ResNet-18 on ImageNet we observe a negligible accuracy loss (0.12%) on using the fixed-posit multiplier. △ Less

Submitted 10 April, 2021; originally announced April 2021.

Comments: This is a pre-print for the paper version that has been accepted to TCAS-II, Express Briefs

ACM Class: C.1.3; C.0; B.0

arXiv:1910.00651 [pdf, other]

doi 10.1145/3297663.3310311

Memory Centric Characterization and Analysis of SPEC CPU2017 Suite

Authors: Sarabjeet Singh, Manu Awasthi

Abstract: In this paper we provide a comprehensive, memory-centric characterization of the SPEC CPU2017 benchmark suite, using a number of mechanisms including dynamic binary instrumentation, measurements on native hardware using hardware performance counters and OS based tools. We present a number of results including working set sizes, memory capacity consumption and, memory bandwidth utilization of var… ▽ More In this paper we provide a comprehensive, memory-centric characterization of the SPEC CPU2017 benchmark suite, using a number of mechanisms including dynamic binary instrumentation, measurements on native hardware using hardware performance counters and OS based tools. We present a number of results including working set sizes, memory capacity consumption and, memory bandwidth utilization of various workloads. Our experiments reveal that the SPEC CPU2017 workloads are surprisingly memory intensive, with approximately 50% of all dynamic instructions being memory intensive ones. We also show that there is a large variation in the memory footprint and bandwidth utilization profiles of the entire suite, with some benchmarks using as much as 16 GB of main memory and up to 2.3 GB/s of memory bandwidth. We also perform instruction execution and distribution analysis of the suite and find that the average instruction count for SPEC CPU2017 workloads is an order of magnitude higher than SPEC CPU2006 ones. In addition, we also find that FP benchmarks of the SPEC 2017 suite have higher compute requirements: on average, FP workloads execute three times the number of compute operations as compared to INT workloads. △ Less

Submitted 30 September, 2019; originally announced October 2019.

Comments: 12 pages, 133 figures, A short version of this work has been published at "Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering"

arXiv:1809.05928 [pdf, other]

doi 10.1109/TBDATA.2018.2871114

I/O Workload Management for All-Flash Datacenter Storage Systems Based on Total Cost of Ownership

Authors: Zhengyu Yang, Manu Awasthi, Mrinmoy Ghosh, Janki Bhimani, Ningfang Mi

Abstract: Recently, the capital expenditure of flash-based Solid State Driver (SSDs) keeps declining and the storage capacity of SSDs keeps increasing. As a result, all-flash storage systems have started to become more economically viable for large shared storage installations in datacenters, where metrics like Total Cost of Ownership (TCO) are of paramount importance. On the other hand, flash devices suffe… ▽ More Recently, the capital expenditure of flash-based Solid State Driver (SSDs) keeps declining and the storage capacity of SSDs keeps increasing. As a result, all-flash storage systems have started to become more economically viable for large shared storage installations in datacenters, where metrics like Total Cost of Ownership (TCO) are of paramount importance. On the other hand, flash devices suffer from write amplification, which, if unaccounted, can substantially increase the TCO of a storage system. In this paper, we first develop a TCO model for datacenter all-flash storage systems, and then plug a Write Amplification model (WAF) of NVMe SSDs we build based on empirical data into this TCO model. Our new WAF model accounts for workload characteristics like write rate and percentage of sequential writes. Furthermore, using both the TCO and WAF models as the optimization criterion, we design new flash resource management schemes (MINTCO) to guide datacenter managers to make workload allocation decisions under the consideration of TCO for SSDs. Based on that, we also develop MINTCO-RAID to support RAID SSDs and MINTCO-OFFLINE to optimize the offline workload-disk deployment problem during the initialization phase. Experimental results show that MINTCO can reduce the TCO and keep relatively high throughput and space utilization of the entire datacenter storage resources. △ Less

Submitted 8 July, 2019; v1 submitted 16 September, 2018; originally announced September 2018.

Showing 1–6 of 6 results for author: Awasthi, M