Search | arXiv e-print repository

Self-Adaptive Reconfigurable Arrays (SARA): Using ML to Assist Scaling GEMM Acceleration

Authors: Ananda Samajdar, Michael Pellauer, Tushar Krishna

Abstract: With increasing diversity in Deep Neural Network(DNN) models in terms of layer shapes and sizes, the research community has been investigating flexible/reconfigurable accelerator substrates. This line of research has opened up two challenges. The first is to determine the appropriate amount of flexibility within an accelerator array that that can trade-off the performance benefits versus the area… ▽ More With increasing diversity in Deep Neural Network(DNN) models in terms of layer shapes and sizes, the research community has been investigating flexible/reconfigurable accelerator substrates. This line of research has opened up two challenges. The first is to determine the appropriate amount of flexibility within an accelerator array that that can trade-off the performance benefits versus the area overheads of the reconfigurability. The second is being able to determine the right configuration of the array for the current DNN model and/or layer and reconfigure the accelerator at runtime. This work introduces a new class of accelerators that we call Self Adaptive Reconfigurable Array (SARA). SARA architectures comprise of both a reconfigurable array and a hardware unit capable of determining an optimized configuration for the array at runtime. We demonstrate an instance of SARA with an accelerator we call SAGAR, which introduces a novel reconfigurable systolic array that can be configured to work as a distributed collection of smaller arrays of various sizes or as a single array with flexible aspect ratios. We also develop a novel recommendation neural network called ADAPTNET which recommends an array configuration and dataflow for the current layer parameters. ADAPTNET runs on an integrated custom hardware ADAPTNETX that runs ADAPTNET at runtime and reconfigures the array, making the entire accelerator self-sufficient. SAGAR is capable of providing the same map** flexibility as a collection of 1024 4x4 arrays working as a distributed system while achieving 3.5x more power efficiency and 3.2x higher compute density Furthermore, the runtime achieved on the recommended parameters from ADAPTNET is 99.93% of the best achievable runtime. △ Less

Submitted 23 April, 2022; v1 submitted 12 January, 2021; originally announced January 2021.

arXiv:2012.12563 [pdf, other]

Architecture, Dataflow and Physical Design Implications of 3D-ICs for DNN-Accelerators

Authors: Jan Moritz Joseph, Ananda Samajdar, Lingjun Zhu, Rainer Leupers, Sung-Kyu Lim, Thilo Pionteck, Tushar Krishna

Abstract: The everlasting demand for higher computing power for deep neural networks (DNNs) drives the development of parallel computing architectures. 3D integration, in which chips are integrated and connected vertically, can further increase performance because it introduces another level of spatial parallelism. Therefore, we analyze dataflows, performance, area, power and temperature of such 3D-DNN-acce… ▽ More The everlasting demand for higher computing power for deep neural networks (DNNs) drives the development of parallel computing architectures. 3D integration, in which chips are integrated and connected vertically, can further increase performance because it introduces another level of spatial parallelism. Therefore, we analyze dataflows, performance, area, power and temperature of such 3D-DNN-accelerators. Monolithic and TSV-based stacked 3D-ICs are compared against 2D-ICs. We identify workload properties and architectural parameters for efficient 3D-ICs and achieve up to 9.14x speedup of 3D vs. 2D. We discuss area-performance trade-offs. We demonstrate applicability as the 3D-IC draws similar power as 2D-ICs and is not thermal limited. △ Less

Submitted 18 February, 2021; v1 submitted 23 December, 2020; originally announced December 2020.

arXiv:2011.14755 [pdf, other]

Dataflow-Architecture Co-Design for 2.5D DNN Accelerators using Wireless Network-on-Package

Authors: Robert Guirado, Hyoukjun Kwon, Sergi Abadal, Eduard Alarcón, Tushar Krishna

Abstract: Deep neural network (DNN) models continue to grow in size and complexity, demanding higher computational power to enable real-time inference. To efficiently deliver such computational demands, hardware accelerators are being developed and deployed across scales. This naturally requires an efficient scale-out mechanism for increasing compute density as required by the application. 2.5D integration… ▽ More Deep neural network (DNN) models continue to grow in size and complexity, demanding higher computational power to enable real-time inference. To efficiently deliver such computational demands, hardware accelerators are being developed and deployed across scales. This naturally requires an efficient scale-out mechanism for increasing compute density as required by the application. 2.5D integration over interposer has emerged as a promising solution, but as we show in this work, the limited interposer bandwidth and multiple hops in the Network-on-Package (NoP) can diminish the benefits of the approach. To cope with this challenge, we propose WIENNA, a wireless NoP-based 2.5D DNN accelerator. In WIENNA, the wireless NoP connects an array of DNN accelerator chiplets to the global buffer chiplet, providing high-bandwidth multicasting capabilities. Here, we also identify the dataflow style that most efficienty exploits the wireless NoP's high-bandwidth multicasting capability on each layer. With modest area and power overheads, WIENNA achieves 2.2X--5.1X higher throughput and 38.2% lower energy than an interposer-based NoP design. △ Less

Submitted 30 November, 2020; originally announced November 2020.

Comments: ASPDAC '21

arXiv:2009.02010 [pdf, other]

ConfuciuX: Autonomous Hardware Resource Assignment for DNN Accelerators using Reinforcement Learning

Authors: Sheng-Chun Kao, Geonhwa Jeong, Tushar Krishna

Abstract: DNN accelerators provide efficiency by leveraging reuse of activations/weights/outputs during the DNN computations to reduce data movement from DRAM to the chip. The reuse is captured by the accelerator's dataflow. While there has been significant prior work in exploring and comparing various dataflows, the strategy for assigning on-chip hardware resources (i.e., compute and memory) given a datafl… ▽ More DNN accelerators provide efficiency by leveraging reuse of activations/weights/outputs during the DNN computations to reduce data movement from DRAM to the chip. The reuse is captured by the accelerator's dataflow. While there has been significant prior work in exploring and comparing various dataflows, the strategy for assigning on-chip hardware resources (i.e., compute and memory) given a dataflow that can optimize for performance/energy while meeting platform constraints of area/power for DNN(s) of interest is still relatively unexplored. The design-space of choices for balancing compute and memory explodes combinatorially, as we show in this work (e.g., as large as O(10^(72)) choices for running \mobilenet), making it infeasible to do manual-tuning via exhaustive searches. It is also difficult to come up with a specific heuristic given that different DNNs and layer types exhibit different amounts of reuse. In this paper, we propose an autonomous strategy called ConfuciuX to find optimized HW resource assignments for a given model and dataflow style. ConfuciuX leverages a reinforcement learning method, REINFORCE, to guide the search process, leveraging a detailed HW performance cost model within the training loop to estimate rewards. We also augment the RL approach with a genetic algorithm for further fine-tuning. ConfuciuX demonstrates the highest sample-efficiency for training compared to other techniques such as Bayesian optimization, genetic algorithm, simulated annealing, and other RL methods. It converges to the optimized hardware configuration 4.7 to 24 times faster than alternate techniques. △ Less

Submitted 4 September, 2020; originally announced September 2020.

arXiv:2008.11881 [pdf, other]

CLAN: Continuous Learning using Asynchronous Neuroevolution on Commodity Edge Devices

Authors: Parth Mannan, Ananda Samajdar, Tushar Krishna

Abstract: Recent advancements in machine learning algorithms, especially the development of Deep Neural Networks (DNNs) have transformed the landscape of Artificial Intelligence (AI). With every passing day, deep learning based methods are applied to solve new problems with exceptional results. The portal to the real world is the edge. The true impact of AI can only be fully realized if we can have AI agent… ▽ More Recent advancements in machine learning algorithms, especially the development of Deep Neural Networks (DNNs) have transformed the landscape of Artificial Intelligence (AI). With every passing day, deep learning based methods are applied to solve new problems with exceptional results. The portal to the real world is the edge. The true impact of AI can only be fully realized if we can have AI agents continuously interacting with the real world and solving everyday problems. Unfortunately, high compute and memory requirements of DNNs acts a huge barrier towards this vision. Today we circumvent this problem by deploying special purpose inference hardware on the edge while procuring trained models from the cloud. This approach, however, relies on constant interaction with the cloud for transmitting all the data, training on massive GPU clusters, and downloading updated models. This is challenging for bandwidth, privacy, and constant connectivity concerns that autonomous agents may exhibit. In this paper we evaluate techniques for enabling adaptive intelligence on edge devices with zero interaction with any high-end cloud/server. We build a prototype distributed system of Raspberry Pis communicating via WiFi running NeuroEvolutionary (NE) learning and inference. We evaluate the performance of such a collaborative system and detail the compute/communication characteristics of different arrangements of the system that trade-off parallelism versus communication. Using insights from our analysis, we also propose algorithmic modifications to reduce communication by up to 3.6x during the learning phase to enhance scalability even further and match performance of higher end computing devices at scale. We believe that these insights will enable algorithm-hardware co-design efforts for enabling continuous learning on the edge. △ Less

Submitted 26 August, 2020; originally announced August 2020.

Comments: Accepted and appears in ISPASS 2020

arXiv:2008.08289 [pdf, other]

Restructuring, Pruning, and Adjustment of Deep Models for Parallel Distributed Inference

Authors: Afshin Abdi, Saeed Rashidi, Faramarz Fekri, Tushar Krishna

Abstract: Using multiple nodes and parallel computing algorithms has become a principal tool to improve training and execution times of deep neural networks as well as effective collective intelligence in sensor networks. In this paper, we consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a. workers) where the deep model is divided into several parallel… ▽ More Using multiple nodes and parallel computing algorithms has become a principal tool to improve training and execution times of deep neural networks as well as effective collective intelligence in sensor networks. In this paper, we consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a. workers) where the deep model is divided into several parallel sub-models, each of which is executed by a worker. Since latency due to synchronization and data transfer among workers negatively impacts the performance of the parallel implementation, it is desirable to have minimum interdependency among parallel sub-models. To achieve this goal, we propose to rearrange the neurons in the neural network and partition them (without changing the general topology of the neural network), such that the interdependency among sub-models is minimized under the computations and communications constraints of the workers. We propose RePurpose, a layer-wise model restructuring and pruning technique that guarantees the performance of the overall parallelized model. To efficiently apply RePurpose, we propose an approach based on $\ell_0$ optimization and the Munkres assignment algorithm. We show that, compared to the existing methods, RePurpose significantly improves the efficiency of the distributed inference via parallel implementation, both in terms of communication and computational complexity. △ Less

Submitted 19 August, 2020; originally announced August 2020.

arXiv:2008.06741 [pdf, other]

Breaking Barriers: Maximizing Array Utilization for Compute In-Memory Fabrics

Authors: Brian Crafton, Samuel Spetalnick, Gauthaman Murali, Tushar Krishna, Sung-Kyu Lim, Arijit Raychowdhury

Abstract: Compute in-memory (CIM) is a promising technique that minimizes data transport, the primary performance bottleneck and energy cost of most data intensive applications. This has found wide-spread adoption in accelerating neural networks for machine learning applications. Utilizing a crossbar architecture with emerging non-volatile memories (eNVM) such as dense resistive random access memory (RRAM)… ▽ More Compute in-memory (CIM) is a promising technique that minimizes data transport, the primary performance bottleneck and energy cost of most data intensive applications. This has found wide-spread adoption in accelerating neural networks for machine learning applications. Utilizing a crossbar architecture with emerging non-volatile memories (eNVM) such as dense resistive random access memory (RRAM) or phase change random access memory (PCRAM), various forms of neural networks can be implemented to greatly reduce power and increase on chip memory capacity. However, compute in-memory faces its own limitations at both the circuit and the device levels. Although compute in-memory using the crossbar architecture can greatly reduce data transport, the rigid nature of these large fixed weight matrices forfeits the flexibility of traditional CMOS and SRAM based designs. In this work, we explore the different synchronization barriers that occur from the CIM constraints. Furthermore, we propose a new allocation algorithm and data flow based on input data distributions to maximize utilization and performance for compute-in memory based designs. We demonstrate a 7.47$\times$ performance improvement over a naive allocation method for CIM accelerators on ResNet18. △ Less

Submitted 15 August, 2020; originally announced August 2020.

Comments: 6 pages, 9 figures, conference paper

arXiv:2007.03152 [pdf, other]

The gem5 Simulator: Version 20.0+

Authors: Jason Lowe-Power, Abdul Mutaal Ahmad, Ayaz Akram, Mohammad Alian, Rico Amslinger, Matteo Andreozzi, Adrià Armejach, Nils Asmussen, Brad Beckmann, Srikant Bharadwaj, Gabe Black, Gedare Bloom, Bobby R. Bruce, Daniel Rodrigues Carvalho, Jeronimo Castrillon, Lizhong Chen, Nicolas Derumigny, Stephan Diestelhorst, Wendy Elsasser, Carlos Escuin, Marjan Fariborz, Amin Farmahini-Farahani, Pouya Fotouhi, Ryan Gambord, Jayneel Gandhi , et al. (53 additional authors not shown)

Abstract: The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research. This simulation infrastructure allows researchers to model modern computer hardware at the cycle level, and it has enough fidelity to boot unmodified Linux-based operating systems and run full applications for multiple architectures including x86, Arm, and RISC-V. The gem5 si… ▽ More The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research. This simulation infrastructure allows researchers to model modern computer hardware at the cycle level, and it has enough fidelity to boot unmodified Linux-based operating systems and run full applications for multiple architectures including x86, Arm, and RISC-V. The gem5 simulator has been under active development over the last nine years since the original gem5 release. In this time, there have been over 7500 commits to the codebase from over 250 unique contributors which have improved the simulator by adding new features, fixing bugs, and increasing the code quality. In this paper, we give and overview of gem5's usage and features, describe the current state of the gem5 simulator, and enumerate the major changes since the initial release of gem5. We also discuss how the gem5 simulator has transitioned to a formal governance model to enable continued improvement and community support for the next 20 years of computer architecture research. △ Less

Submitted 29 September, 2020; v1 submitted 6 July, 2020; originally announced July 2020.

Comments: Source, comments, and feedback: https://github.com/darchr/gem5-20-paper

arXiv:2007.00156 [pdf, other]

doi 10.1109/ISCA52012.2021.00049

Enabling Compute-Communication Overlap in Distributed Deep Learning Training Platforms

Authors: Saeed Rashidi, Matthew Denton, Srinivas Sridharan, Sudarshan Srinivasan, Amoghavarsha Suresh, Jade Ni, Tushar Krishna

Abstract: Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators (e.g., GPU/TPU) via fast, customized interconnects with 100s of gigabytes (GBs) of bandwidth. However, as we identify in this work, driving this bandwidth is quite challenging. This is because there is a pernicious balance between using the accelerator's compute and memory for both DL computations and commu… ▽ More Deep Learning (DL) training platforms are built by interconnecting multiple DL accelerators (e.g., GPU/TPU) via fast, customized interconnects with 100s of gigabytes (GBs) of bandwidth. However, as we identify in this work, driving this bandwidth is quite challenging. This is because there is a pernicious balance between using the accelerator's compute and memory for both DL computations and communication. This work makes two key contributions. First, via real system measurements and detailed modeling, we provide an understanding of compute and memory bandwidth demands for DL compute and comms. Second, we propose a novel DL collective communication accelerator called Accelerator Collectives Engine (ACE) that sits alongside the compute and networking engines at the accelerator endpoint. ACE frees up the endpoint's compute and memory resources for DL compute, which in turn reduces the required memory BW by 3.5X on average to drive the same network BW compared to state-of-the-art baselines. For modern DL workloads and different network sizes, ACE, on average, increases the effective network bandwidth utilization by 1.44X (up to 2.67X), resulting in an average of 1.41X (up to 1.51X), 1.12X (up to 1.17X), and 1.13X (up to 1.19X) speedup in iteration time for ResNet-50, GNMT and DLRM when compared to the best baseline configuration, respectively. △ Less

Submitted 4 May, 2022; v1 submitted 30 June, 2020; originally announced July 2020.

arXiv:2006.07137 [pdf, other]

STONNE: A Detailed Architectural Simulator for Flexible Neural Network Accelerators

Authors: Francisco Muñoz-Martínez, José L. Abellán, Manuel E. Acacio, Tushar Krishna

Abstract: The design of specialized architectures for accelerating the inference procedure of Deep Neural Networks (DNNs) is a booming area of research nowadays. First-generation rigid proposals have been rapidly replaced by more advanced flexible accelerator architectures able to efficiently support a variety of layer types and dimensions. As the complexity of the designs grows, it is more and more appeali… ▽ More The design of specialized architectures for accelerating the inference procedure of Deep Neural Networks (DNNs) is a booming area of research nowadays. First-generation rigid proposals have been rapidly replaced by more advanced flexible accelerator architectures able to efficiently support a variety of layer types and dimensions. As the complexity of the designs grows, it is more and more appealing for researchers to have cycle-accurate simulation tools at their disposal to allow for fast and accurate design-space exploration, and rapid quantification of the efficacy of architectural enhancements during the early stages of a design. To this end, we present STONNE (Simulation TOol of Neural Network Engines), a cycle-accurate, highly-modular and highly-extensible simulation framework that enables end-to-end evaluation of flexible accelerator architectures running complete contemporary DNN models. We use STONNE to model the recently proposed MAERI architecture and show how it can closely approach the performance results of the publicly available BSV-coded MAERI implementation. Then, we conduct a comprehensive evaluation and demonstrate that the folding strategy implemented for MAERI results in very low compute unit utilization (25% on average across 5 DNN models) which in the end translates into poor performance. △ Less

Submitted 10 June, 2020; originally announced June 2020.

arXiv:2006.03969 [pdf, other]

Conditional Neural Architecture Search

Authors: Sheng-Chun Kao, Arun Ramamurthy, Reed Williams, Tushar Krishna

Abstract: Designing resource-efficient Deep Neural Networks (DNNs) is critical to deploy deep learning solutions over edge platforms due to diverse performance, power, and memory budgets. Unfortunately, it is often the case a well-trained ML model does not fit to the constraint of deploying edge platforms, causing a long iteration of model reduction and retraining process. Moreover, a ML model optimized for… ▽ More Designing resource-efficient Deep Neural Networks (DNNs) is critical to deploy deep learning solutions over edge platforms due to diverse performance, power, and memory budgets. Unfortunately, it is often the case a well-trained ML model does not fit to the constraint of deploying edge platforms, causing a long iteration of model reduction and retraining process. Moreover, a ML model optimized for platform-A often may not be suitable when we deploy it on another platform-B, causing another iteration of model retraining. We propose a conditional neural architecture search method using GAN, which produces feasible ML models for different platforms. We present a new workflow to generate constraint-optimized DNN models. This is the first work of bringing in condition and adversarial technique into Neural Architecture Search domain. We verify the method with regression problems and classification on CIFAR-10. The proposed workflow can successfully generate resource-optimized MLP or CNN-based networks. △ Less

Submitted 6 June, 2020; originally announced June 2020.

arXiv:2006.03968 [pdf, other]

Generative Design of Hardware-aware DNNs

Authors: Sheng-Chun Kao, Arun Ramamurthy, Tushar Krishna

Abstract: To efficiently run DNNs on the edge/cloud, many new DNN inference accelerators are being designed and deployed frequently. To enhance the resource efficiency of DNNs, model quantization is a widely-used approach. However, different accelerator/HW has different resources leading to the need for specialized quantization strategy of each HW. Moreover, using the same quantization for every layer may b… ▽ More To efficiently run DNNs on the edge/cloud, many new DNN inference accelerators are being designed and deployed frequently. To enhance the resource efficiency of DNNs, model quantization is a widely-used approach. However, different accelerator/HW has different resources leading to the need for specialized quantization strategy of each HW. Moreover, using the same quantization for every layer may be sub-optimal, increasing the designspace of possible quantization choices. This makes manual-tuning infeasible. Recent work in automatically determining quantization for each layer is driven by optimization methods such as reinforcement learning. However, these approaches need re-training the RL for every new HW platform. We propose a new way for autonomous quantization and HW-aware tuning. We propose a generative model, AQGAN, which takes a target accuracy as the condition and generates a suite of quantization configurations. With the conditional generative model, the user can autonomously generate different configurations with different targets in inference time. Moreover, we propose a simplified HW-tuning flow, which uses the generative model to generate proposals and execute simple selection based on the HW resource budget, whose process is fast and interactive. We evaluate our model on five of the widely-used efficient models on the ImageNet dataset. We compare with existing uniform quantization and state-of-the-art autonomous quantization methods. Our generative model shows competitive achieved accuracy, however, with around two degrees less search cost for each design point. Our generative model shows the generated quantization configuration can lead to less than 3.5% error across all experiments. △ Less

Submitted 12 July, 2020; v1 submitted 6 June, 2020; originally announced June 2020.

arXiv:2006.00454 [pdf, other]

On the fluidic behavior of an over-expanded planar plug nozzle under lateral confinement

Authors: M. Chaudhary, T. V. Krishna, Sowmya R. Nanda, S. K. Karthick, A. Khan, A. De, S. Mohammed Ibrahim

Abstract: The present work aims to study the fluidic behavior on lateral confinement by placing side-walls on the planar plug nozzle through experiments. The study involves two cases of nozzle pressure ratio (NPR=3, 6), which correspond to over-expanded nozzle operating conditions. Steady-state pressure measurements, together with schlieren and surface oil flow visualization, reveal the presence of over-exp… ▽ More The present work aims to study the fluidic behavior on lateral confinement by placing side-walls on the planar plug nozzle through experiments. The study involves two cases of nozzle pressure ratio (NPR=3, 6), which correspond to over-expanded nozzle operating conditions. Steady-state pressure measurements, together with schlieren and surface oil flow visualization, reveal the presence of over-expansion shock and subsequent interaction and modification of the flow field on the plug surface. The flow remains attached to the plug surface for NPR=3; whereas, for NPR=6, a separated flow field with a recirculation bubble is observed. Spectral analysis of the unsteady pressure signals illustrates a clear difference between the attached and the separated flow. Besides, other flow features with a distinct temporal mode associated with and without lateral confinement are observed. The absence of lateral confinement reduces the intensity of low-frequency unsteadiness; however, on the contrary, the interaction region is relatively reduced under lateral confinement. △ Less

Submitted 2 August, 2020; v1 submitted 31 May, 2020; originally announced June 2020.

Comments: 14 pages, 12 figures, Revision submitted to Phys. Fluids

arXiv:2002.07752 [pdf, other]

Marvel: A Data-centric Compiler for DNN Operators on Spatial Accelerators

Authors: Prasanth Chatarasi, Hyoukjun Kwon, Natesh Raina, Saurabh Malik, Vaisakh Haridas, Angshuman Parashar, Michael Pellauer, Tushar Krishna, Vivek Sarkar

Abstract: The efficiency of a spatial DNN accelerator depends heavily on the compiler and its cost model ability to generate optimized map**s for various operators of DNN models on to the accelerator's compute and memory resources. But, existing cost models lack a formal boundary over the operators for precise and tractable analysis, which poses adaptability challenges for new DNN operators. To address th… ▽ More The efficiency of a spatial DNN accelerator depends heavily on the compiler and its cost model ability to generate optimized map**s for various operators of DNN models on to the accelerator's compute and memory resources. But, existing cost models lack a formal boundary over the operators for precise and tractable analysis, which poses adaptability challenges for new DNN operators. To address this challenge, we leverage the recently introduced Maestro Data-Centric (MDC) notation. We develop a formal understanding of DNN operators whose map**s can be described in the MDC notation, because any map** adhering to the notation is always analyzable by the MDC's cost model. Furthermore, we introduce a transformation for translating map**s into the MDC notation for exploring the map** space. Searching for the optimal map**s is challenging because of the large space of map**s, and this challenge gets exacerbated with new operators and diverse accelerator configurations.To address this challenge, we propose a decoupled off-chip/on-chip approach that decomposes the map** space into off-chip and on-chip subspaces, and first optimizes the off-chip subspace followed by the on-chip subspace. The motivation for this decomposition is to reduce the size of the search space dramatically and also to prioritize the optimization of off-chip data movement, which is 2-3 orders of magnitude more compared to the on-chip data movement. We implemented our approach in a tool called {\em Marvel}, and another major benefit of our approach is that it is applicable to any DNN operator conformable with the MDC notation. △ Less

Submitted 11 June, 2020; v1 submitted 18 February, 2020; originally announced February 2020.

arXiv:2002.04116 [pdf, ps, other]

Co-Exploration of Neural Architectures and Heterogeneous ASIC Accelerator Designs Targeting Multiple Tasks

Authors: Lei Yang, Zheyu Yan, Meng Li, Hyoukjun Kwon, Liangzhen Lai, Tushar Krishna, Vikas Chandra, Weiwen Jiang, Yiyu Shi

Abstract: Neural Architecture Search (NAS) has demonstrated its power on various AI accelerating platforms such as Field Programmable Gate Arrays (FPGAs) and Graphic Processing Units (GPUs). However, it remains an open problem, how to integrate NAS with Application-Specific Integrated Circuits (ASICs), despite them being the most powerful AI accelerating platforms. The major bottleneck comes from the large… ▽ More Neural Architecture Search (NAS) has demonstrated its power on various AI accelerating platforms such as Field Programmable Gate Arrays (FPGAs) and Graphic Processing Units (GPUs). However, it remains an open problem, how to integrate NAS with Application-Specific Integrated Circuits (ASICs), despite them being the most powerful AI accelerating platforms. The major bottleneck comes from the large design freedom associated with ASIC designs. Moreover, with the consideration that multiple DNNs will run in parallel for different workloads with diverse layer operations and sizes, integrating heterogeneous ASIC sub-accelerators for distinct DNNs in one design can significantly boost performance, and at the same time further complicate the design space. To address these challenges, in this paper we build ASIC template set based on existing successful designs, described by their unique dataflows, so that the design space is significantly reduced. Based on the templates, we further propose a framework, namely NASAIC, which can simultaneously identify multiple DNN architectures and the associated heterogeneous ASIC accelerator design, such that the design specifications (specs) can be satisfied, while the accuracy can be maximized. Experimental results show that compared with successive NAS and ASIC design optimizations which lead to design spec violations, NASAIC can guarantee the results to meet the design specs with 17.77%, 2.49x, and 2.32x reductions on latency, energy, and area and with 0.76% accuracy loss. To the best of the authors' knowledge, this is the first work on neural architecture and ASIC accelerator design co-exploration. △ Less

Submitted 10 February, 2020; originally announced February 2020.

Comments: Accepted by DAC'20

arXiv:1912.01664 [pdf, other]

Understanding the Impact of On-chip Communication on DNN Accelerator Performance

Authors: Robert Guirado, Hyoukjun Kwon, Eduard Alarcón, Sergi Abadal, Tushar Krishna

Abstract: Deep Neural Networks have flourished at an unprecedented pace in recent years. They have achieved outstanding accuracy in fields such as computer vision, natural language processing, medicine or economics. Specifically, Convolutional Neural Networks (CNN) are particularly suited to object recognition or identification tasks. This, however, comes at a high computational cost, prompting the use of s… ▽ More Deep Neural Networks have flourished at an unprecedented pace in recent years. They have achieved outstanding accuracy in fields such as computer vision, natural language processing, medicine or economics. Specifically, Convolutional Neural Networks (CNN) are particularly suited to object recognition or identification tasks. This, however, comes at a high computational cost, prompting the use of specialized GPU architectures or even ASICs to achieve high speeds and energy efficiency. ASIC accelerators streamline the execution of certain dataflows amenable to CNN computation that imply the constant movement of large amounts of data, thereby turning on-chip communication into a critical function within the accelerator. This paper studies the communication flows within CNN inference accelerators of edge devices, with the aim to justify current and future decisions in the design of the on-chip networks that interconnect their processing elements. Leveraging this analysis, we then qualitatively discuss the potential impact of introducing the novel paradigm of wireless on-chip network in this context. △ Less

Submitted 3 December, 2019; originally announced December 2019.

Comments: ICECS2019

arXiv:1909.07437 [pdf, other]

Heterogeneous Dataflow Accelerators for Multi-DNN Workloads

Authors: Hyoukjun Kwon, Liangzhen Lai, Michael Pellauer, Tushar Krishna, Yu-Hsin Chen, Vikas Chandra

Abstract: Emerging AI-enabled applications such as augmented/virtual reality (AR/VR) leverage multiple deep neural network (DNN) models for sub-tasks such as object detection, hand tracking, and so on. Because of the diversity of the sub-tasks, the layers within and across the DNN models are highly heterogeneous in operation and shape. Such layer heterogeneity is a challenge for a fixed dataflow accelerator… ▽ More Emerging AI-enabled applications such as augmented/virtual reality (AR/VR) leverage multiple deep neural network (DNN) models for sub-tasks such as object detection, hand tracking, and so on. Because of the diversity of the sub-tasks, the layers within and across the DNN models are highly heterogeneous in operation and shape. Such layer heterogeneity is a challenge for a fixed dataflow accelerator (FDA) that employs a fixed dataflow on a single accelerator substrate since each layer prefers different dataflows (computation order and parallelization) and tile sizes. Reconfigurable DNN accelerators (RDAs) have been proposed to adapt their dataflows to diverse layers to address the challenge. However, the dataflow flexibility in RDAs is enabled at the area and energy costs of expensive hardware structures (switches, controller, etc.) and per-layer reconfiguration. Alternatively, this work proposes a new class of accelerators, heterogeneous dataflow accelerators (HDAs), which deploys multiple sub-accelerators each supporting a different dataflow. HDAs enable coarser-grained dataflow flexibility than RDAs with higher energy efficiency and lower area cost comparable to FDAs. To exploit such benefits, hardware resource partitioning across sub-accelerators and layer execution schedule need to be carefully optimized. Therefore, we also present Herald, which co-optimizes hardware partitioning and layer execution schedule. Using Herald on a suite of AR/VR and MLPerf workloads, we identify a promising HDA architecture, Maelstrom, which demonstrates 65.3% lower latency and 5.0% lower energy than the best FDAs and 22.0% lower energy at the cost of 20.7% higher latency than a state-of-the-art RDA. The results suggest that HDA is an alternative class of Pareto-optimal accelerators to RDA with strength in energy, which can be a better choice than RDAs depending on the use cases. △ Less

Submitted 16 December, 2020; v1 submitted 13 September, 2019; originally announced September 2019.

Comments: This paper is accepted at HPCA 2021

arXiv:1908.04484 [pdf, other]

doi 10.1145/3313231.335236

Reinforcement Learning based Interconnection Routing for Adaptive Traffic Optimization

Authors: Sheng-Chun Kao, Chao-Han Huck Yang, Pin-Yu Chen, Xiaoli Ma, Tushar Krishna

Abstract: Applying Machine Learning (ML) techniques to design and optimize computer architectures is a promising research direction. Optimizing the runtime performance of a Network-on-Chip (NoC) necessitates a continuous learning framework. In this work, we demonstrate the promise of applying reinforcement learning (RL) to optimize NoC runtime performance. We present three RL-based methods for learning opti… ▽ More Applying Machine Learning (ML) techniques to design and optimize computer architectures is a promising research direction. Optimizing the runtime performance of a Network-on-Chip (NoC) necessitates a continuous learning framework. In this work, we demonstrate the promise of applying reinforcement learning (RL) to optimize NoC runtime performance. We present three RL-based methods for learning optimal routing algorithms. The experimental results show the algorithms can successfully learn a near-optimal solution across different environment states. Reproducible Code: github.com/huckiyang/interconnect-routing-gym △ Less

Submitted 13 August, 2019; originally announced August 2019.

arXiv:1811.02883 [pdf, other]

SCALE-Sim: Systolic CNN Accelerator Simulator

Authors: Ananda Samajdar, Yuhao Zhu, Paul Whatmough, Matthew Mattina, Tushar Krishna

Abstract: Systolic Arrays are one of the most popular compute substrates within Deep Learning accelerators today, as they provide extremely high efficiency for running dense matrix multiplications. However, the research community lacks tools to insights on both the design trade-offs and efficient map** strategies for systolic-array based accelerators. We introduce Systolic CNN Accelerator Simulator (SCALE… ▽ More Systolic Arrays are one of the most popular compute substrates within Deep Learning accelerators today, as they provide extremely high efficiency for running dense matrix multiplications. However, the research community lacks tools to insights on both the design trade-offs and efficient map** strategies for systolic-array based accelerators. We introduce Systolic CNN Accelerator Simulator (SCALE-Sim), which is a configurable systolic array based cycle accurate DNN accelerator simulator. SCALE-Sim exposes various micro-architectural features as well as system integration parameters to the designer to enable comprehensive design space exploration. This is the first systolic-array simulator tuned for running DNNs to the best of our knowledge. Using SCALE-Sim, we conduct a suite of case studies and demonstrate the effect of bandwidth, data flow and aspect ratio on the overall runtime and energy of Deep Learning kernels across vision, speech, text, and games. We believe that these insights will be highly beneficial to architects and ML practitioners. △ Less

Submitted 1 February, 2019; v1 submitted 16 October, 2018; originally announced November 2018.

arXiv:1808.01363 [pdf, other]

GeneSys: Enabling Continuous Learning through Neural Network Evolution in Hardware

Authors: Ananda Samajdar, Parth Mannan, Kartikay Garg, Tushar Krishna

Abstract: Modern deep learning systems rely on (a) a hand-tuned neural network topology, (b) massive amounts of labeled training data, and (c) extensive training over large-scale compute resources to build a system that can perform efficient image classification or speech recognition. Unfortunately, we are still far away from implementing adaptive general purpose intelligent systems which would need to lear… ▽ More Modern deep learning systems rely on (a) a hand-tuned neural network topology, (b) massive amounts of labeled training data, and (c) extensive training over large-scale compute resources to build a system that can perform efficient image classification or speech recognition. Unfortunately, we are still far away from implementing adaptive general purpose intelligent systems which would need to learn autonomously in unknown environments and may not have access to some or any of these three components. Reinforcement learning and evolutionary algorithm (EA) based methods circumvent this problem by continuously interacting with the environment and updating the models based on obtained rewards. However, deploying these algorithms on ubiquitous autonomous agents at the edge (robots/drones) demands extremely high energy-efficiency due to (i) tight power and energy budgets, (ii) continuous/lifelong interaction with the environment, (iii) intermittent or no connectivity to the cloud to run heavy-weight processing. To address this need, we present GENESYS, an HW-SW prototype of an EA-based learning system, that comprises a closed loop learning engine called EvE and an inference engine called ADAM. EvE can evolve the topology and weights of neural networks completely in hardware for the task at hand, without requiring hand-optimization or backpropagation training. ADAM continuously interacts with the environment and is optimized for efficiently running the irregular neural networks generated by EvE. GENESYS identifies and leverages multiple unique avenues of parallelism unique to EAs that we term 'gene'- level parallelism, and 'population'-level parallelism. We ran GENESYS with a suite of environments from OpenAI gym and observed 2-5 orders of magnitude higher energy-efficiency over state-of-the-art embedded and desktop CPU and GPU systems. △ Less

Submitted 13 September, 2018; v1 submitted 3 August, 2018; originally announced August 2018.

Comments: This work is accepted and will appear in MICRO-51

arXiv:1805.02566 [pdf, other]

Understanding Reuse, Performance, and Hardware Cost of DNN Dataflows: A Data-Centric Approach Using MAESTRO

Authors: Hyoukjun Kwon, Prasanth Chatarasi, Michael Pellauer, Angshuman Parashar, Vivek Sarkar, Tushar Krishna

Abstract: The data partitioning and scheduling strategies used by DNN accelerators to leverage reuse and perform staging are known as dataflow, and they directly impact the performance and energy efficiency of DNN accelerator designs. An accelerator microarchitecture dictates the dataflow(s) that can be employed to execute a layer or network. Selecting an optimal dataflow for a layer shape can have a large… ▽ More The data partitioning and scheduling strategies used by DNN accelerators to leverage reuse and perform staging are known as dataflow, and they directly impact the performance and energy efficiency of DNN accelerator designs. An accelerator microarchitecture dictates the dataflow(s) that can be employed to execute a layer or network. Selecting an optimal dataflow for a layer shape can have a large impact on utilization and energy efficiency, but there is a lack of understanding on the choices and consequences of dataflows, and of tools and methodologies to help architects explore the co-optimization design space. In this work, we first introduce a set of data-centric directives to concisely specify the space of DNN dataflows in a compilerfriendly form. We then show how these directives can be analyzed to infer various forms of reuse and to exploit them using hardware capabilities. We codify this analysis into an analytical cost model, MAESTRO (Modeling Accelerator Efficiency via Spatio-Temporal Reuse and Occupancy), that estimates various cost-benefit tradeoffs of a dataflow including execution time and energy efficiency for a DNN model and hardware configuration. We demonstrate the use of MAESTRO to drive a hardware design space exploration (DSE) experiment, which searches across 480M designs to identify 2.5M valid designs at an average rate of 0.17M designs per second, including Pareto-optimal throughput- and energy-optimized design points. △ Less

Submitted 11 May, 2020; v1 submitted 4 May, 2018; originally announced May 2018.

arXiv:1707.05399 [pdf, other]

doi 10.1109/ISPASS.2018.00018

Performance Implications of NoCs on 3D-Stacked Memories: Insights from the Hybrid Memory Cube

Authors: Ramyad Hadidi, Bahar Asgari, Jeffrey Young, Burhan Ahmad Mudassar, Kartikay Garg, Tushar Krishna, Hyesoon Kim

Abstract: Memories that exploit three-dimensional (3D)-stacking technology, which integrate memory and logic dies in a single stack, are becoming popular. These memories, such as Hybrid Memory Cube (HMC), utilize a network-on-chip (NoC) design for connecting their internal structural organizations. This novel usage of NoC, in addition to aiding processing-in-memory capabilities, enables numerous benefits su… ▽ More Memories that exploit three-dimensional (3D)-stacking technology, which integrate memory and logic dies in a single stack, are becoming popular. These memories, such as Hybrid Memory Cube (HMC), utilize a network-on-chip (NoC) design for connecting their internal structural organizations. This novel usage of NoC, in addition to aiding processing-in-memory capabilities, enables numerous benefits such as high bandwidth and memory-level parallelism. However, the implications of NoCs on the characteristics of 3D-stacked memories in terms of memory access latency and bandwidth have not been fully explored. This paper addresses this knowledge gap by (i) characterizing an HMC prototype on the AC-510 accelerator board and revealing its access latency behaviors, and (ii) by investigating the implications of such behaviors on system and software designs. △ Less

Submitted 13 February, 2018; v1 submitted 17 July, 2017; originally announced July 2017.

Journal ref: 2018 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

arXiv:1702.02313 [pdf, other]

FASHION: Fault-Aware Self-Healing Intelligent On-chip Network

Authors: Pengju Ren, Michel A. Kinsy, Mengjiao Zhu, Shreeya Khadka, Mihailo Isakov, Aniruddh Ramrakhyani, Tushar Krishna, Nanning Zheng

Abstract: To avoid packet loss and deadlock scenarios that arise due to faults or power gating in multicore and many-core systems, the network-on-chip needs to possess resilient communication and load-balancing properties. In this work, we introduce the Fashion router, a self-monitoring and self-reconfiguring design that allows for the on-chip network to dynamically adapt to component failures. First, we in… ▽ More To avoid packet loss and deadlock scenarios that arise due to faults or power gating in multicore and many-core systems, the network-on-chip needs to possess resilient communication and load-balancing properties. In this work, we introduce the Fashion router, a self-monitoring and self-reconfiguring design that allows for the on-chip network to dynamically adapt to component failures. First, we introduce a distributed intelligence unit, called Self-Awareness Module (SAM), which allows the router to detect permanent component failures and build a network connectivity map. Using local information, SAM adapts to faults, guarantees connectivity and deadlock-free routing inside the maximal connected subgraph and keeps routing tables up-to-date. Next, to reconfigure network links or virtual channels around faulty/power-gated components, we add bidirectional link and unified virtual channel structure features to the Fashion router. This version of the router, named Ex-Fashion, further mitigates the negative system performance impacts, leads to larger maximal connected subgraph and sustains a relatively high degree of fault-tolerance. To support the router, we develop a fault diagnosis and recovery algorithm executed by the Built-In Self-Test, self-monitoring, and self-reconfiguration units at runtime to provide fault-tolerant system functionalities. The Fashion router places no restriction on topology, position or number of faults. It drops 54.3-55.4% fewer nodes for same number of faults (between 30 and 60 faults) in an 8x8 2D-mesh over other state-of-the-art solutions. It is scalable and efficient. The area overheads are 2.311% and 2.659% when implemented in 8x8 and 16x16 2D-meshes using the TSMC 65nm library at 1.38GHz clock frequency. △ Less

Submitted 8 February, 2017; originally announced February 2017.

Comments: 14 pages, 12 figures

arXiv:1701.03499 [pdf, other]

VESPA: VIPT Enhancements for Superpage Accesses

Authors: Mayank Parasar, Abhishek Bhattacharjee, Tushar Krishna

Abstract: L1 caches are critical to the performance of modern computer systems. Their design involves a delicate balance between fast lookups, high hit rates, low access energy, and simplicity of implementation. Unfortunately, constraints imposed by virtual memory make it difficult to satisfy all these attributes today. Specifically, the modern staple of supporting virtual-indexing and physical-tagging (VIP… ▽ More L1 caches are critical to the performance of modern computer systems. Their design involves a delicate balance between fast lookups, high hit rates, low access energy, and simplicity of implementation. Unfortunately, constraints imposed by virtual memory make it difficult to satisfy all these attributes today. Specifically, the modern staple of supporting virtual-indexing and physical-tagging (VIPT) for parallel TLB-L1 lookups means that L1 caches are usually grown with greater associativity rather than sets. This compromises performance -- by degrading access times without significantly boosting hit rates -- and increases access energy. We propose VIPT Enhancements for SuperPage Accesses or VESPA in response. VESPA side-steps the traditional problems of VIPT by leveraging the increasing ubiquity of superpages; since superpages have more page offset bits, they can accommodate L1 cache organizations with more sets than baseline pages can. VESPA dynamically adapts to any OS distribution of page sizes to operate L1 caches with good access times, hit rates, and energy, for both single- and multi-threaded workloads. Since the hardware changes are modest, and there are no OS or application changes, VESPA is readily-implementable. By superpages (also called huge or large pages) we refer to any page sizes supported by the architecture bigger than baseline page size. △ Less

Submitted 14 February, 2017; v1 submitted 12 January, 2017; originally announced January 2017.

MSC Class: 68-06

arXiv:1209.0257 [pdf, other]

Achieving Control of Lesion Growth in CNS with Minimal Damage

Authors: Mathankumar Raja, T. R. Krishna Mohan

Abstract: Lesions in central nervous system (CNS) and their growth leads to debilitating diseases like Multiple Sclerosis (MS), Alzheimer's etc. We developed a model earlier which shows how the lesion growth can be arrested through a beneficial auto-immune mechanism. The success of the approach depends on a set of control parameters and their phase space was shown to have a smooth manifold separating the un… ▽ More Lesions in central nervous system (CNS) and their growth leads to debilitating diseases like Multiple Sclerosis (MS), Alzheimer's etc. We developed a model earlier which shows how the lesion growth can be arrested through a beneficial auto-immune mechanism. The success of the approach depends on a set of control parameters and their phase space was shown to have a smooth manifold separating the uncontrolled lesion growth region from the controlled. Here we show that an optimal set of parameter values exist which minimizes system damage while achieving control of lesion growth. △ Less

Submitted 3 September, 2012; originally announced September 2012.

arXiv:1003.4658 [pdf, ps, other]

Earthquake Correlations and Networks- A Comparative Study

Authors: T. R. Krishna Mohan P. G., Revathi

Abstract: We quantify the correlation between earthquakes and use the same to distinguish between relevant causally connected earthquakes. Our correlation metric is a variation on the one introduced by Baiesi and Paczuski (2004). A network of earthquakes is constructed, which is time ordered and with links between the more correlated ones. Recurrences to earthquakes are identified employing correlation t… ▽ More We quantify the correlation between earthquakes and use the same to distinguish between relevant causally connected earthquakes. Our correlation metric is a variation on the one introduced by Baiesi and Paczuski (2004). A network of earthquakes is constructed, which is time ordered and with links between the more correlated ones. Recurrences to earthquakes are identified employing correlation thresholds to demarcate the most meaningful ones in each cluster. Data pertaining to three different seismic regions, viz. California, Japan and Himalayas, are comparatively analyzed using such a network model. The distribution of recurrence lengths and recurrence times are two of the key features analyzed to draw conclusions about the universal aspects of such a network model. We find that the unimodal feature of recurrence length distribution, which helps to associate typical rupture lengths with different magnitude earthquakes, is robust across the different seismic regions. The out-degree of the networks shows a hub structure rooted on the large magnitude earthquakes. In-degree distribution is seen to be dependent on the density of events in the neighborhood. Power laws, with two regimes having different exponents, are obtained with recurrence time distribution. This is in agreement with the Omori law for aftershocks and extends it to spatial recurrences. The crossover to the second power law regime can be taken to be signalling the end of aftershock regime in an objective fashion. △ Less

Submitted 24 March, 2010; originally announced March 2010.

Comments: 17 pages, 6 figures

arXiv:1003.4469 [pdf, ps, other]

doi 10.1007/s10950-010-9208-5

Network of Earthquakes and Recurrences Therein

Authors: T. R. Krishna Mohan, P. G. Revathi

Abstract: We quantify the correlation between earthquakes and use the same to distinguish between relevant causally connected earthquakes. Our correlation metric is a variation on the one introduced by Baiesi and Paczuski (2004). A network of earthquakes is constructed, which is time ordered and with links between the more correlated ones. Data pertaining to the California region has been used in the study.… ▽ More We quantify the correlation between earthquakes and use the same to distinguish between relevant causally connected earthquakes. Our correlation metric is a variation on the one introduced by Baiesi and Paczuski (2004). A network of earthquakes is constructed, which is time ordered and with links between the more correlated ones. Data pertaining to the California region has been used in the study. Recurrences to earthquakes are identified employing correlation thresholds to demarcate the most meaningful ones in each cluster. The distribution of recurrence lengths and recurrence times are analyzed subsequently to extract information about the complex dynamics. We find that the unimodal feature of recurrence lengths helps to associate typical rupture lengths with different magnitude earthquakes. The out-degree of the network shows a hub structure rooted on the large magnitude earthquakes. In-degree distribution is seen to be dependent on the density of events in the neighborhood. Power laws are also obtained with recurrence time distribution agreeing with the Omori law. △ Less

Submitted 23 March, 2010; originally announced March 2010.

Comments: 17 pages, 5 figures

arXiv:1001.4258 [pdf, ps, other]

doi 10.1142/9789814355391_0006

Network of Recurrent events - A case study of Japan

Authors: P. G. Revathi, T. R. Krishnamohan

Abstract: A recently proposed method of constructing seismic networks from 'record breaking events' from the earthquake catalog of California (Phy. Rev. E, 77 6,066104, 2008) was successfull in establishing causal features to seismicity and arrive at estimates for rupture length and its scaling with magnitude. The results of our implementation of this procedure on the earthquake catalog of Japan establish… ▽ More A recently proposed method of constructing seismic networks from 'record breaking events' from the earthquake catalog of California (Phy. Rev. E, 77 6,066104, 2008) was successfull in establishing causal features to seismicity and arrive at estimates for rupture length and its scaling with magnitude. The results of our implementation of this procedure on the earthquake catalog of Japan establishes the robustness of the procedure. Additionally, we find that the temporal distributions are able to detect heterogeneties in the seismicity of the region. △ Less

Submitted 24 January, 2010; originally announced January 2010.

Comments: 13 pages, 6 figures, 1 table

arXiv:0708.3271 [pdf, ps, other]

Simulation of Spread and Control of Lesions in Brain

Authors: T. R. Krishna Mohan

Abstract: A simulation model for the spread and control of lesions in the brain is constructed using a planar network (graph) representation for the Central Nervous System (CNS). The model is inspired by the lesion structures observed in the case of Multiple Sclerosis (MS), a chronic disease of the CNS. The initial lesion site is at the center of a unit square and spreads outwards based on the success rat… ▽ More A simulation model for the spread and control of lesions in the brain is constructed using a planar network (graph) representation for the Central Nervous System (CNS). The model is inspired by the lesion structures observed in the case of Multiple Sclerosis (MS), a chronic disease of the CNS. The initial lesion site is at the center of a unit square and spreads outwards based on the success rate in damaging edges (axons) of the network. The damaged edges send out alarm signals which, at appropriate intensity levels, generate programmed cell death. Depending on the extent and timing of the programmed cell death, the lesion may get controlled or aggravated akin to the control of wild fires by burning of peripheral vegetation. The parameter phase space of the model shows smooth transition from uncontrolled situation to controlled situation. The simulations show that the model is capable of generating a wide variety of lesion growth and arrest scenarios. △ Less

Submitted 24 August, 2007; originally announced August 2007.

Comments: 5 pages, 3 postscript figures, submitted for publication

Showing 51–79 of 79 results for author: Krishna, T