
A Fully-Configurable Open-Source Software-Defined Digital Quantized Spiking Neural Core Architecture

Authors: Shadi Matinizadeh, Noah Pacik-Nelson, Ioannis Polykretis, Krupa Tishbi, Suman Kumar, M. L. Varshika, Arghavan Mohammadhassani, Abhishek Mishra, Nagarajan Kandasamy, James Shackleford, Eric Gallo, Anup Das

Abstract: We introduce QUANTISENC, a fully configurable open-source software-defined digital quantized spiking neural core architecture to advance research in neuromorphic computing. QUANTISENC is designed hierarchically using a bottom-up methodology with multiple neurons in each layer and multiple layers in each core. The number of layers and neurons per layer can be configured via software in a top-down methodology to generate the hardware for a target spiking neural network (SNN) model. QUANTISENC uses leaky integrate-and-fire (LIF) neurons and current-based (CUBA) excitatory and inhibitory synapses. The nonlinear dynamics of a neuron can be configured at run-time by programming its internal control registers. Each neuron performs signed fixed-point arithmetic with user-defined quantization and decimal precision. QUANTISENC supports all-to-all, one-to-one, and Gaussian connections between layers. Its hardware-software interface is integrated with a PyTorch-based SNN simulator. This integration allows users to define and train an SNN model in PyTorch and evaluate the hardware performance (e.g., area, power, latency, and throughput) through FPGA prototyping and ASIC design. The hardware-software interface also takes advantage of the layer-based architecture and distributed memory organization of QUANTISENC to enable pipelining by overlapping computations on streaming data. Overall, the proposed software-defined hardware design methodology offers flexibility similar to that of high-level synthesis (HLS), but provides better hardware performance with zero hardware development effort. We evaluate QUANTISENC using three spiking datasets and show its superior performance against state-of-the-art designs.

Submitted 2 April, 2024; originally announced April 2024.
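
The QUANTISENC abstract above mentions LIF neurons that perform signed fixed-point arithmetic with a user-defined precision. The following is a minimal sketch of what one such update step could look like, not the QUANTISENC implementation: the function names (to_fixed, saturate, lif_step), the single combined input current, the multiplicative leak, and the saturation behavior are all assumptions made for illustration.

```python
def to_fixed(x, frac_bits):
    """Quantize a real value to a signed fixed-point integer (assumed encoding)."""
    return int(round(x * (1 << frac_bits)))


def saturate(v, total_bits):
    """Clamp an integer to the range of a signed total_bits-wide word."""
    lo = -(1 << (total_bits - 1))
    hi = (1 << (total_bits - 1)) - 1
    return max(lo, min(hi, v))


def lif_step(v, i_syn, decay, v_th, v_reset, total_bits, frac_bits):
    """One discrete-time LIF update; all values are fixed-point integers.

    Returns (new_membrane_potential, spike) where spike is 0 or 1.
    """
    # Leak: multiply by the decay factor, then rescale the product
    # back to frac_bits fractional bits (arithmetic right shift).
    leaked = (v * decay) >> frac_bits
    # Integrate the input current and saturate to the word width.
    v_new = saturate(leaked + i_syn, total_bits)
    # Fire and reset when the threshold is reached.
    if v_new >= v_th:
        return v_reset, 1
    return v_new, 0


if __name__ == "__main__":
    FRAC, TOTAL = 4, 8  # hypothetical Q3.4 format in an 8-bit word
    v, spikes = 0, []
    for _ in range(4):
        v, s = lif_step(v, to_fixed(0.75, FRAC), to_fixed(0.5, FRAC),
                        to_fixed(1.0, FRAC), 0, TOTAL, FRAC)
        spikes.append(s)
    print(spikes)
```

Changing FRAC and TOTAL in the usage example shows how the chosen precision affects when the neuron crosses its threshold.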

arXiv:2203.05311 [pdf, other]

Design-Technology Co-Optimization for NVM-based Neuromorphic Processing Elements

Authors: Shihao Song, Adarsha Balaji, Anup Das, Nagarajan Kandasamy

Abstract: Neuromorphic hardware platforms can significantly lower the energy overhead of a machine learning inference task. We present a design-technology tradeoff analysis to implement such inference tasks on the processing elements (PEs) of a Non-Volatile Memory (NVM)-based neuromorphic hardware. Through detailed circuit-level simulations at scaled process technology nodes, we show the negative impact of technology scaling on the information-processing latency, which impacts the quality-of-service (QoS) of an embedded ML system. At a finer granularity, the latency inside a PE depends on 1) the delay introduced by parasitic components on its current paths, and 2) the varying delay to sense different resistance states of its NVM cells. Based on these two observations, we make the following three contributions. First, on the technology front, we propose an optimization scheme where the NVM resistance state that takes the longest time to sense is set on current paths having the least delay, and vice versa, reducing the average PE latency, which improves the QoS. Second, on the architecture front, we introduce isolation transistors within each PE to partition it into regions that can be individually power-gated, reducing both latency and energy. Finally, on the system-software front, we propose a mechanism to leverage the proposed technological and architectural enhancements when implementing a machine-learning inference task on neuromorphic PEs of the hardware. Evaluations with a recent neuromorphic hardware architecture show that our proposed design-technology co-optimization approach improves both performance and energy efficiency of machine-learning inference tasks without incurring high cost-per-bit.

Submitted 10 March, 2022; originally announced March 2022.

Comments: Accepted for publication at ACM TECS

arXiv:2202.06717 [pdf, other]

A Data-Centric Approach to Generate Invariants for a Smart Grid Using Machine Learning

Authors: Danish Hudani, Muhammad Haseeb, Muhammad Taufiq, Muhammad Azmi Umer, Nandha Kumar Kandasamy

Abstract: Cyber-Physical Systems (CPS) have gained popularity due to the increased requirements on their uninterrupted connectivity and process automation. Due to their connectivity over the network including intranet and internet, dependence on sensitive data, heterogeneous nature, and large-scale deployment, they are highly vulnerable to cyber-attacks. Cyber-attacks are performed by creating anomalies in the normal operation of the systems with the goal of either disrupting the operation or destroying the system completely. The study proposed here focuses on detecting those anomalies which could be the cause of cyber-attacks. This is achieved by deriving the rules that govern the physical behavior of a process within a plant. These rules are called Invariants. We have proposed a Data-Centric approach (DaC) to generate such invariants. The entire study was conducted using the operational data of a functional smart power grid which is also a living lab.

Submitted 14 February, 2022; originally announced February 2022.

Comments: Accepted in ACM SaT-CPS workshop in conjunction with CODASPY 2022

arXiv:2108.12444 [pdf, other]

A Design Flow for Mapping Spiking Neural Networks to Many-Core Neuromorphic Hardware

Authors: Shihao Song, M. Lakshmi Varshika, Anup Das, Nagarajan Kandasamy

Abstract: The design of many-core neuromorphic hardware is getting more and more complex as these systems are expected to execute large machine learning models. To deal with the design complexity, a predictable design flow is needed to guarantee real-time performance such as latency and throughput without significantly increasing the buffer requirement of computing cores. Synchronous Data Flow Graphs (SDFGs) are used for predictable mapping of streaming applications to multiprocessor systems. We propose an SDFG-based design flow for mapping spiking neural networks (SNNs) to many-core neuromorphic hardware with the objective of exploring the tradeoff between throughput and buffer size. The proposed design flow integrates an iterative partitioning approach, based on the Kernighan-Lin graph partitioning heuristic, creating SNN clusters such that each cluster can be mapped to a core of the hardware. The partitioning approach minimizes the inter-cluster spike communication, which improves latency on the shared interconnect of the hardware. Next, the design flow maps clusters to cores using an instance of Particle Swarm Optimization (PSO), an evolutionary algorithm, exploring the design space of throughput and buffer size. Pareto-optimal mappings are retained from the design flow, allowing system designers to select a Pareto mapping that satisfies throughput and buffer size requirements of the design. We evaluated the design flow using five large-scale convolutional neural network (CNN) models. Results demonstrate 63% higher maximum throughput and 10% lower buffer size requirement compared to state-of-the-art dataflow-based mapping solutions.

Submitted 27 August, 2021; originally announced August 2021.

Comments: To appear in ICCAD 2021

arXiv:2108.02023 [pdf, other]

DFSynthesizer: Dataflow-based Synthesis of Spiking Neural Networks to Neuromorphic Hardware

Authors: Shihao Song, Harry Chong, Adarsha Balaji, Anup Das, James Shackleford, Nagarajan Kandasamy

Abstract: Spiking Neural Networks (SNN) are an emerging computation model, which uses event-driven activation and bio-inspired learning algorithms. SNN-based machine-learning programs are typically executed on tile-based neuromorphic hardware platforms, where each tile consists of a computation unit called crossbar, which maps neurons and synapses of the program. However, synthesizing such programs on an off-the-shelf neuromorphic hardware is challenging. This is because of the inherent resource and latency limitations of the hardware, which impact both model performance, e.g., accuracy, and hardware performance, e.g., throughput. We propose DFSynthesizer, an end-to-end framework for synthesizing SNN-based machine learning programs to neuromorphic hardware. The proposed framework works in four steps. First, it analyzes a machine-learning program and generates SNN workload using representative data. Second, it partitions the SNN workload and generates clusters that fit on crossbars of the target neuromorphic hardware. Third, it exploits the rich semantics of Synchronous Dataflow Graph (SDFG) to represent a clustered SNN program, allowing for performance analysis in terms of key hardware constraints such as number of crossbars, dimension of each crossbar, buffer space on tiles, and tile communication bandwidth. Finally, it uses a novel scheduling algorithm to execute clusters on crossbars of the hardware, guaranteeing hardware performance. We evaluate DFSynthesizer with 10 commonly used machine-learning programs. Our results demonstrate that DFSynthesizer provides a much tighter performance guarantee compared to current mapping approaches.

Submitted 4 August, 2021; originally announced August 2021.

Comments: Accepted for publication at ACM Transactions on Embedded Computing

arXiv:2105.04260 [pdf, other]

EPICTWIN: An Electric Power Digital Twin for Cyber Security Testing, Research and Education

Authors: Nandha Kumar Kandasamy, Sarad Venugopalan, Tin Kit Wong, Leu Junming Nicholas

Abstract: Cyber-Physical Systems (CPS) rely on advanced communication and control technologies to efficiently manage devices and the flow of information in the system. However, a wide variety of potential security challenges has emerged due to the evolution of critical infrastructures (CI) from siloed sub-systems into connected and integrated networks. This is also the case for CI such as a smart grid. Smart grid security studies are carried out on physical test-beds to provide users a platform to train and test cyber attacks in a safe and controlled environment. However, a physical test-bed has limitations: its physical configuration is hard to modify and it is difficult to scale. To overcome these shortcomings, we built a digital power twin for a physical test-bed that is used for cyber security studies on smart grids. On the developed twin, the users can deploy real-world attacks and countermeasures to test and study their effectiveness. The difference from the physical test-bed is that its users may easily modify their power system components and configurations. Further, reproducing the twin for using and advancing the research is significantly cheaper. The developed twin has advanced features compared to any equivalent system in the literature. To illustrate a typical use case, we present a case study where a cyber attack is launched and discuss its implications.

Submitted 10 May, 2021; originally announced May 2021.

arXiv:2105.02038 [pdf, other]

Dynamic Reliability Management in Neuromorphic Computing

Authors: Shihao Song, Jui Hanamshet, Adarsha Balaji, Anup Das, Jeffrey L. Krichmar, Nikil D. Dutt, Nagarajan Kandasamy, Francky Catthoor

Abstract: Neuromorphic computing systems use non-volatile memory (NVM) to implement high-density and low-energy synaptic storage. Elevated voltages and currents needed to operate NVMs cause aging of CMOS-based transistors in each neuron and synapse circuit in the hardware, drifting the transistor's parameters from their nominal values. Aggressive device scaling increases power density and temperature, which accelerates the aging, challenging the reliable operation of neuromorphic systems. Existing reliability-oriented techniques periodically de-stress all neuron and synapse circuits in the hardware at fixed intervals, assuming worst-case operating conditions, without actually tracking their aging at run time. To de-stress these circuits, normal operation must be interrupted, which introduces latency in spike generation and propagation, impacting the inter-spike interval and hence, performance, e.g., accuracy. We propose a new architectural technique to mitigate the aging-related reliability problems in neuromorphic systems, by designing an intelligent run-time manager (NCRTM), which dynamically de-stresses neuron and synapse circuits in response to the short-term aging in their CMOS transistors during the execution of machine learning workloads, with the objective of meeting a reliability target. NCRTM de-stresses these circuits only when it is absolutely necessary to do so, otherwise reducing the performance impact by scheduling de-stress operations off the critical path. We evaluate NCRTM with state-of-the-art machine learning workloads on a neuromorphic hardware. Our results demonstrate that NCRTM significantly improves the reliability of neuromorphic hardware, with marginal impact on performance.

Submitted 5 May, 2021; originally announced May 2021.

Comments: Accepted in ACM JETC

arXiv:2105.01795 [pdf, other]

NeuroXplorer 1.0: An Extensible Framework for Architectural Exploration with Spiking Neural Networks

Authors: Adarsha Balaji, Shihao Song, Twisha Titirsha, Anup Das, Jeffrey Krichmar, Nikil Dutt, James Shackleford, Nagarajan Kandasamy, Francky Catthoor

Abstract: Recently, both industry and academia have proposed many different neuromorphic architectures to execute applications that are designed with Spiking Neural Networks (SNNs). Consequently, there is a growing need for an extensible simulation framework that can perform architectural explorations with SNNs, including both platform-based design of today's hardware, and hardware-software co-design and design-technology co-optimization of the future. We present NeuroXplorer, a fast and extensible framework that is based on a generalized template for modeling a neuromorphic architecture that can be infused with the specific details of a given hardware and/or technology. NeuroXplorer can perform both low-level cycle-accurate architectural simulations and high-level analysis with data-flow abstractions. NeuroXplorer's optimization engine can incorporate hardware-oriented metrics such as energy, throughput, and latency, as well as SNN-oriented metrics such as inter-spike interval distortion and spike disorder, which directly impact SNN performance. We demonstrate the architectural exploration capabilities of NeuroXplorer through case studies with many state-of-the-art machine learning models.

Submitted 4 May, 2021; originally announced May 2021.

arXiv:2103.05707 [pdf, other]

Endurance-Aware Mapping of Spiking Neural Networks to Neuromorphic Hardware

Authors: Twisha Titirsha, Shihao Song, Anup Das, Jeffrey Krichmar, Nikil Dutt, Nagarajan Kandasamy, Francky Catthoor

Abstract: Neuromorphic computing systems are embracing memristors to implement high density and low power synaptic storage as crossbar arrays in hardware. These systems are energy efficient in executing Spiking Neural Networks (SNNs). We observe that long bitlines and wordlines in a memristive crossbar are a major source of parasitic voltage drops, which create current asymmetry. Through circuit simulations, we show the significant endurance variation that results from this asymmetry. Therefore, if the critical memristors (ones with lower endurance) are overutilized, they may lead to a reduction of the crossbar's lifetime. We propose eSpine, a novel technique to improve lifetime by incorporating the endurance variation within each crossbar in mapping machine learning workloads, ensuring that synapses with higher activation are always implemented on memristors with higher endurance, and vice versa. eSpine works in two steps. First, it uses the Kernighan-Lin Graph Partitioning algorithm to partition a workload into clusters of neurons and synapses, where each cluster can fit in a crossbar. Second, it uses an instance of Particle Swarm Optimization (PSO) to map clusters to tiles, where the placement of synapses of a cluster to memristors of a crossbar is performed by analyzing their activation within the workload. We evaluate eSpine for a state-of-the-art neuromorphic hardware model with phase-change memory (PCM)-based memristors. Using 10 SNN workloads, we demonstrate a significant improvement in the effective lifetime.

Submitted 9 March, 2021; originally announced March 2021.

Comments: Accepted for publication in IEEE Transactions on Parallel and Distributed Systems (TPDS)
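
The eSpine abstract above states that synapses with higher activation are placed on memristors with higher endurance, and vice versa. One simple way to obtain such a placement inside a single crossbar is to rank both lists and pair them off. The sketch below illustrates that pairing only; it is not the eSpine algorithm (which also involves Kernighan-Lin partitioning and PSO), and the function name and the example numbers are hypothetical.

```python
def assign_by_endurance(activations, endurances):
    """Pair the most active synapse with the most durable memristor, and so on.

    activations[s] is an activation count for synapse s (assumed metric);
    endurances[m] is an endurance estimate for memristor m (assumed metric).
    Returns a dict mapping synapse index -> memristor index.
    """
    if len(activations) > len(endurances):
        raise ValueError("more synapses than memristors in the crossbar")
    # Rank synapses from most to least active.
    syn_order = sorted(range(len(activations)),
                       key=lambda s: activations[s], reverse=True)
    # Rank memristors from highest to lowest endurance.
    mem_order = sorted(range(len(endurances)),
                       key=lambda m: endurances[m], reverse=True)
    # zip stops at the shorter list, so surplus low-endurance cells stay unused.
    return dict(zip(syn_order, mem_order))


if __name__ == "__main__":
    mapping = assign_by_endurance([5, 20, 1], [100, 300, 200, 50])
    print(mapping)
```

In this example the busiest synapse (index 1) lands on the memristor with the highest endurance (index 1), and the weakest memristor (index 3) is left unused.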

arXiv:2012.00050 [pdf, other]

doi 10.1145/3394885.3431529

Aging-Aware Request Scheduling for Non-Volatile Main Memory

Authors: Shihao Song, Anup Das, Onur Mutlu, Nagarajan Kandasamy

Abstract: Modern computing systems are embracing non-volatile memory (NVM) to implement high-capacity and low-cost main memory. Elevated operating voltages of NVM accelerate the aging of CMOS transistors in the peripheral circuitry of each memory bank. Aggressive device scaling increases power density and temperature, which further accelerates aging, challenging the reliable operation of NVM-based main memory. We propose HEBE, an architectural technique to mitigate the circuit aging-related problems of NVM-based main memory. HEBE is built on three contributions. First, we propose a new analytical model that can dynamically track the aging in the peripheral circuitry of each memory bank based on the bank's utilization. Second, we develop an intelligent memory request scheduler that exploits this aging model at run time to de-stress the peripheral circuitry of a memory bank only when its aging exceeds a critical threshold. Third, we introduce an isolation transistor to decouple parts of a peripheral circuit operating at different voltages, allowing the decoupled logic blocks to undergo long-latency de-stress operations independently and off the critical path of memory read and write accesses, improving performance. We evaluate HEBE with workloads from the SPEC CPU2017 Benchmark suite. Our results show that HEBE significantly improves both performance and lifetime of NVM-based main memory.

Submitted 30 November, 2020; originally announced December 2020.

Comments: To appear in ASP-DAC 2021

arXiv:2009.09298 [pdf, other]

Enabling Resource-Aware Mapping of Spiking Neural Networks via Spatial Decomposition

Authors: Adarsha Balaji, Shihao Song, Anup Das, Jeffrey Krichmar, Nikil Dutt, James Shackleford, Nagarajan Kandasamy, Francky Catthoor

Abstract: With growing model complexity, mapping Spiking Neural Network (SNN)-based applications to tile-based neuromorphic hardware is becoming increasingly challenging. This is because the synaptic storage resources on a tile, viz. a crossbar, can accommodate only a fixed number of pre-synaptic connections per post-synaptic neuron. For complex SNN models that have many pre-synaptic connections per neuron, some connections may need to be pruned after training to fit onto the tile resources, leading to a loss in model quality, e.g., accuracy. In this work, we propose a novel unrolling technique that decomposes a neuron function with many pre-synaptic connections into a sequence of homogeneous neural units, where each neural unit is a function computation node with two pre-synaptic connections. This spatial decomposition technique significantly improves crossbar utilization and retains all pre-synaptic connections, resulting in no loss of the model quality derived from connection pruning. We integrate the proposed technique within an existing SNN mapping framework and evaluate it using machine learning applications on the DYNAP-SE state-of-the-art neuromorphic hardware. Our results demonstrate an average 60% lower crossbar requirement, 9x higher synapse utilization, 62% lower wasted energy on the hardware, and between 0.8% and 4.6% increase in model quality.

Submitted 19 September, 2020; originally announced September 2020.

Comments: Accepted for publication in IEEE Embedded Systems Letters
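
The abstract above describes unrolling a neuron with many pre-synaptic connections into a sequence of units that each have two pre-synaptic connections. The sketch below illustrates the idea for the weighted-sum part of a neuron only: a chain of two-input units reproduces the same total as the original many-input sum. It ignores spiking dynamics entirely, and the data layout and function names (unroll, evaluate) are assumptions for illustration, not the paper's formulation.

```python
def unroll(num_inputs):
    """Build a chain of two-input units covering num_inputs connections.

    Each unit is a pair of sources; a source is ("in", i) for original
    input i or ("unit", j) for the output of an earlier unit j.
    """
    if num_inputs < 2:
        raise ValueError("need at least two pre-synaptic connections")
    units = [(("in", 0), ("in", 1))]
    for i in range(2, num_inputs):
        # Each later unit combines the previous partial result with one new input.
        units.append((("unit", len(units) - 1), ("in", i)))
    return units


def evaluate(units, weights, inputs):
    """Evaluate the chain; the last unit's output is the full weighted sum."""
    outs = []

    def value(src):
        kind, idx = src
        return weights[idx] * inputs[idx] if kind == "in" else outs[idx]

    for left, right in units:
        outs.append(value(left) + value(right))
    return outs[-1]


if __name__ == "__main__":
    w, x = [1, 2, 3, 4], [1, 0, 1, 1]
    chain = unroll(len(w))
    print(len(chain), evaluate(chain, w, x))
```

A neuron with n connections becomes n - 1 two-input units in this chain, which is why no connection has to be pruned to fit a fixed per-neuron fan-in.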

arXiv:2006.05868 [pdf, other]

Improving Dependability of Neuromorphic Computing With Non-Volatile Memory

Authors: Shihao Song, Anup Das, Nagarajan Kandasamy

Abstract: As process technology continues to scale aggressively, circuit aging in a neuromorphic hardware due to negative bias temperature instability (NBTI) and time-dependent dielectric breakdown (TDDB) is becoming a critical reliability issue and is expected to proliferate when using non-volatile memory (NVM) for synaptic storage. This is because an NVM requires high voltage and current to access its synaptic weight, which further accelerates the circuit aging in a neuromorphic hardware. Current methods for qualifying reliability are overly conservative, since they estimate circuit aging considering worst-case operating conditions and unnecessarily constrain performance. This paper proposes RENEU, a reliability-oriented approach to map machine learning applications to neuromorphic hardware, with the aim of improving system-wide reliability without compromising key performance metrics such as execution time of these applications on the hardware. Fundamental to RENEU is a novel formulation of the aging of CMOS-based circuits in a neuromorphic hardware considering different failure mechanisms. Using this formulation, RENEU develops a system-wide reliability model which can be used inside a design-space exploration framework involving the mapping of neurons and synapses to the hardware. To this end, RENEU uses an instance of Particle Swarm Optimization (PSO) to generate mappings that are Pareto-optimal in terms of performance and reliability. We evaluate RENEU using different machine learning applications on a state-of-the-art neuromorphic hardware with NVM synapses. Our results demonstrate an average 38% reduction in circuit aging, leading to an average 18% improvement in the lifetime of the hardware compared to current practices. RENEU only introduces a marginal performance overhead of 5% compared to a performance-oriented state-of-the-art.

Submitted 10 June, 2020; originally announced June 2020.

Comments: 8 pages, 13 figures, accepted in 16th European Dependable Computing Conference

arXiv:2005.04753 [pdf, other]

doi 10.1145/3381898.3397210

Improving Phase Change Memory Performance with Data Content Aware Access

Authors: Shihao Song, Anup Das, Onur Mutlu, Nagarajan Kandasamy

Abstract: A prominent characteristic of the write operation in Phase-Change Memory (PCM) is that its latency and energy are sensitive to the data to be written as well as the content that is overwritten. We observe that overwriting unknown memory content can incur significantly higher latency and energy compared to overwriting known all-zeros or all-ones content. This is because all-zeros or all-ones content is overwritten by programming the PCM cells only in one direction, i.e., using either SET or RESET operations, not both. In this paper, we propose data content aware PCM writes (DATACON), a new mechanism that reduces the latency and energy of PCM writes by redirecting these requests to overwrite memory locations containing all-zeros or all-ones. DATACON operates in three steps. First, it estimates how much a PCM write access would benefit from overwriting known content (e.g., all-zeros, or all-ones) by comprehensively considering the number of set bits in the data to be written, and the energy-latency trade-offs for SET and RESET operations in PCM. Second, it translates the write address to a physical address within memory that contains the best type of content to overwrite, and records this translation in a table for future accesses. We exploit data access locality in workloads to minimize the address translation overhead. Third, it re-initializes unused memory locations with known all-zeros or all-ones content in a manner that does not interfere with regular read and write accesses. DATACON overwrites unknown content only when it is absolutely necessary to do so. We evaluate DATACON with workloads from state-of-the-art machine learning applications, SPEC CPU2017, and NAS Parallel Benchmarks. Results demonstrate that DATACON significantly improves system performance and memory system energy consumption compared to the best of performance-oriented state-of-the-art techniques.

Submitted 10 May, 2020; originally announced May 2020.

Comments: 18 pages, 21 figures, accepted at ACM SIGPLAN International Symposium on Memory Management (ISMM)
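
The first DATACON step described above weighs the number of set bits in the data against the cost of SET and RESET operations. A minimal sketch of such an estimate follows. It is not the DATACON cost model: the linear per-bit costs, the function name choose_target, and the convention that a 1 bit is written with SET and a 0 bit with RESET are assumptions made for illustration.

```python
def choose_target(data: bytes, set_cost: float, reset_cost: float) -> str:
    """Pick which known content is cheaper to overwrite with `data`.

    Assumed model: writing over an all-zeros line needs one SET per 1 bit;
    writing over an all-ones line needs one RESET per 0 bit.
    set_cost and reset_cost are hypothetical per-bit costs.
    """
    total_bits = 8 * len(data)
    ones = sum(bin(byte).count("1") for byte in data)
    zeros = total_bits - ones
    cost_over_zeros = ones * set_cost
    cost_over_ones = zeros * reset_cost
    # Ties go to the all-zeros line (arbitrary choice for this sketch).
    return "all-zeros" if cost_over_zeros <= cost_over_ones else "all-ones"


if __name__ == "__main__":
    print(choose_target(b"\x01", set_cost=1.0, reset_cost=1.0))
    print(choose_target(b"\xfe", set_cost=1.0, reset_cost=1.0))
```

With equal per-bit costs, mostly-zero data is steered to an all-zeros location and mostly-one data to an all-ones location; unequal costs shift the crossover point.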

arXiv:2005.04750 [pdf, other]

doi 10.1145/3381898.3397215

Exploiting Inter- and Intra-Memory Asymmetries for Data Mapping in Hybrid Tiered-Memories

Authors: Shihao Song, Anup Das, Nagarajan Kandasamy

Abstract: Modern computing systems are embracing hybrid memory comprising DRAM and non-volatile memory (NVM) to combine the best properties of both memory technologies, achieving low latency, high reliability, and high density. A prominent characteristic of DRAM-NVM hybrid memory is that it has NVM access latency much higher than DRAM access latency. We call this inter-memory asymmetry. We observe that parasitic components on a long bitline are a major source of high latency in both DRAM and NVM, and a significant factor contributing to high-voltage operations in NVM, which impact their reliability. We propose an architectural change, where each long bitline in DRAM and NVM is split into two segments by an isolation transistor. One segment can be accessed with lower latency and operating voltage than the other. By introducing tiers, we enable non-uniform accesses within each memory type (which we call intra-memory asymmetry), leading to performance and reliability trade-offs in DRAM-NVM hybrid memory. We extend existing NVM-DRAM OS in three ways. First, we exploit both inter- and intra-memory asymmetries to allocate and migrate memory pages between the tiers in DRAM and NVM. Second, we improve the OS's page allocation decisions by predicting the access intensity of a newly-referenced memory page in a program and placing it in a matching tier during its initial allocation. This minimizes page migrations during program execution, lowering the performance overhead. Third, we propose a solution to migrate pages between the tiers of the same memory without transferring data over the memory channel, minimizing channel occupancy and improving performance. Our overall approach, which we call MNEME, to enable and exploit asymmetries in DRAM-NVM hybrid tiered memory improves both performance and reliability for both single-core and multi-programmed workloads.

Submitted 10 May, 2020; originally announced May 2020.

Comments: 15 pages, 29 figures, accepted at ACM SIGPLAN International Symposium on Memory Management

arXiv:2004.03717

doi 10.1145/3372799.3394364

Compiling Spiking Neural Networks to Neuromorphic Hardware

Authors: Shihao Song, Adarsha Balaji, Anup Das, Nagarajan Kandasamy, James Shackleford

Abstract: Machine learning applications that are implemented with a spike-based computation model, e.g., Spiking Neural Network (SNN), have great potential to lower energy consumption when they are executed on neuromorphic hardware. However, compiling and mapping an SNN to the hardware is challenging, especially when compute and storage resources of the hardware (viz. crossbar) need to be shared among the neurons and synapses of the SNN. We propose an approach to analyze and compile SNNs on resource-constrained neuromorphic hardware, providing guarantees on key performance metrics such as execution time and throughput. Our approach makes the following three key contributions. First, we propose a greedy technique to partition an SNN into clusters of neurons and synapses such that each cluster can fit onto the resources of a crossbar. Second, we exploit the rich semantics and expressiveness of Synchronous Dataflow Graphs (SDFGs) to represent a clustered SNN and analyze its performance using Max-Plus Algebra, considering the available compute and storage capacities, buffer sizes, and communication bandwidth. Third, we propose a self-timed execution-based fast technique to compile and admit SNN-based applications to neuromorphic hardware at run-time, adapting dynamically to the available resources on the hardware. We evaluate our approach with standard SNN-based applications and demonstrate a significant performance improvement compared to current practices.

Submitted 12 May, 2020; v1 submitted 7 April, 2020; originally announced April 2020.

Comments: 10 pages, 17 figures, accepted at 21st ACM SIGPLAN/SIGBED International Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES 2020)

arXiv:1911.00548

A Framework to Explore Workload-Specific Performance and Lifetime Trade-offs in Neuromorphic Computing

Authors: Adarsha Balaji, Shihao Song, Anup Das, Nikil Dutt, Jeff Krichmar, Nagarajan Kandasamy, Francky Catthoor

Abstract: Neuromorphic hardware with non-volatile memory (NVM) can implement machine learning workloads in an energy-efficient manner. Unfortunately, certain NVMs such as phase change memory (PCM) require high voltages for correct operation. These voltages are supplied from an on-chip charge pump. If the charge pump is activated too frequently, its internal CMOS devices do not recover from stress, accelerating their aging and leading to negative bias temperature instability (NBTI) generated defects. Forcefully discharging the stressed charge pump can lower the aging rate of its CMOS devices, but makes the neuromorphic hardware unavailable to perform computations while its charge pump is being discharged. This negatively impacts performance such as latency and accuracy of the machine learning workload being executed. In this paper, we propose a novel framework to exploit workload-specific performance and lifetime trade-offs in neuromorphic computing. Our framework first extracts the precise times at which a charge pump in the hardware is activated to support neural computations within a workload. This timing information is then used with a characterized NBTI reliability model to estimate the charge pump's aging during the workload execution. We use our framework to evaluate workload-specific performance and reliability impacts of using 1) different SNN mapping strategies and 2) different charge pump discharge strategies. We show that our framework can be used by system designers to explore performance and reliability trade-offs early in the design of neuromorphic hardware such that appropriate reliability-oriented design margins can be set.

Submitted 1 November, 2019; originally announced November 2019.

Comments: 4 pages, 5 figures, 13 references, accepted for publication at IEEE Computer Architecture Letters

arXiv:1908.07966

Enabling and Exploiting Partition-Level Parallelism (PALP) in Phase Change Memories

Authors: Shihao Song, Anup Das, Onur Mutlu, Nagarajan Kandasamy

Abstract: Phase-change memory (PCM) devices have multiple banks to serve memory requests in parallel. Unfortunately, if two requests go to the same bank, they have to be served one after another, leading to lower system performance. We observe that a modern PCM bank is implemented as a collection of partitions that operate mostly independently while sharing a few global peripheral structures, which include the sense amplifiers (to read) and the write drivers (to write). Based on this observation, we propose PALP, a new mechanism that enables partition-level parallelism within each PCM bank, and exploits such parallelism by using the memory controller's access scheduling decisions. PALP consists of three new contributions. First, we introduce new PCM commands to enable parallelism in a bank's partitions in order to resolve the read-write bank conflicts, with minimal changes needed to PCM logic and its interface. Second, we propose simple circuit modifications that introduce a new operating mode for the write drivers, in addition to their default mode of serving write requests. When configured in this new mode, the write drivers can resolve the read-read bank conflicts, working jointly with the sense amplifiers. Finally, we propose a new access scheduling mechanism in PCM that improves performance by prioritizing those requests that exploit partition-level parallelism over other requests, including the long outstanding ones. While doing so, the memory controller also guarantees starvation-freedom and the PCM's running-average-power-limit (RAPL). We evaluate PALP with workloads from the MiBench and SPEC CPU2017 Benchmark suites. Our results show that PALP reduces average PCM access latency by 23%, and improves average system performance by 28% compared to the state-of-the-art approaches.

Submitted 21 August, 2019; originally announced August 2019.

Comments: 13 pages, 16 figures, 71 references. Published at ACM CASES 2019

arXiv:1611.05787

FPGA Implementation of a Scalable and Run-Time Adaptable Multi-Standard Packet Detector

Authors: James Chacko, Marko Jacovic, Nagarajan Kandasamy, Kapil R. Dandekar

Abstract: This paper describes a step-by-step approach for implementing a scalable and run-time adaptable multi-standard packet detector for orthogonal frequency divisional multiplexing (OFDM) based communication standards. The paper briefly describes considerations and design choices in making a modular system block with generic control supporting rapid prototyping and implementation of preamble-based packet detectors. The results were generated through implementation on a Xilinx Virtex-6 FPGA with a MicroBlaze processor instantiated to provide run-time control and adaptability.

Submitted 13 September, 2016; originally announced November 2016.

arXiv:1606.04552

A New Approach to Dimensionality Reduction for Anomaly Detection in Data Traffic

Authors: Tingshan Huang, Harish Sethu, Nagarajan Kandasamy

Abstract: The monitoring and management of high-volume feature-rich traffic in large networks offers significant challenges in storage, transmission and computational costs. The predominant approach to reducing these costs is based on performing a linear mapping of the data to a low-dimensional subspace such that a certain large percentage of the variance in the data is preserved in the low-dimensional representation. This variance-based subspace approach to dimensionality reduction forces a fixed choice of the number of dimensions, is not responsive to real-time shifts in observed traffic patterns, and is vulnerable to normal traffic spoofing. Based on theoretical insights proved in this paper, we propose a new distance-based approach to dimensionality reduction motivated by the fact that the real-time structural differences between the covariance matrices of the observed and the normal traffic are more relevant to anomaly detection than the structure of the training data alone. Our approach, called the distance-based subspace method, allows a different number of reduced dimensions in different time windows and arrives at only the number of dimensions necessary for effective anomaly detection. We present centralized and distributed versions of our algorithm and, using simulation on real traffic traces, demonstrate the qualitative and quantitative advantages of the distance-based subspace approach.

Submitted 14 June, 2016; originally announced June 2016.

arXiv:1603.08987

Experimental Evaluation of a Reconfigurable Antenna System for Blind Interference Alignment

Authors: Simon Begashaw, James Chacko, Nikhil Gulati, Danh H. Nguyen, Nagarajan Kandasamy, Kapil R. Dandekar

Abstract: In recent years, several experimental studies have come out to validate the theoretical findings of interference alignment (IA), but only a handful of studies have focused on blind interference alignment. Unlike IA and other interference mitigation techniques, blind IA does not require channel state information at the transmitter (CSIT). The key insight is that the transmitter uses the knowledge of channel coherence intervals and receivers utilize reconfigurable antennas to create channel fluctuations exploited by the transmitter. In this work, we present a novel experimental evaluation of a reconfigurable antenna system for achieving blind IA. We present a blind IA technique based on reconfigurable antennas for a 2-user multiple-input single-output (MISO) broadcast channel implemented on a software defined radio platform where each of the receivers is equipped with a reconfigurable antenna. We further compare this blind IA implementation with a traditional TDMA scheme for benchmarking purposes. We show that the achievable rates for blind IA can be realized in practice using measured channels under practical channel conditions. Additionally, the average error vector magnitude and bit error rate (BER) performances are evaluated.

Submitted 29 March, 2016; originally announced March 2016.

Comments: 6 pages, 4 figures

Showing 1–20 of 20 results for author: Kandasamy, N