-
A Precision-Optimized Fixed-Point Near-Memory Digital Processing Unit for Analog In-Memory Computing
Authors:
Elena Ferro,
Athanasios Vasilopoulos,
Corey Lammie,
Manuel Le Gallo,
Luca Benini,
Irem Boybat,
Abu Sebastian
Abstract:
Analog In-Memory Computing (AIMC) is an emerging technology for fast and energy-efficient Deep Learning (DL) inference. However, a certain amount of digital post-processing is required to deal with circuit mismatches and non-idealities associated with the memory devices. Efficient near-memory digital logic is critical to retain the high area/energy efficiency and low latency of AIMC. Existing syst…
▽ More
Analog In-Memory Computing (AIMC) is an emerging technology for fast and energy-efficient Deep Learning (DL) inference. However, a certain amount of digital post-processing is required to deal with circuit mismatches and non-idealities associated with the memory devices. Efficient near-memory digital logic is critical to retain the high area/energy efficiency and low latency of AIMC. Existing systems adopt Floating Point 16 (FP16) arithmetic with limited parallelization capability and high latency. To overcome these limitations, we propose a Near-Memory digital Processing Unit (NMPU) based on fixed-point arithmetic. It achieves competitive accuracy and higher computing throughput than previous approaches while minimizing the area overhead. Moreover, the NMPU supports standard DL activation steps, such as ReLU and Batch Normalization. We perform a physical implementation of the NMPU design in a 14 nm CMOS technology and provide detailed performance, power, and area assessments. We validate the efficacy of the NMPU by using data from an AIMC chip and demonstrate that a simulated AIMC system with the proposed NMPU outperforms existing FP16-based implementations, providing 139$\times$ speed-up, 7.8$\times$ smaller area, and a competitive power consumption. Additionally, our approach achieves an inference accuracy of 86.65 %/65.06 %, with an accuracy drop of just 0.12 %/0.4 % compared to the FP16 baseline when benchmarked with ResNet9/ResNet32 networks trained on the CIFAR10/CIFAR100 datasets, respectively.
△ Less
Submitted 12 February, 2024;
originally announced February 2024.
-
Improving the Accuracy of Analog-Based In-Memory Computing Accelerators Post-Training
Authors:
Corey Lammie,
Athanasios Vasilopoulos,
Julian Büchel,
Giacomo Camposampiero,
Manuel Le Gallo,
Malte Rasch,
Abu Sebastian
Abstract:
Analog-Based In-Memory Computing (AIMC) inference accelerators can be used to efficiently execute Deep Neural Network (DNN) inference workloads. However, to mitigate accuracy losses, due to circuit and device non-idealities, Hardware-Aware (HWA) training methodologies must be employed. These typically require significant information about the underlying hardware. In this paper, we propose two Post…
▽ More
Analog-Based In-Memory Computing (AIMC) inference accelerators can be used to efficiently execute Deep Neural Network (DNN) inference workloads. However, to mitigate accuracy losses, due to circuit and device non-idealities, Hardware-Aware (HWA) training methodologies must be employed. These typically require significant information about the underlying hardware. In this paper, we propose two Post-Training (PT) optimization methods to improve accuracy after training is performed. For each crossbar, the first optimizes the conductance range of each column, and the second optimizes the input, i.e, Digital-to-Analog Converter (DAC), range. It is demonstrated that, when these methods are employed, the complexity during training, and the amount of information about the underlying hardware can be reduced, with no notable change in accuracy ($\leq$0.1%) when finetuning the pretrained RoBERTa transformer model for all General Language Understanding Evaluation (GLUE) benchmark tasks. Additionally, it is demonstrated that further optimizing learned parameters PT improves accuracy.
△ Less
Submitted 18 January, 2024;
originally announced January 2024.
-
LionHeart: A Layer-based Map** Framework for Heterogeneous Systems with Analog In-Memory Computing Tiles
Authors:
Corey Lammie,
Flavio Ponzina,
Yuxuan Wang,
Joshua Klein,
Marina Zapater,
Irem Boybat,
Abu Sebastian,
Giovanni Ansaloni,
David Atienza
Abstract:
When arranged in a crossbar configuration, resistive memory devices can be used to execute MVM, the most dominant operation of many ML algorithms, in constant time complexity. Nonetheless, when performing computations in the analog domain, novel challenges are introduced in terms of arithmetic precision and stochasticity, due to non-ideal circuit and device behaviour. Moreover, these non-idealitie…
▽ More
When arranged in a crossbar configuration, resistive memory devices can be used to execute MVM, the most dominant operation of many ML algorithms, in constant time complexity. Nonetheless, when performing computations in the analog domain, novel challenges are introduced in terms of arithmetic precision and stochasticity, due to non-ideal circuit and device behaviour. Moreover, these non-idealities have a temporal dimension, resulting in a degrading application accuracy over time. Facing these challenges, we propose a novel framework, named LionHeart, to obtain hybrid analog-digital map**s to execute DL inference workloads using heterogeneous accelerators. The accuracy-constrained map**s derived by LionHeart showcase, across different DNNs and datasets, high accuracy and potential for speedup. The results of the full system simulations highlight run-time reductions and energy efficiency gains that exceed 6X, with a user-defined accuracy threshold with respect to a fully digital floating point implementation.
△ Less
Submitted 17 January, 2024;
originally announced January 2024.
-
Using the IBM Analog In-Memory Hardware Acceleration Kit for Neural Network Training and Inference
Authors:
Manuel Le Gallo,
Corey Lammie,
Julian Buechel,
Fabio Carta,
Omobayode Fagbohungbe,
Charles Mackin,
Hsinyu Tsai,
Vijay Narayanan,
Abu Sebastian,
Kaoutar El Maghraoui,
Malte J. Rasch
Abstract:
Analog In-Memory Computing (AIMC) is a promising approach to reduce the latency and energy consumption of Deep Neural Network (DNN) inference and training. However, the noisy and non-linear device characteristics, and the non-ideal peripheral circuitry in AIMC chips, require adapting DNNs to be deployed on such hardware to achieve equivalent accuracy to digital computing. In this tutorial, we prov…
▽ More
Analog In-Memory Computing (AIMC) is a promising approach to reduce the latency and energy consumption of Deep Neural Network (DNN) inference and training. However, the noisy and non-linear device characteristics, and the non-ideal peripheral circuitry in AIMC chips, require adapting DNNs to be deployed on such hardware to achieve equivalent accuracy to digital computing. In this tutorial, we provide a deep dive into how such adaptations can be achieved and evaluated using the recently released IBM Analog Hardware Acceleration Kit (AIHWKit), freely available at https://github.com/IBM/aihwkit. The AIHWKit is a Python library that simulates inference and training of DNNs using AIMC. We present an in-depth description of the AIHWKit design, functionality, and best practices to properly perform inference and training. We also present an overview of the Analog AI Cloud Composer, a platform that provides the benefits of using the AIHWKit simulation in a fully managed cloud setting along with physical AIMC hardware access, freely available at https://aihw-composer.draco.res.ibm.com. Finally, we show examples on how users can expand and customize AIHWKit for their own needs. This tutorial is accompanied by comprehensive Jupyter Notebook code examples that can be run using AIHWKit, which can be downloaded from https://github.com/IBM/aihwkit/tree/master/notebooks/tutorial.
△ Less
Submitted 26 January, 2024; v1 submitted 18 July, 2023;
originally announced July 2023.
-
AnalogNAS: A Neural Network Design Framework for Accurate Inference with Analog In-Memory Computing
Authors:
Hadjer Benmeziane,
Corey Lammie,
Irem Boybat,
Malte Rasch,
Manuel Le Gallo,
Hsinyu Tsai,
Ramachandran Muralidhar,
Smail Niar,
Ouarnoughi Hamza,
Vijay Narayanan,
Abu Sebastian,
Kaoutar El Maghraoui
Abstract:
The advancement of Deep Learning (DL) is driven by efficient Deep Neural Network (DNN) design and new hardware accelerators. Current DNN design is primarily tailored for general-purpose use and deployment on commercially viable platforms. Inference at the edge requires low latency, compact and power-efficient models, and must be cost-effective. Digital processors based on typical von Neumann archi…
▽ More
The advancement of Deep Learning (DL) is driven by efficient Deep Neural Network (DNN) design and new hardware accelerators. Current DNN design is primarily tailored for general-purpose use and deployment on commercially viable platforms. Inference at the edge requires low latency, compact and power-efficient models, and must be cost-effective. Digital processors based on typical von Neumann architectures are not conducive to edge AI given the large amounts of required data movement in and out of memory. Conversely, analog/mixed signal in-memory computing hardware accelerators can easily transcend the memory wall of von Neuman architectures when accelerating inference workloads. They offer increased area and power efficiency, which are paramount in edge resource-constrained environments. In this paper, we propose AnalogNAS, a framework for automated DNN design targeting deployment on analog In-Memory Computing (IMC) inference accelerators. We conduct extensive hardware simulations to demonstrate the performance of AnalogNAS on State-Of-The-Art (SOTA) models in terms of accuracy and deployment efficiency on various Tiny Machine Learning (TinyML) tasks. We also present experimental results that show AnalogNAS models achieving higher accuracy than SOTA models when implemented on a 64-core IMC chip based on Phase Change Memory (PCM). The AnalogNAS search code is released: https://github.com/IBM/analog-nas
△ Less
Submitted 17 May, 2023;
originally announced May 2023.
-
Seizure Detection and Prediction by Parallel Memristive Convolutional Neural Networks
Authors:
Chenqi Li,
Corey Lammie,
Xuening Dong,
Amirali Amirsoleimani,
Mostafa Rahimi Azghadi,
Roman Genov
Abstract:
During the past two decades, epileptic seizure detection and prediction algorithms have evolved rapidly. However, despite significant performance improvements, their hardware implementation using conventional technologies, such as Complementary Metal-Oxide-Semiconductor (CMOS), in power and area-constrained settings remains a challenging task; especially when many recording channels are used. In t…
▽ More
During the past two decades, epileptic seizure detection and prediction algorithms have evolved rapidly. However, despite significant performance improvements, their hardware implementation using conventional technologies, such as Complementary Metal-Oxide-Semiconductor (CMOS), in power and area-constrained settings remains a challenging task; especially when many recording channels are used. In this paper, we propose a novel low-latency parallel Convolutional Neural Network (CNN) architecture that has between 2-2,800x fewer network parameters compared to SOTA CNN architectures and achieves 5-fold cross validation accuracy of 99.84% for epileptic seizure detection, and 99.01% and 97.54% for epileptic seizure prediction, when evaluated using the University of Bonn Electroencephalogram (EEG), CHB-MIT and SWEC-ETHZ seizure datasets, respectively. We subsequently implement our network onto analog crossbar arrays comprising Resistive Random-Access Memory (RRAM) devices, and provide a comprehensive benchmark by simulating, laying out, and determining hardware requirements of the CNN component of our system. To the best of our knowledge, we are the first to parallelize the execution of convolution layer kernels on separate analog crossbars to enable 2 orders of magnitude reduction in latency compared to SOTA hybrid Memristive-CMOS DL accelerators. Furthermore, we investigate the effects of non-idealities on our system and investigate Quantization Aware Training (QAT) to mitigate the performance degradation due to low ADC/DAC resolution. Finally, we propose a stuck weight offsetting methodology to mitigate performance degradation due to stuck RON/ROFF memristor weights, recovering up to 32% accuracy, without requiring retraining. The CNN component of our platform is estimated to consume approximately 2.791W of power while occupying an area of 31.255mm$^2$ in a 22nm FDSOI CMOS process.
△ Less
Submitted 20 June, 2022;
originally announced June 2022.
-
Toward A Formalized Approach for Spike Sorting Algorithms and Hardware Evaluation
Authors:
Tim Zhang,
Corey Lammie,
Mostafa Rahimi Azghadi,
Amirali Amirsoleimani,
Majid Ahmadi,
Roman Genov
Abstract:
Spike sorting algorithms are used to separate extracellular recordings of neuronal populations into single-unit spike activities. The development of customized hardware implementing spike sorting algorithms is burgeoning. However, there is a lack of a systematic approach and a set of standardized evaluation criteria to facilitate direct comparison of both software and hardware implementations. In…
▽ More
Spike sorting algorithms are used to separate extracellular recordings of neuronal populations into single-unit spike activities. The development of customized hardware implementing spike sorting algorithms is burgeoning. However, there is a lack of a systematic approach and a set of standardized evaluation criteria to facilitate direct comparison of both software and hardware implementations. In this paper, we formalize a set of standardized criteria and a publicly available synthetic dataset entitled Synthetic Simulations Of Extracellular Recordings (SSOER), which was constructed by aggregating existing synthetic datasets with varying Signal-To-Noise Ratios (SNRs). Furthermore, we present a benchmark for future comparison, and use our criteria to evaluate a simulated Resistive Random-Access Memory (RRAM) In-Memory Computing (IMC) system using the Discrete Wavelet Transform (DWT) for feature extraction. Our system consumes approximately (per channel) 10.72mW and occupies an area of 0.66mm$^2$ in a 22nm FDSOI Complementary Metal-Oxide-Semiconductor (CMOS) process.
△ Less
Submitted 13 May, 2022;
originally announced May 2022.
-
Navigating Local Minima in Quantized Spiking Neural Networks
Authors:
Jason K. Eshraghian,
Corey Lammie,
Mostafa Rahimi Azghadi,
Wei D. Lu
Abstract:
Spiking and Quantized Neural Networks (NNs) are becoming exceedingly important for hyper-efficient implementations of Deep Learning (DL) algorithms. However, these networks face challenges when trained using error backpropagation, due to the absence of gradient signals when applying hard thresholds. The broadly accepted trick to overcoming this is through the use of biased gradient estimators: sur…
▽ More
Spiking and Quantized Neural Networks (NNs) are becoming exceedingly important for hyper-efficient implementations of Deep Learning (DL) algorithms. However, these networks face challenges when trained using error backpropagation, due to the absence of gradient signals when applying hard thresholds. The broadly accepted trick to overcoming this is through the use of biased gradient estimators: surrogate gradients which approximate thresholding in Spiking Neural Networks (SNNs), and Straight-Through Estimators (STEs), which completely bypass thresholding in Quantized Neural Networks (QNNs). While noisy gradient feedback has enabled reasonable performance on simple supervised learning tasks, it is thought that such noise increases the difficulty of finding optima in loss landscapes, especially during the later stages of optimization. By periodically boosting the Learning Rate (LR) during training, we expect the network can navigate unexplored solution spaces that would otherwise be difficult to reach due to local minima, barriers, or flat surfaces. This paper presents a systematic evaluation of a cosine-annealed LR schedule coupled with weight-independent adaptive moment estimation as applied to Quantized SNNs (QSNNs). We provide a rigorous empirical evaluation of this technique on high precision and 4-bit quantized SNNs across three datasets, demonstrating (close to) state-of-the-art performance on the more complex datasets. Our source code is available at this link: https://github.com/jeshraghian/QSNNs.
△ Less
Submitted 15 February, 2022;
originally announced February 2022.
-
Design Space Exploration of Dense and Sparse Map** Schemes for RRAM Architectures
Authors:
Corey Lammie,
Jason K. Eshraghian,
Chenqi Li,
Amirali Amirsoleimani,
Roman Genov,
Wei D. Lu,
Mostafa Rahimi Azghadi
Abstract:
The impact of device and circuit-level effects in mixed-signal Resistive Random Access Memory (RRAM) accelerators typically manifest as performance degradation of Deep Learning (DL) algorithms, but the degree of impact varies based on algorithmic features. These include network architecture, capacity, weight distribution, and the type of inter-layer connections. Techniques are continuously emergin…
▽ More
The impact of device and circuit-level effects in mixed-signal Resistive Random Access Memory (RRAM) accelerators typically manifest as performance degradation of Deep Learning (DL) algorithms, but the degree of impact varies based on algorithmic features. These include network architecture, capacity, weight distribution, and the type of inter-layer connections. Techniques are continuously emerging to efficiently train sparse neural networks, which may have activation sparsity, quantization, and memristive noise. In this paper, we present an extended Design Space Exploration (DSE) methodology to quantify the benefits and limitations of dense and sparse map** schemes for a variety of network architectures. While sparsity of connectivity promotes less power consumption and is often optimized for extracting localized features, its performance on tiled RRAM arrays may be more susceptible to noise due to under-parameterization, when compared to dense map** schemes. Moreover, we present a case study quantifying and formalizing the trade-offs of typical non-idealities introduced into 1-Transistor-1-Resistor (1T1R) tiled memristive architectures and the size of modular crossbar tiles using the CIFAR-10 dataset.
△ Less
Submitted 24 January, 2022; v1 submitted 17 January, 2022;
originally announced January 2022.
-
A Deep Learning Localization Method for Measuring Abdominal Muscle Dimensions in Ultrasound Images
Authors:
Alzayat Saleh,
Issam H. Laradji,
Corey Lammie,
David Vazquez,
Carol A Flavell,
Mostafa Rahimi Azghadi
Abstract:
Health professionals extensively use Two- Dimensional (2D) Ultrasound (US) videos and images to visualize and measure internal organs for various purposes including evaluation of muscle architectural changes. US images can be used to measure abdominal muscles dimensions for the diagnosis and creation of customized treatment plans for patients with Low Back Pain (LBP), however, they are difficult t…
▽ More
Health professionals extensively use Two- Dimensional (2D) Ultrasound (US) videos and images to visualize and measure internal organs for various purposes including evaluation of muscle architectural changes. US images can be used to measure abdominal muscles dimensions for the diagnosis and creation of customized treatment plans for patients with Low Back Pain (LBP), however, they are difficult to interpret. Due to high variability, skilled professionals with specialized training are required to take measurements to avoid low intra-observer reliability. This variability stems from the challenging nature of accurately finding the correct spatial location of measurement endpoints in abdominal US images. In this paper, we use a Deep Learning (DL) approach to automate the measurement of the abdominal muscle thickness in 2D US images. By treating the problem as a localization task, we develop a modified Fully Convolutional Network (FCN) architecture to generate blobs of coordinate locations of measurement endpoints, similar to what a human operator does. We demonstrate that using the TrA400 US image dataset, our network achieves a Mean Absolute Error (MAE) of 0.3125 on the test set, which almost matches the performance of skilled ultrasound technicians. Our approach can facilitate next steps for automating the process of measurements in 2D US images, while reducing inter-observer as well as intra-observer variability for more effective clinical outcomes.
△ Less
Submitted 30 September, 2021;
originally announced September 2021.
-
Memristive Stochastic Computing for Deep Learning Parameter Optimization
Authors:
Corey Lammie,
Jason K. Eshraghian,
Wei D. Lu,
Mostafa Rahimi Azghadi
Abstract:
Stochastic Computing (SC) is a computing paradigm that allows for the low-cost and low-power computation of various arithmetic operations using stochastic bit streams and digital logic. In contrast to conventional representation schemes used within the binary domain, the sequence of bit streams in the stochastic domain is inconsequential, and computation is usually non-deterministic. In this brief…
▽ More
Stochastic Computing (SC) is a computing paradigm that allows for the low-cost and low-power computation of various arithmetic operations using stochastic bit streams and digital logic. In contrast to conventional representation schemes used within the binary domain, the sequence of bit streams in the stochastic domain is inconsequential, and computation is usually non-deterministic. In this brief, we exploit the stochasticity during switching of probabilistic Conductive Bridging RAM (CBRAM) devices to efficiently generate stochastic bit streams in order to perform Deep Learning (DL) parameter optimization, reducing the size of Multiply and Accumulate (MAC) units by 5 orders of magnitude. We demonstrate that in using a 40-nm Complementary Metal Oxide Semiconductor (CMOS) process our scalable architecture occupies 1.55mm$^2$ and consumes approximately 167$μ$W when optimizing parameters of a Convolutional Neural Network (CNN) while it is being trained for a character recognition task, observing no notable reduction in accuracy post-training.
△ Less
Submitted 11 March, 2021;
originally announced March 2021.
-
Towards Memristive Deep Learning Systems for Real-time Mobile Epileptic Seizure Prediction
Authors:
Corey Lammie,
Wei Xiang,
Mostafa Rahimi Azghadi
Abstract:
The unpredictability of seizures continues to distress many people with drug-resistant epilepsy. On account of recent technological advances, considerable efforts have been made using different hardware technologies to realize smart devices for the real-time detection and prediction of seizures. In this paper, we investigate the feasibility of using Memristive Deep Learning Systems (MDLSs) to perf…
▽ More
The unpredictability of seizures continues to distress many people with drug-resistant epilepsy. On account of recent technological advances, considerable efforts have been made using different hardware technologies to realize smart devices for the real-time detection and prediction of seizures. In this paper, we investigate the feasibility of using Memristive Deep Learning Systems (MDLSs) to perform real-time epileptic seizure prediction on the edge. Using the MemTorch simulation framework and the Children's Hospital Boston (CHB)-Massachusetts Institute of Technology (MIT) dataset we determine the performance of various simulated MDLS configurations. An average sensitivity of 77.4% and a Area Under the Receiver Operating Characteristic Curve (AUROC) of 0.85 are reported for the optimal configuration that can process Electroencephalogram (EEG) spectrograms with 7,680 samples in 1.408ms while consuming 0.0133W and occupying an area of 0.1269mm$^2$ in a 65nm Complementary Metal-Oxide-Semiconductor (CMOS) process.
△ Less
Submitted 16 February, 2021;
originally announced February 2021.
-
Hardware Implementation of Deep Network Accelerators Towards Healthcare and Biomedical Applications
Authors:
Mostafa Rahimi Azghadi,
Corey Lammie,
Jason K. Eshraghian,
Melika Payvand,
Elisa Donati,
Bernabe Linares-Barranco,
Giacomo Indiveri
Abstract:
The advent of dedicated Deep Learning (DL) accelerators and neuromorphic processors has brought on new opportunities for applying both Deep and Spiking Neural Network (SNN) algorithms to healthcare and biomedical applications at the edge. This can facilitate the advancement of medical Internet of Things (IoT) systems and Point of Care (PoC) devices. In this paper, we provide a tutorial describing…
▽ More
The advent of dedicated Deep Learning (DL) accelerators and neuromorphic processors has brought on new opportunities for applying both Deep and Spiking Neural Network (SNN) algorithms to healthcare and biomedical applications at the edge. This can facilitate the advancement of medical Internet of Things (IoT) systems and Point of Care (PoC) devices. In this paper, we provide a tutorial describing how various technologies including emerging memristive devices, Field Programmable Gate Arrays (FPGAs), and Complementary Metal Oxide Semiconductor (CMOS) can be used to develop efficient DL accelerators to solve a wide variety of diagnostic, pattern recognition, and signal processing problems in healthcare. Furthermore, we explore how spiking neuromorphic processors can complement their DL counterparts for processing biomedical signals. The tutorial is augmented with case studies of the vast literature on neural network and neuromorphic hardware as applied to the healthcare domain. We benchmark various hardware platforms by performing a sensor fusion signal processing task combining electromyography (EMG) signals with computer vision. Comparisons are made between dedicated neuromorphic processors and embedded AI accelerators in terms of inference latency and energy. Finally, we provide our analysis of the field and share a perspective on the advantages, disadvantages, challenges, and opportunities that various accelerators and neuromorphic processors introduce to healthcare and biomedical domains.
△ Less
Submitted 28 April, 2021; v1 submitted 10 July, 2020;
originally announced July 2020.
-
MemTorch: An Open-source Simulation Framework for Memristive Deep Learning Systems
Authors:
Corey Lammie,
Wei Xiang,
Bernabé Linares-Barranco,
Mostafa Rahimi Azghadi
Abstract:
Memristive devices have shown great promise to facilitate the acceleration and improve the power efficiency of Deep Learning (DL) systems. Crossbar architectures constructed using these Resistive Random-Access Memory (RRAM) devices can be used to efficiently implement various in-memory computing operations, such as Multiply Accumulate (MAC) and unrolled-convolutions, which are used extensively in…
▽ More
Memristive devices have shown great promise to facilitate the acceleration and improve the power efficiency of Deep Learning (DL) systems. Crossbar architectures constructed using these Resistive Random-Access Memory (RRAM) devices can be used to efficiently implement various in-memory computing operations, such as Multiply Accumulate (MAC) and unrolled-convolutions, which are used extensively in Deep Neural Networks (DNNs) and Convolutional Neural Networks (CNNs). However, memristive devices face concerns of aging and non-idealities, which limit the accuracy, reliability, and robustness of Memristive Deep Learning Systems (MDLSs), that should be considered prior to circuit-level realization. This Original Software Publication (OSP) presents MemTorch, an open-source framework for customized large-scale memristive DL simulations, with a refined focus on the co-simulation of device non-idealities. MemTorch also facilitates co-modelling of key crossbar peripheral circuitry. MemTorch adopts a modernized soft-ware engineering methodology and integrates directly with the well-known PyTorch Machine Learning (ML) library
△ Less
Submitted 18 February, 2022; v1 submitted 23 April, 2020;
originally announced April 2020.
-
Training Progressively Binarizing Deep Networks Using FPGAs
Authors:
Corey Lammie,
Wei Xiang,
Mostafa Rahimi Azghadi
Abstract:
While hardware implementations of inference routines for Binarized Neural Networks (BNNs) are plentiful, current realizations of efficient BNN hardware training accelerators, suitable for Internet of Things (IoT) edge devices, leave much to be desired. Conventional BNN hardware training accelerators perform forward and backward propagations with parameters adopting binary representations, and opti…
▽ More
While hardware implementations of inference routines for Binarized Neural Networks (BNNs) are plentiful, current realizations of efficient BNN hardware training accelerators, suitable for Internet of Things (IoT) edge devices, leave much to be desired. Conventional BNN hardware training accelerators perform forward and backward propagations with parameters adopting binary representations, and optimization using parameters adopting floating or fixed-point real-valued representations--requiring two distinct sets of network parameters. In this paper, we propose a hardware-friendly training method that, contrary to conventional methods, progressively binarizes a singular set of fixed-point network parameters, yielding notable reductions in power and resource utilizations. We use the Intel FPGA SDK for OpenCL development environment to train our progressively binarizing DNNs on an OpenVINO FPGA. We benchmark our training approach on both GPUs and FPGAs using CIFAR-10 and compare it to conventional BNNs.
△ Less
Submitted 8 January, 2020;
originally announced January 2020.
-
Variation-aware Binarized Memristive Networks
Authors:
Corey Lammie,
Olga Krestinskaya,
Alex James,
Mostafa Rahimi Azghadi
Abstract:
The quantization of weights to binary states in Deep Neural Networks (DNNs) can replace resource-hungry multiply accumulate operations with simple accumulations. Such Binarized Neural Networks (BNNs) exhibit greatly reduced resource and power requirements. In addition, memristors have been shown as promising synaptic weight elements in DNNs. In this paper, we propose and simulate novel Binarized M…
▽ More
The quantization of weights to binary states in Deep Neural Networks (DNNs) can replace resource-hungry multiply accumulate operations with simple accumulations. Such Binarized Neural Networks (BNNs) exhibit greatly reduced resource and power requirements. In addition, memristors have been shown as promising synaptic weight elements in DNNs. In this paper, we propose and simulate novel Binarized Memristive Convolutional Neural Network (BMCNN) architectures employing hybrid weight and parameter representations. We train the proposed architectures offline and then map the trained parameters to our binarized memristive devices for inference. To take into account the variations in memristive devices, and to study their effect on the performance, we introduce variations in $R_{ON}$ and $R_{OFF}$. Moreover, we introduce means to mitigate the adverse effect of memristive variations in our proposed networks. Finally, we benchmark our BMCNNs and variation-aware BMCNNs using the MNIST dataset.
△ Less
Submitted 14 October, 2019;
originally announced October 2019.
-
Accelerating Deterministic and Stochastic Binarized Neural Networks on FPGAs Using OpenCL
Authors:
Corey Lammie,
Wei Xiang,
Mostafa Rahimi Azghadi
Abstract:
Recent technological advances have proliferated the available computing power, memory, and speed of modern Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Field Programmable Gate Arrays (FPGAs). Consequently, the performance and complexity of Artificial Neural Networks (ANNs) is burgeoning. While GPU accelerated Deep Neural Networks (DNNs) currently offer state-of-the-art pe…
▽ More
Recent technological advances have proliferated the available computing power, memory, and speed of modern Central Processing Units (CPUs), Graphics Processing Units (GPUs), and Field Programmable Gate Arrays (FPGAs). Consequently, the performance and complexity of Artificial Neural Networks (ANNs) is burgeoning. While GPU accelerated Deep Neural Networks (DNNs) currently offer state-of-the-art performance, they consume large amounts of power. Training such networks on CPUs is inefficient, as data throughput and parallel computation is limited. FPGAs are considered a suitable candidate for performance critical, low power systems, e.g. the Internet of Things (IOT) edge devices. Using the Xilinx SDAccel or Intel FPGA SDK for OpenCL development environment, networks described using the high-level OpenCL framework can be accelerated on heterogeneous platforms. Moreover, the resource utilization and power consumption of DNNs can be further enhanced by utilizing regularization techniques that binarize network weights. In this paper, we introduce, to the best of our knowledge, the first FPGA-accelerated stochastically binarized DNN implementations, and compare them to implementations accelerated using both GPUs and FPGAs. Our developed networks are trained and benchmarked using the popular MNIST and CIFAR-10 datasets, and achieve near state-of-the-art performance, while offering a >16-fold improvement in power consumption, compared to conventional GPU-accelerated networks. Both our FPGA-accelerated determinsitic and stochastic BNNs reduce inference times on MNIST and CIFAR-10 by >9.89x and >9.91x, respectively.
△ Less
Submitted 15 May, 2019;
originally announced May 2019.