
Mitigating Imperfections in Mixed-Signal Neuromorphic Circuits

Authors: Z. Fahimi, M. R. Mahmoodi, M. Klachko, H. Nili, H. Kim, D. B. Strukov

Abstract: The progress in neuromorphic computing is fueled by the development of novel nonvolatile memories capable of storing analog information and implementing neural computation efficiently. However, like most other analog circuits, these devices and circuits are prone to imperfections, such as temperature dependency, noise, tuning error, etc., often leading to considerable performance degradation in neural network implementations. Indeed, imperfections are major obstacles in the path of further progress and ultimate commercialization of these technologies. Hence, a practically viable approach should be developed to deal with these nonidealities and unleash the full potential of nonvolatile memories in neuromorphic systems. Here, for the first time, we report a comprehensive characterization of critical imperfections in two analog-grade memories, namely passively-integrated memristors and redesigned eFlash memories, which both feature long-term retention, high endurance, analog storage, low-power operation, and compact nano-scale footprint. Then, we propose a holistic approach that includes modifications in the training, tuning algorithm, memory state optimization, and circuit design to mitigate these imperfections. Our proposed methodology is corroborated on a hybrid software/experimental framework using two benchmarks: a moderate-size convolutional neural network and ResNet-18 trained on CIFAR-10 and ImageNet datasets, respectively. Our proposed approaches allow 2.5x to 9x improvements in the energy consumption of memory arrays during inference and a sub-percent accuracy drop across the 25-100 °C temperature range. The defect tolerance is improved by >100x, and a sub-percent accuracy drop is demonstrated in deep neural networks built with 64x64 passive memristive crossbars featuring 25% normalized switching threshold variations.

Submitted 9 July, 2021; originally announced July 2021.

arXiv:1908.02472 [pdf]

3D-aCortex: An Ultra-Compact Energy-Efficient Neurocomputing Platform Based on Commercial 3D-NAND Flash Memories

Authors: Mohammad Bavandpour, Shubham Sahay, Mohammad Reza Mahmoodi, Dmitri B. Strukov

Abstract: The first contribution of this paper is the development of extremely dense, energy-efficient mixed-signal vector-by-matrix-multiplication (VMM) circuits based on the existing 3D-NAND flash memory blocks, without any need for their modification. Such compatibility is achieved using time-domain-encoded VMM design. Our detailed simulations have shown that, for example, the 5-bit VMM of 200-element vectors, using the commercially available 64-layer gate-all-around macaroni-type 3D-NAND memory blocks designed in the 55-nm technology node, may provide an unprecedented area efficiency of 0.14 µm²/byte and energy efficiency of ~10 fJ/Op, including the input/output and other peripheral circuitry overheads. Our second major contribution is the development of 3D-aCortex, a multi-purpose neuromorphic inference processor that utilizes the proposed 3D-VMM blocks as its core processing units. We have performed rigorous performance simulations of such a processor on both circuit and system levels, taking into account non-idealities such as drain-induced barrier lowering, capacitive coupling, charge injection, parasitics, process variations, and noise. Our modeling of the 3D-aCortex performing several state-of-the-art neuromorphic-network benchmarks has shown that it may provide the record-breaking storage efficiency of 4.34 MB/mm², the peak energy efficiency of 70.43 TOps/J, and the computational throughput up to 10.66 TOps/s. The storage efficiency can be further improved seven-fold by aggressively sharing VMM peripheral circuits at the cost of a slight decrease in energy efficiency and throughput.

Submitted 7 August, 2019; originally announced August 2019.

Comments: 14 pages, 9 figures, 2 tables

arXiv:1905.09454 [pdf]

Energy-Efficient Moderate Precision Time-Domain Mixed-signal Vector-by-Matrix Multiplier Exploiting 1T-1R Arrays

Authors: Shubham Sahay, Mohammad Bavandpour, Mohammad Reza Mahmoodi, Dmitri Strukov

Abstract: The emerging mobile devices in this era of internet-of-things (IoT) require a dedicated processor to enable computationally intensive applications such as neuromorphic computing and signal processing. Vector-by-matrix multiplication (VMM) is the most prominent operation in these applications. Therefore, there is a critical need for compact and ultralow-power VMM blocks to perform resource-intensive low-to-moderate precision computations. To this end, in this work, for the first time, we propose a time-domain mixed-signal VMM exploiting a modified configuration of 1MOSFET-1RRAM (1T-1R) array. The proposed VMM overcomes the energy inefficiency of the current-mode VMM approaches based on RRAMs. A rigorous analysis of the different non-ideal factors affecting the computational precision indicates that the non-negligible minimum cell currents, channel length modulation (CLM) and drain-induced barrier lowering (DIBL) are the dominant mechanisms degrading the precision of the proposed VMM. Our results also indicate that there exists a trade-off between the computational precision, dynamic range, and the area- and energy-efficiency of the proposed VMM approach. Therefore, we provide the necessary design guidelines for optimizing the performance. Our preliminary results show that an effective computational precision of 6 bits is achievable owing to an inherent compensation effect in the modified 1T-1R blocks. Furthermore, a 4-bit 200x200 VMM utilizing the proposed approach exhibits a significantly high energy efficiency of ~1.5 POps/J and a throughput of 2.5 TOps/s including the contribution from the input/output (I/O) circuitry.

Submitted 6 January, 2020; v1 submitted 23 May, 2019; originally announced May 2019.

arXiv:1904.01705 [pdf]

Improving Noise Tolerance of Mixed-Signal Neural Networks

Authors: Michael Klachko, Mohammad Reza Mahmoodi, Dmitri B. Strukov

Abstract: Mixed-signal hardware accelerators for deep learning achieve orders of magnitude better power efficiency than their digital counterparts. In the ultra-low power consumption regime, limited signal precision inherent to analog computation becomes a challenge. We perform a case study of a 6-layer convolutional neural network running on a mixed-signal accelerator and evaluate its sensitivity to hardware-specific noise. We apply various methods to improve noise robustness of the network and demonstrate an effective way to optimize useful signal ranges through adaptive signal clipping. The resulting model is robust enough to achieve 80.2% classification accuracy on CIFAR-10 dataset with just 1.4 mW power budget, while 6 mW budget allows us to achieve 87.1% accuracy, which is within 1% of the software baseline. For comparison, the unoptimized version of the same model achieves only 67.7% accuracy at 1.4 mW and 78.6% at 6 mW.

Submitted 2 April, 2019; originally announced April 2019.

Comments: Accepted for publication in IJCNN 2019

arXiv:1711.10673 [pdf, other]

Energy-Efficient Time-Domain Vector-by-Matrix Multiplier for Neurocomputing and Beyond

Authors: Mohammad Bavandpour, Mohammad Reza Mahmoodi, Dmitri B. Strukov

Abstract: We propose an extremely energy-efficient mixed-signal approach for performing vector-by-matrix multiplication in the time domain. In such an implementation, multi-bit values of the input and output vector elements are represented with time-encoded digital signals, while multi-bit matrix weights are realized with current sources, e.g. transistors biased in the subthreshold regime. With our approach, multipliers can be chained together to implement large-scale circuits completely in the time domain. Multiplier operation does not rely on energy-taxing static currents, which are typical for peripheral and input/output conversion circuits of the conventional mixed-signal implementations. As a case study, we have designed a multilayer perceptron, based on two layers of 10x10 four-quadrant vector-by-matrix multipliers, in a 55-nm process with embedded NOR flash memory technology, which allows for compact implementation of adjustable current sources. Our analysis, based on memory cell measurements, shows that at high computing speed the drain-induced barrier lowering is a major factor limiting multiplier precision to ~6 bits. Post-layout estimates for a conservative 6-bit digital input/output NxN multiplier designed in a 55-nm process, including I/O circuitry for converting between digital and time domain representations, show ~7 fJ/Op for N>200, which can be further lowered well below 1 fJ/Op for a more optimal and aggressive design.

Submitted 28 November, 2017; originally announced November 2017.

Comments: 6 pages, 6 figures

arXiv:1701.05595 [pdf]

Fast and Efficient Skin Detection for Facial Detection

Authors: Mohammad Reza Mahmoodi

Abstract: In this paper, an efficient skin detection system is proposed. The algorithm is based on a very fast, efficient pre-processing step that uses the concept of ternary conversion to identify candidate windows, followed by a novel local two-stage diffusion method which has an F-score of 0.5978 on the SDD dataset. The pre-processing step has been proven to boost the speed of the system by eliminating 82% of an image on average while keeping the true positive rate above 98%. In addition, a novel segmentation algorithm is designed to process candidate windows, and it is quantitatively and qualitatively proven to be very efficient in terms of accuracy. The algorithm has been implemented on an FPGA to obtain real-time processing speed. The system is fully pipelined, and the inherent parallel structure of the algorithm is fully exploited to maximize the performance. The system is implemented on a Spartan-6 LXT45 Xilinx FPGA and is capable of processing 98 frames of 640x480 24-bit color images per second.

Submitted 19 January, 2017; originally announced January 2017.

arXiv:1701.05588 [pdf]

High Performance Novel Skin Segmentation Algorithm for Images With Complex Background

Authors: Mohammad Reza Mahmoodi

Abstract: Skin segmentation is widely used in biometric applications such as face detection, face recognition, face tracking, and hand gesture recognition. However, several challenges such as nonlinear illumination, equipment effects, personal interferences, ethnicity variations, etc., are involved in the detection process and make color-based methods inefficient. Even though many ideas have already been proposed, the problem has not been satisfactorily solved yet. This paper introduces a technique that addresses some limitations of the previous works. The proposed algorithm consists of three main steps: initial seed generation of the skin map, Otsu segmentation in color images, and finally a two-stage diffusion. The initial seed of skin pixels is based on the idea of a ternary image, as there are certain pixels in images which are associated with human complexion with very high probability. The Otsu segmentation is performed on several color channels in order to identify homogeneous regions. The result, together with the edge map of the image, is used in two consecutive diffusion steps in order to annex initially unidentified skin pixels to the seed. Both quantitative and qualitative results demonstrate the effectiveness of the proposed system in comparison with the state-of-the-art works.

Submitted 19 January, 2017; originally announced January 2017.

Showing 1–7 of 7 results for author: Mahmoodi, M R