-
ADC/DAC-Free Analog Acceleration of Deep Neural Networks with Frequency Transformation
Authors:
Nastaran Darabi,
Maeesha Binte Hashem,
Hongyi Pan,
Ahmet Cetin,
Wilfred Gomes,
Amit Ranjan Trivedi
Abstract:
The edge processing of deep neural networks (DNNs) is becoming increasingly important due to its ability to extract valuable information directly at the data source to minimize latency and energy consumption. Frequency-domain model compression, such as with the Walsh-Hadamard transform (WHT), has been identified as an efficient alternative. However, the benefits of frequency-domain processing are…
▽ More
The edge processing of deep neural networks (DNNs) is becoming increasingly important due to its ability to extract valuable information directly at the data source to minimize latency and energy consumption. Frequency-domain model compression, such as with the Walsh-Hadamard transform (WHT), has been identified as an efficient alternative. However, the benefits of frequency-domain processing are often offset by the increased multiply-accumulate (MAC) operations required. This paper proposes a novel approach to an energy-efficient acceleration of frequency-domain neural networks by utilizing analog-domain frequency-based tensor transformations. Our approach offers unique opportunities to enhance computational efficiency, resulting in several high-level advantages, including array micro-architecture with parallelism, ADC/DAC-free analog computations, and increased output sparsity. Our approach achieves more compact cells by eliminating the need for trainable parameters in the transformation matrix. Moreover, our novel array micro-architecture enables adaptive stitching of cells column-wise and row-wise, thereby facilitating perfect parallelism in computations. Additionally, our scheme enables ADC/DAC-free computations by training against highly quantized matrix-vector products, leveraging the parameter-free nature of matrix multiplications. Another crucial aspect of our design is its ability to handle signed-bit processing for frequency-based transformations. This leads to increased output sparsity and reduced digitization workload. On a 16$\times$16 crossbars, for 8-bit input processing, the proposed approach achieves the energy efficiency of 1602 tera operations per second per Watt (TOPS/W) without early termination strategy and 5311 TOPS/W with early termination strategy at VDD = 0.8 V.
△ Less
Submitted 4 September, 2023;
originally announced September 2023.
-
Memory-Immersed Collaborative Digitization for Area-Efficient Compute-in-Memory Deep Learning
Authors:
Shamma Nasrin,
Maeesha Binte Hashem,
Nastaran Darabi,
Benjamin Parpillon,
Farah Fahim,
Wilfred Gomes,
Amit Ranjan Trivedi
Abstract:
This work discusses memory-immersed collaborative digitization among compute-in-memory (CiM) arrays to minimize the area overheads of a conventional analog-to-digital converter (ADC) for deep learning inference. Thereby, using the proposed scheme, significantly more CiM arrays can be accommodated within limited footprint designs to improve parallelism and minimize external memory accesses. Under t…
▽ More
This work discusses memory-immersed collaborative digitization among compute-in-memory (CiM) arrays to minimize the area overheads of a conventional analog-to-digital converter (ADC) for deep learning inference. Thereby, using the proposed scheme, significantly more CiM arrays can be accommodated within limited footprint designs to improve parallelism and minimize external memory accesses. Under the digitization scheme, CiM arrays exploit their parasitic bit lines to form a within-memory capacitive digital-to-analog converter (DAC) that facilitates area-efficient successive approximation (SA) digitization. CiM arrays collaborate where a proximal array digitizes the analog-domain product-sums when an array computes the scalar product of input and weights. We discuss various networking configurations among CiM arrays where Flash, SA, and their hybrid digitization steps can be efficiently implemented using the proposed memory-immersed scheme. The results are demonstrated using a 65 nm CMOS test chip. Compared to a 40 nm-node 5-bit SAR ADC, our 65 nm design requires $\sim$25$\times$ less area and $\sim$1.4$\times$ less energy by leveraging in-memory computing structures. Compared to a 40 nm-node 5-bit Flash ADC, our design requires $\sim$51$\times$ less area and $\sim$13$\times$ less energy.
△ Less
Submitted 7 July, 2023;
originally announced July 2023.
-
MC-CIM: Compute-in-Memory with Monte-Carlo Dropouts for Bayesian Edge Intelligence
Authors:
Priyesh Shukla,
Shamma Nasrin,
Nastaran Darabi,
Wilfred Gomes,
Amit Ranjan Trivedi
Abstract:
We propose MC-CIM, a compute-in-memory (CIM) framework for robust, yet low power, Bayesian edge intelligence. Deep neural networks (DNN) with deterministic weights cannot express their prediction uncertainties, thereby pose critical risks for applications where the consequences of mispredictions are fatal such as surgical robotics. To address this limitation, Bayesian inference of a DNN has gained…
▽ More
We propose MC-CIM, a compute-in-memory (CIM) framework for robust, yet low power, Bayesian edge intelligence. Deep neural networks (DNN) with deterministic weights cannot express their prediction uncertainties, thereby pose critical risks for applications where the consequences of mispredictions are fatal such as surgical robotics. To address this limitation, Bayesian inference of a DNN has gained attention. Using Bayesian inference, not only the prediction itself, but the prediction confidence can also be extracted for planning risk-aware actions. However, Bayesian inference of a DNN is computationally expensive, ill-suited for real-time and/or edge deployment. An approximation to Bayesian DNN using Monte Carlo Dropout (MC-Dropout) has shown high robustness along with low computational complexity. Enhancing the computational efficiency of the method, we discuss a novel CIM module that can perform in-memory probabilistic dropout in addition to in-memory weight-input scalar product to support the method. We also propose a compute-reuse reformulation of MC-Dropout where each successive instance can utilize the product-sum computations from the previous iteration. Even more, we discuss how the random instances can be optimally ordered to minimize the overall MC-Dropout workload by exploiting combinatorial optimization methods. Application of the proposed CIM-based MC-Dropout execution is discussed for MNIST character recognition and visual odometry (VO) of autonomous drones. The framework reliably gives prediction confidence amidst non-idealities imposed by MC-CIM to a good extent. Proposed MC-CIM with 16x31 SRAM array, 0.85 V supply, 16nm low-standby power (LSTP) technology consumes 27.8 pJ for 30 MC-Dropout instances of probabilistic inference in its most optimal computing and peripheral configuration, saving 43% energy compared to typical execution.
△ Less
Submitted 13 November, 2021;
originally announced November 2021.
-
ENOS: Energy-Aware Network Operator Search for Hybrid Digital and Compute-in-Memory DNN Accelerators
Authors:
Shamma Nasrin,
Ahish Shylendra,
Yuti Kadakia,
Nick Iliev,
Wilfred Gomes,
Theja Tulabandhula,
Amit Ranjan Trivedi
Abstract:
This work proposes a novel Energy-Aware Network Operator Search (ENOS) approach to address the energy-accuracy trade-offs of a deep neural network (DNN) accelerator. In recent years, novel inference operators have been proposed to improve the computational efficiency of a DNN. Augmenting the operators, their corresponding novel computing modes have also been explored. However, simplification of DN…
▽ More
This work proposes a novel Energy-Aware Network Operator Search (ENOS) approach to address the energy-accuracy trade-offs of a deep neural network (DNN) accelerator. In recent years, novel inference operators have been proposed to improve the computational efficiency of a DNN. Augmenting the operators, their corresponding novel computing modes have also been explored. However, simplification of DNN operators invariably comes at the cost of lower accuracy, especially on complex processing tasks. Our proposed ENOS framework allows an optimal layer-wise integration of inference operators and computing modes to achieve the desired balance of energy and accuracy. The search in ENOS is formulated as a continuous optimization problem, solvable using typical gradient descent methods, thereby scalable to larger DNNs with minimal increase in training cost. We characterize ENOS under two settings. In the first setting, for digital accelerators, we discuss ENOS on multiply-accumulate (MAC) cores that can be reconfigured to different operators. ENOS training methods with single and bi-level optimization objectives are discussed and compared. We also discuss a sequential operator assignment strategy in ENOS that only learns the assignment for one layer in one training step, enabling greater flexibility in converging towards the optimal operator allocations. Furthermore, following Bayesian principles, a sampling-based variational mode of ENOS is also presented. ENOS is characterized on popular DNNs ShuffleNet and SqueezeNet on CIFAR10 and CIFAR100.
△ Less
Submitted 12 April, 2021;
originally announced April 2021.
-
Using exoskeletons to assist medical staff during prone positioning of mechanically ventilated COVID-19 patients: a pilot study
Authors:
Serena Ivaldi,
Pauline Maurice,
Waldez Gomes,
Jean Theurel,
Liên Wioland,
Jean-Jacques Atain-Kouadio,
Laurent Claudon,
Hind Hani,
Antoine Kimmoun,
Jean-Marc Sellal,
Bruno Levy,
Jean Paysant,
Sergueï Malikov,
Bruno Chenuel,
Nicla Settembre
Abstract:
We conducted a pilot study to evaluate the potential and feasibility of back-support exoskeletons to help the caregivers in the Intensive Care Unit (ICU) of the University Hospital of Nancy (France) executing Prone Positioning (PP) maneuvers on patients suffering from severe COVID-19-related Acute Respiratory Distress Syndrome. After comparing four commercial exoskeletons, the Laevo passive exoske…
▽ More
We conducted a pilot study to evaluate the potential and feasibility of back-support exoskeletons to help the caregivers in the Intensive Care Unit (ICU) of the University Hospital of Nancy (France) executing Prone Positioning (PP) maneuvers on patients suffering from severe COVID-19-related Acute Respiratory Distress Syndrome. After comparing four commercial exoskeletons, the Laevo passive exoskeleton was selected and used in the ICU in April 2020. The first volunteers using the Laevo reported very positive feedback and reduction of effort, confirmed by EMG and ECG analysis. Laevo has been since used to physically assist during PP in the ICU of the Hospital of Nancy, following the recrudescence of COVID-19, with an overall positive feedback.
△ Less
Submitted 11 February, 2021;
originally announced February 2021.
-
MF-Net: Compute-In-Memory SRAM for Multibit Precision Inference using Memory-immersed Data Conversion and Multiplication-free Operators
Authors:
Shamma Nasrin,
Diaa Badawi,
Ahmet Enis Cetin,
Wilfred Gomes,
Amit Ranjan Trivedi
Abstract:
We propose a co-design approach for compute-in-memory inference for deep neural networks (DNN). We use multiplication-free function approximators based on ell_1 norm along with a co-adapted processing array and compute flow. Using the approach, we overcame many deficiencies in the current art of in-SRAM DNN processing such as the need for digital-to-analog converters (DACs) at each operating SRAM…
▽ More
We propose a co-design approach for compute-in-memory inference for deep neural networks (DNN). We use multiplication-free function approximators based on ell_1 norm along with a co-adapted processing array and compute flow. Using the approach, we overcame many deficiencies in the current art of in-SRAM DNN processing such as the need for digital-to-analog converters (DACs) at each operating SRAM row/column, the need for high precision analog-to-digital converters (ADCs), limited support for multi-bit precision weights, and limited vector-scale parallelism. Our co-adapted implementation seamlessly extends to multi-bit precision weights, it doesn't require DACs, and it easily extends to higher vector-scale parallelism. We also propose an SRAM-immersed successive approximation ADC (SA-ADC), where we exploit the parasitic capacitance of bit lines of SRAM array as a capacitive DAC. Since the dominant area overhead in SA-ADC comes due to its capacitive DAC, by exploiting the intrinsic parasitic of SRAM array, our approach allows low area implementation of within-SRAM SA-ADC. Our 8$\times$62 SRAM macro, which requires a 5-bit ADC, achieves $\sim$105 tera operations per second per Watt (TOPS/W) with 8-bit input/weight processing at 45 nm CMOS.
△ Less
Submitted 29 January, 2021;
originally announced February 2021.