-
Low-Power Hardware-Based Deep-Learning Diagnostics Support Case Study
Authors:
Khushal Sethi,
Vivek Parmar,
Manan Suri
Abstract:
Deep learning research has generated widespread interest leading to emergence of a large variety of technological innovations and applications. As significant proportion of deep learning research focuses on vision based applications, there exists a potential for using some of these techniques to enable low-power portable health-care diagnostic support solutions. In this paper, we propose an embedd…
▽ More
Deep learning research has generated widespread interest leading to emergence of a large variety of technological innovations and applications. As significant proportion of deep learning research focuses on vision based applications, there exists a potential for using some of these techniques to enable low-power portable health-care diagnostic support solutions. In this paper, we propose an embedded-hardware-based implementation of microscopy diagnostic support system for PoC case study on: (a) Malaria in thick blood smears, (b) Tuberculosis in sputum samples, and (c) Intestinal parasite infection in stool samples. We use a Squeeze-Net based model to reduce the network size and computation time. We also utilize the Trained Quantization technique to further reduce memory footprint of the learned models. This enables microscopy-based detection of pathogens that classifies with laboratory expert level accuracy as a standalone embedded hardware platform. The proposed implementation is 6x more power-efficient compared to conventional CPU-based implementation and has an inference time of $\sim$ 3 ms/sample.
△ Less
Submitted 3 September, 2022;
originally announced September 2022.
-
Fully-Binarized, Parallel, RRAM-based Computing Primitive for In-Memory Similarity Search
Authors:
Sandeep Kaur Kingra,
Vivek Parmar,
Deepak Verma,
Alessandro Bricalli,
Giuseppe Piccolboni,
Gabriel Molas,
Amir Regev,
Manan Suri
Abstract:
In this work, we propose a fully-binarized XOR-based IMSS (In-Memory Similarity Search) using RRAM (Resistive Random Access Memory) arrays. XOR (Exclusive OR) operation is realized using 2T-2R bitcells arranged along the column in an array. This enables simultaneous match operation across multiple stored data vectors by performing analog column-wise XOR operation and summation to compute HD (Hammi…
▽ More
In this work, we propose a fully-binarized XOR-based IMSS (In-Memory Similarity Search) using RRAM (Resistive Random Access Memory) arrays. XOR (Exclusive OR) operation is realized using 2T-2R bitcells arranged along the column in an array. This enables simultaneous match operation across multiple stored data vectors by performing analog column-wise XOR operation and summation to compute HD (Hamming Distance). The proposed scheme is experimentally validated on fabricated RRAM arrays. Full-system validation is performed through SPICE simulations using open source Skywater 130 nm CMOS PDK demonstrating energy of 17 fJ per XOR operation using the proposed bitcell with a full-system power dissipation of 145 $μ$W. Using projected estimations at advanced nodes (28 nm) energy savings of $\approx$1.5$\times$ compared to the state-of-the-art can be observed for a fixed workload. Application-level validation is performed on HSI (Hyper-Spectral Image) pixel classification task using the Salinas dataset demonstrating an accuracy of 90%.
△ Less
Submitted 18 September, 2022; v1 submitted 4 August, 2022;
originally announced August 2022.
-
Memory-Oriented Design-Space Exploration of Edge-AI Hardware for XR Applications
Authors:
Vivek Parmar,
Syed Shakib Sarwar,
Ziyun Li,
Hsien-Hsin S. Lee,
Barbara De Salvo,
Manan Suri
Abstract:
Low-Power Edge-AI capabilities are essential for on-device extended reality (XR) applications to support the vision of Metaverse. In this work, we investigate two representative XR workloads: (i) Hand detection and (ii) Eye segmentation, for hardware design space exploration. For both applications, we train deep neural networks and analyze the impact of quantization and hardware specific bottlenec…
▽ More
Low-Power Edge-AI capabilities are essential for on-device extended reality (XR) applications to support the vision of Metaverse. In this work, we investigate two representative XR workloads: (i) Hand detection and (ii) Eye segmentation, for hardware design space exploration. For both applications, we train deep neural networks and analyze the impact of quantization and hardware specific bottlenecks. Through simulations, we evaluate a CPU and two systolic inference accelerator implementations. Next, we compare these hardware solutions with advanced technology nodes. The impact of integrating state-of-the-art emerging non-volatile memory technology (STT/SOT/VGSOT MRAM) into the XR-AI inference pipeline is evaluated. We found that significant energy benefits (>=24%) can be achieved for hand detection (IPS=10) and eye segmentation (IPS=0.1) by introducing non-volatile memory in the memory hierarchy for designs at 7nm node while meeting minimum IPS (inference per second). Moreover, we can realize substantial reduction in area (>=30%) owing to the small form factor of MRAM compared to traditional SRAM.
△ Less
Submitted 28 March, 2023; v1 submitted 8 June, 2022;
originally announced June 2022.
-
Time-multiplexed In-memory computation scheme for map** Quantized Neural Networks on hybrid CMOS-OxRAM building blocks
Authors:
Sandeep Kaur Kingra,
Vivek Parmar,
Manoj Sharma,
Manan Suri
Abstract:
In this work, we experimentally demonstrate two key building blocks for realizing Binary/Ternary Neural Networks (BNNs/TNNs): (i) 130 nm CMOS based sigmoidal neurons and (ii) HfOx based multi-level (MLC) OxRAM-synaptic blocks. An optimized vector matrix multiplication programming scheme that utilizes the two building blocks is also presented. Compared to prior approaches that utilize differential…
▽ More
In this work, we experimentally demonstrate two key building blocks for realizing Binary/Ternary Neural Networks (BNNs/TNNs): (i) 130 nm CMOS based sigmoidal neurons and (ii) HfOx based multi-level (MLC) OxRAM-synaptic blocks. An optimized vector matrix multiplication programming scheme that utilizes the two building blocks is also presented. Compared to prior approaches that utilize differential synaptic structures, a single device per synapse with two sets of READ operations is used. Proposed hardware map** strategy shows performance change of <5% (decrease of 2-5% for TNN, increase of 0.2% for BNN) compared to ideal quantized neural networks (QNN) with significant memory savings in the order of 16-32x for classification problem on Fashion MNIST (FMNIST) dataset. Impact of OxRAM device variability on the performance of Hardware QNN (BNN/TNN) is also analyzed.
△ Less
Submitted 28 July, 2022; v1 submitted 1 June, 2022;
originally announced June 2022.
-
Patent Sentiment Analysis to Highlight Patent Paragraphs
Authors:
Renukswamy Chikkamath,
Vishvapalsinhji Ramsinh Parmar,
Christoph Hewel,
Markus Endres
Abstract:
Given a patent document, identifying distinct semantic annotations is an interesting research aspect. Text annotation helps the patent practitioners such as examiners and patent attorneys to quickly identify the key arguments of any invention, successively providing a timely marking of a patent text. In the process of manual patent analysis, to attain better readability, recognising the semantic i…
▽ More
Given a patent document, identifying distinct semantic annotations is an interesting research aspect. Text annotation helps the patent practitioners such as examiners and patent attorneys to quickly identify the key arguments of any invention, successively providing a timely marking of a patent text. In the process of manual patent analysis, to attain better readability, recognising the semantic information by marking paragraphs is in practice. This semantic annotation process is laborious and time-consuming. To alleviate such a problem, we proposed a novel dataset to train Machine Learning algorithms to automate the highlighting process. The contributions of this work are: i) we developed a multi-class, novel dataset of size 150k samples by traversing USPTO patents over a decade, ii) articulated statistics and distributions of data using imperative exploratory data analysis, iii) baseline Machine Learning models are developed to utilize the dataset to address patent paragraph highlighting task, iv) dataset and codes relating to this task are open-sourced through a dedicated GIT web page: https://github.com/Renuk9390/Patent_Sentiment_Analysis and v) future path to extend this work using Deep Learning and domain specific pre-trained language models to develop a tool to highlight is provided. This work assist patent practitioners in highlighting semantic information automatically and aid to create a sustainable and efficient patent analysis using the aptitude of Machine Learning.
△ Less
Submitted 6 November, 2021;
originally announced November 2021.
-
Exploration of Optimized Semantic Segmentation Architectures for edge-Deployment on Drones
Authors:
Vivek Parmar,
Narayani Bhatia,
Shubham Negi,
Manan Suri
Abstract:
In this paper, we present an analysis on the impact of network parameters for semantic segmentation architectures in context of UAV data processing. We present the analysis on the DroneDeploy Segmentation benchmark. Based on the comparative analysis we identify the optimal network architecture to be FPN-EfficientNetB3 with pretrained encoder backbones based on Imagenet Dataset. The network achieve…
▽ More
In this paper, we present an analysis on the impact of network parameters for semantic segmentation architectures in context of UAV data processing. We present the analysis on the DroneDeploy Segmentation benchmark. Based on the comparative analysis we identify the optimal network architecture to be FPN-EfficientNetB3 with pretrained encoder backbones based on Imagenet Dataset. The network achieves IoU score of 0.65 and F1-score of 0.71 over the validation dataset. We also compare the various architectures in terms of their memory footprint and inference latency with further exploration of the impact of TensorRT based optimizations. We achieve memory savings of ~4.1x and latency improvement of 10% compared to Model: FPN and Backbone: InceptionResnetV2.
△ Less
Submitted 6 July, 2020;
originally announced July 2020.
-
Methodology for Realizing VMM with Binary RRAM Arrays: Experimental Demonstration of Binarized-ADALINE Using OxRAM Crossbar
Authors:
Sandeep Kaur Kingra,
Vivek Parmar,
Shubham Negi,
Sufyan Khan,
Boris Hudec,
Tuo-Hung Hou,
Manan Suri
Abstract:
In this paper, we present an efficient hardware map** methodology for realizing vector matrix multiplication (VMM) on resistive memory (RRAM) arrays. Using the proposed VMM computation technique, we experimentally demonstrate a binarized-ADALINE (Adaptive Linear) classifier on an OxRAM crossbar. An 8x8 OxRAM crossbar with Ni/3-nm HfO2/7 nm Al-doped-TiO2/TiN device stack is used. Weight training…
▽ More
In this paper, we present an efficient hardware map** methodology for realizing vector matrix multiplication (VMM) on resistive memory (RRAM) arrays. Using the proposed VMM computation technique, we experimentally demonstrate a binarized-ADALINE (Adaptive Linear) classifier on an OxRAM crossbar. An 8x8 OxRAM crossbar with Ni/3-nm HfO2/7 nm Al-doped-TiO2/TiN device stack is used. Weight training for the binarized-ADALINE classifier is performed ex-situ on UCI cancer dataset. Post weight generation the OxRAM array is carefully programmed to binary weight-states using the proposed weight map** technique on a custom-built testbench. Our VMM powered binarized-ADALINE network achieves a classification accuracy of 78% in simulation and 67% in experiments. Experimental accuracy was found to drop mainly due to crossbar inherent sneak-path issues and RRAM device programming variability.
△ Less
Submitted 10 June, 2020;
originally announced June 2020.
-
SLIM: Simultaneous Logic-in-Memory Computing Exploiting Bilayer Analog OxRAM Devices
Authors:
Sandeep Kaur Kingra,
Vivek Parmar,
Che-Chia Chang,
Boris Hudec,
Tuo-Hung Hou,
Manan Suri
Abstract:
Von Neumann architecture based computers isolate/physically separate computation and storage units i.e. data is shuttled between computation unit (processor) and memory unit to realize logic/ arithmetic and storage functions. This to-and-fro movement of data leads to a fundamental limitation of modern computers, known as the memory wall. Logic in-Memory (LIM) approaches aim to address this bottlen…
▽ More
Von Neumann architecture based computers isolate/physically separate computation and storage units i.e. data is shuttled between computation unit (processor) and memory unit to realize logic/ arithmetic and storage functions. This to-and-fro movement of data leads to a fundamental limitation of modern computers, known as the memory wall. Logic in-Memory (LIM) approaches aim to address this bottleneck by computing inside the memory units and thereby eliminating the energy-intensive and time-consuming data movement. However, most LIM approaches reported in literature are not truly "simultaneous" as during LIM operation the bitcell can be used only as a Memory cell or only as a Logic cell. The bitcell is not capable of storing both the Memory/Logic outputs simultaneously. Here, we propose a novel 'Simultaneous Logic in-Memory' (SLIM) methodology that allows to implement both Memory and Logic operations simultaneously on the same bitcell in a non-destructive manner without losing the previously stored Memory state. Through extensive experiments we demonstrate the SLIM methodology using non-filamentary bilayer analog OxRAM devices with NMOS transistors (2T-1R bitcell). Detailed programming scheme, array level implementation and controller architecture are also proposed. Furthermore, to study the impact of introducing SLIM array in the memory hierarchy, a simple image processing application (edge detection) is also investigated. It has been estimated that by performing all computations inside the SLIM array, the total Energy Delay Product (EDP) reduces by ~ 40x in comparison to a modern-day computer. EDP saving owing to reduction in data transfer between CPU Memory is observed to be ~ 780x.
△ Less
Submitted 13 February, 2020; v1 submitted 14 November, 2018;
originally announced November 2018.
-
MASTISK
Authors:
Tinish Bhattacharya,
Vivek Parmar,
Manan Suri
Abstract:
In this paper, we present MASTISK (MAchine-learning and Synaptic-plasticity Technology Integrated Simulation frameworK). MASTISK is an open-source versatile and flexible tool developed in MATLAB for design exploration of dedicated neuromorphic hardware using nanodevices and hybrid CMOS-nanodevice circuits. MASTISK has a hierarchical organization capturing details at the level of devices, circuits…
▽ More
In this paper, we present MASTISK (MAchine-learning and Synaptic-plasticity Technology Integrated Simulation frameworK). MASTISK is an open-source versatile and flexible tool developed in MATLAB for design exploration of dedicated neuromorphic hardware using nanodevices and hybrid CMOS-nanodevice circuits. MASTISK has a hierarchical organization capturing details at the level of devices, circuits (i.e. neurons or activation functions, synapses or weights) and architectures (i.e. topology, learning-rules, algorithms). In the current version, MASTISK provides user-friendly interface for design and simulation of spiking neural networks (SNN) powered by spatio-temporal learning rules such as Spike-Timing Dependent Plasticity (STDP). Users may provide network definition as a simple input parameter file and the framework is capable of performing automated learning/inference simulations. Validation case-studies of the proposed open source simulator will be published in the proceedings of IJCNN 2018. The proposed framework offers new functionalities, compared to similar simulation tools in literature, such as: (i) arbitrary synaptic circuit modeling capability with both identical and non-identical stimuli, (ii) arbitrary spike modeling, and (iii) nanodevice based neuron emulation. The code of MASTISK is available on request at: https://gitlab.com/NVM IITD Research/MASTISK/wikis/home
△ Less
Submitted 3 April, 2018;
originally announced April 2018.
-
Design Exploration of Hybrid CMOS-OxRAM Deep Generative Architectures
Authors:
Vivek Parmar,
Manan Suri
Abstract:
Deep Learning and its applications have gained tremendous interest recently in both academia and industry. Restricted Boltzmann Machines (RBMs) offer a key methodology to implement deep learning paradigms. This paper presents a novel approach for realizing hybrid CMOS-OxRAM based deep generative models (DGM). In our proposed hybrid DGM architectures, HfOx based (filamentary-type switching) OxRAM d…
▽ More
Deep Learning and its applications have gained tremendous interest recently in both academia and industry. Restricted Boltzmann Machines (RBMs) offer a key methodology to implement deep learning paradigms. This paper presents a novel approach for realizing hybrid CMOS-OxRAM based deep generative models (DGM). In our proposed hybrid DGM architectures, HfOx based (filamentary-type switching) OxRAM devices are extensively used for realizing multiple computational and non-computational functions such as: (i) Synapses (weights), (ii) internal neuron-state storage, (iii) stochastic neuron activation and (iv) programmable signal normalization. To validate the proposed scheme we have simulated two different architectures: (i) Deep Belief Network (DBN) and (ii) Stacked Denoising Autoencoder for classification and reconstruction of hand-written digits from a reduced MNIST dataset of 6000 images. Contrastive-divergence (CD) specially optimized for OxRAM devices was used to drive the synaptic weight update mechanism of each layer in the network. Overall learning rule was based on greedy-layer wise learning with no back propagation which allows the network to be trained to a good pre-training stage. Performance of the simulated hybrid CMOS-RRAM DGM model matches closely with software based model for a 2-layers deep network. Top-3 test accuracy achieved by the DBN was 95.5%. MSE of the SDA network was 0.003, lower than software based approach. Endurance analysis of the simulated architectures show that for 200 epochs of training (single RBM layer), maximum switching events/per OxRAM device was ~ 7000 cycles.
△ Less
Submitted 6 January, 2018;
originally announced January 2018.