Search | arXiv e-print repository

Block Selective Reprogramming for On-device Training of Vision Transformers

Authors: Sreetama Sarkar, Souvik Kundu, Kai Zheng, Peter A. Beerel

Abstract: The ubiquity of vision transformers (ViTs) for various edge applications, including personalized learning, has created the demand for on-device fine-tuning. However, training with the limited memory and computation power of edge devices remains a significant challenge. In particular, the memory required for training is much higher than that needed for inference, primarily due to the need to store… ▽ More The ubiquity of vision transformers (ViTs) for various edge applications, including personalized learning, has created the demand for on-device fine-tuning. However, training with the limited memory and computation power of edge devices remains a significant challenge. In particular, the memory required for training is much higher than that needed for inference, primarily due to the need to store activations across all layers in order to compute the gradients needed for weight updates. Previous works have explored reducing this memory requirement via frozen-weight training as well storing the activations in a compressed format. However, these methods are deemed inefficient due to their inability to provide training or inference speedup. In this paper, we first investigate the limitations of existing on-device training methods aimed at reducing memory and compute requirements. We then present block selective reprogramming (BSR) in which we fine-tune only a fraction of total blocks of a pre-trained model and selectively drop tokens based on self-attention scores of the frozen layers. To show the efficacy of BSR, we present extensive evaluations on ViT-B and DeiT-S with five different datasets. Compared to the existing alternatives, our approach simultaneously reduces training memory by up to 1.4x and compute cost by up to 2x while maintaining similar accuracy. We also showcase results for Mixture-of-Expert (MoE) models, demonstrating the effectiveness of our approach in multitask learning scenarios. △ Less

Submitted 25 March, 2024; originally announced May 2024.

arXiv:2403.13269 [pdf, other]

AFLoRA: Adaptive Freezing of Low Rank Adaptation in Parameter Efficient Fine-Tuning of Large Models

Authors: Zeyu Liu, Souvik Kundu, Anni Li, Junrui Wan, Lianghao Jiang, Peter Anthony Beerel

Abstract: We present a novel Parameter-Efficient Fine-Tuning (PEFT) method, dubbed as Adaptive Freezing of Low Rank Adaptation (AFLoRA). Specifically, for each pre-trained frozen weight tensor, we add a parallel path of trainable low-rank matrices, namely a down-projection and an up-projection matrix, each of which is followed by a feature transformation vector. Based on a novel freezing score, we the incre… ▽ More We present a novel Parameter-Efficient Fine-Tuning (PEFT) method, dubbed as Adaptive Freezing of Low Rank Adaptation (AFLoRA). Specifically, for each pre-trained frozen weight tensor, we add a parallel path of trainable low-rank matrices, namely a down-projection and an up-projection matrix, each of which is followed by a feature transformation vector. Based on a novel freezing score, we the incrementally freeze these projection matrices during fine-tuning to reduce the computation and alleviate over-fitting. Our experimental results demonstrate that we can achieve state-of-the-art performance with an average improvement of up to $0.85\%$ as evaluated on GLUE benchmark while yeilding up to $9.5\times$ fewer average trainable parameters. While compared in terms of runtime, AFLoRA can yield up to $1.86\times$ improvement as opposed to similar PEFT alternatives. Besides the practical utility of our approach, we provide insights on the trainability requirements of LoRA paths at different modules and the freezing schedule for the different projection matrices. Code will be released. △ Less

Submitted 16 April, 2024; v1 submitted 19 March, 2024; originally announced March 2024.

Comments: 5 pages, 5 figures

arXiv:2402.15121 [pdf, other]

Toward High Performance, Programmable Extreme-Edge Intelligence for Neuromorphic Vision Sensors utilizing Magnetic Domain Wall Motion-based MTJ

Authors: Md Abdullah-Al Kaiser, Gourav Datta, Peter A. Beerel, Akhilesh R. Jaiswal

Abstract: The desire to empower resource-limited edge devices with computer vision (CV) must overcome the high energy consumption of collecting and processing vast sensory data. To address the challenge, this work proposes an energy-efficient non-von-Neumann in-pixel processing solution for neuromorphic vision sensors employing emerging (X) magnetic domain wall magnetic tunnel junction (MDWMTJ) for the firs… ▽ More The desire to empower resource-limited edge devices with computer vision (CV) must overcome the high energy consumption of collecting and processing vast sensory data. To address the challenge, this work proposes an energy-efficient non-von-Neumann in-pixel processing solution for neuromorphic vision sensors employing emerging (X) magnetic domain wall magnetic tunnel junction (MDWMTJ) for the first time, in conjunction with CMOS-based neuromorphic pixels. Our hybrid CMOS+X approach performs in-situ massively parallel asynchronous analog convolution, exhibiting low power consumption and high accuracy across various CV applications by leveraging the non-volatility and programmability of the MDWMTJ. Moreover, our developed device-circuit-algorithm co-design framework captures device constraints (low tunnel-magnetoresistance, low dynamic range) and circuit constraints (non-linearity, process variation, area consideration) based on monte-carlo simulations and device parameters utilizing GF22nm FD-SOI technology. Our experimental results suggest we can achieve an average of 45.3% reduction in backend-processor energy, maintaining similar front-end energy compared to the state-of-the-art and high accuracy of 79.17% and 95.99% on the DVS-CIFAR10 and IBM DVS128-Gesture datasets, respectively. △ Less

Submitted 23 February, 2024; originally announced February 2024.

Comments: 11 pages, 7 figures, 2 table

arXiv:2402.05521 [pdf, other]

Linearizing Models for Efficient yet Robust Private Inference

Authors: Sreetama Sarkar, Souvik Kundu, Peter A. Beerel

Abstract: The growing concern about data privacy has led to the development of private inference (PI) frameworks in client-server applications which protects both data privacy and model IP. However, the cryptographic primitives required yield significant latency overhead which limits its wide-spread application. At the same time, changing environments demand the PI service to be robust against various natur… ▽ More The growing concern about data privacy has led to the development of private inference (PI) frameworks in client-server applications which protects both data privacy and model IP. However, the cryptographic primitives required yield significant latency overhead which limits its wide-spread application. At the same time, changing environments demand the PI service to be robust against various naturally occurring and gradient-based perturbations. Despite several works focused on the development of latency-efficient models suitable for PI, the impact of these models on robustness has remained unexplored. Towards this goal, this paper presents RLNet, a class of robust linearized networks that can yield latency improvement via reduction of high-latency ReLU operations while improving the model performance on both clean and corrupted images. In particular, RLNet models provide a "triple win ticket" of improved classification accuracy on clean, naturally perturbed, and gradient-based perturbed images using a shared-mask shared-weight architecture with over an order of magnitude fewer ReLUs than baseline models. To demonstrate the efficacy of RLNet, we perform extensive experiments with ResNet and WRN model variants on CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. Our experimental evaluations show that RLNet can yield models with up to 11.14x fewer ReLUs, with accuracy close to the all-ReLU models, on clean, naturally perturbed, and gradient-based perturbed images. Compared with the SoTA non-robust linearized models at similar ReLU budgets, RLNet achieves an improvement in adversarial accuracy of up to ~47%, naturally perturbed accuracy up to ~16.4%, while improving clean image accuracy up to ~1.5%. △ Less

Submitted 8 February, 2024; originally announced February 2024.

arXiv:2402.04882 [pdf, other]

LMUFormer: Low Complexity Yet Powerful Spiking Model With Legendre Memory Units

Authors: Zeyu Liu, Gourav Datta, Anni Li, Peter Anthony Beerel

Abstract: Transformer models have demonstrated high accuracy in numerous applications but have high complexity and lack sequential processing capability making them ill-suited for many streaming applications at the edge where devices are heavily resource-constrained. Thus motivated, many researchers have proposed reformulating the transformer models as RNN modules which modify the self-attention computation… ▽ More Transformer models have demonstrated high accuracy in numerous applications but have high complexity and lack sequential processing capability making them ill-suited for many streaming applications at the edge where devices are heavily resource-constrained. Thus motivated, many researchers have proposed reformulating the transformer models as RNN modules which modify the self-attention computation with explicit states. However, these approaches often incur significant performance degradation. The ultimate goal is to develop a model that has the following properties: parallel training, streaming and low-cost inference, and SOTA performance. In this paper, we propose a new direction to achieve this goal. We show how architectural modifications to a recurrent model can help push its performance toward Transformer models while retaining its sequential processing capability. Specifically, inspired by the recent success of Legendre Memory Units (LMU) in sequence learning tasks, we propose LMUFormer, which augments the LMU with convolutional patch embedding and convolutional channel mixer. Moreover, we present a spiking version of this architecture, which introduces the benefit of states within the patch embedding and channel mixer modules while simultaneously reducing the computing complexity. We evaluated our architectures on multiple sequence datasets. In comparison to SOTA transformer-based models within the ANN domain on the SCv2 dataset, our LMUFormer demonstrates comparable performance while necessitating a remarkable 53 times reduction in parameters and a substantial 65 times decrement in FLOPs. Additionally, owing to our model's proficiency in real-time data processing, we can achieve a 32.03% reduction in sequence length, all while incurring an inconsequential decline in performance. Our code is publicly available at https://github.com/zeyuliu1037/LMUFormer.git. △ Less

Submitted 19 January, 2024; originally announced February 2024.

Comments: The 12th International Conference on Learning Representations (ICLR 2024)

arXiv:2401.07393 [pdf, other]

A Novel Optimization Algorithm for Buffer and Splitter Minimization in Phase-Skip** Adiabatic Quantum-Flux-Parametron Circuits

Authors: Robert S. Aviles, Peter A. Beerel

Abstract: Adiabatic Quantum-Flux-Parametron (AQFP) logic is a promising emerging device technology that promises six orders of magnitude lower power than CMOS. However, AQFP is challenged by operation at only ultra-low temperatures, has high latency and area, and requires a complex clocking scheme. In particular, every logic gate, buffer, and splitter must be clocked and each pair of connected clocked gates… ▽ More Adiabatic Quantum-Flux-Parametron (AQFP) logic is a promising emerging device technology that promises six orders of magnitude lower power than CMOS. However, AQFP is challenged by operation at only ultra-low temperatures, has high latency and area, and requires a complex clocking scheme. In particular, every logic gate, buffer, and splitter must be clocked and each pair of connected clocked gates requires overlap** alternating current (AC) clock signals. In particular, clocked buffers need to be used to balance re-convergent logic paths, a problem that is exacerbated by every multi-node fanout needing a tree of clocked splitters. To reduce circuit area many works have proposed buffer and splitter insertion optimization algorithms and recent works have demonstrated a phase-skip** clocking scheme that reduces latency and area. This paper proposes the first algorithm to optimize buffer and splitter insertion for circuits that adopt phase-skip** and demonstrate the resulting performance improvements for a suite of AQFP benchmark circuits. △ Less

Submitted 14 January, 2024; originally announced January 2024.

arXiv:2401.06411 [pdf, other]

An Efficient and Scalable Clocking Assignment Algorithm for Multi-Threaded Multi-Phase Single Flux Quantum Circuits

Authors: Robert S. Aviles, Xi Li, Lei Lu, Zhaorui Ni, Peter A. Beerel

Abstract: A key distinguishing feature of single flux quantum (SFQ) circuits is that each logic gate is clocked. This feature forces the introduction of path-balancing flip-flops to ensure proper synchronization of inputs at each gate. This paper proposes a polynomial time complexity approximation algorithm for clocking assignments that minimizes the insertion of path balancing buffers for multi-threaded mu… ▽ More A key distinguishing feature of single flux quantum (SFQ) circuits is that each logic gate is clocked. This feature forces the introduction of path-balancing flip-flops to ensure proper synchronization of inputs at each gate. This paper proposes a polynomial time complexity approximation algorithm for clocking assignments that minimizes the insertion of path balancing buffers for multi-threaded multi-phase clocking of SFQ circuits. Existing SFQ multi-phase clocking solutions have been shown to effectively reduce the number of required buffers inserted while maintaining high throughput, however, the associated clock assignment algorithms have exponential complexity and can have prohibitively long runtimes for large circuits, limiting the scalability of this approach. Our proposed algorithm is based on a linear program (LP) that leads to solutions that are experimentally on average within 5% of the optimum and helps accelerate convergence towards the optimal integer linear program (ILP) based solution. The improved LP and ILP runtimes permit multi-phase clocking schemes to scale to larger SFQ circuits than previous state of the art clocking assignment methods. We further extend the existing algorithm to support fanout sharing of the added buffers, saving, on average, an additional 10% of the inserted DFFs. Compared to traditional full path balancing (FPB) methods across 10 benchmarks, our enhanced LP saves 79.9%, 87.8%, and 91.2% of the inserted buffers for 2, 3, and 4 clock phases respectively. Finally, we extend this approach to the generation of circuits that completely mitigate potential hold-time violations at the cost of either adding on average less than 10% more buffers (for designs with 3 or more clock phases) or, more generally, adding a clock phase and thereby reducing throughput. △ Less

Submitted 12 January, 2024; originally announced January 2024.

arXiv:2312.06900 [pdf, other]

When Bio-Inspired Computing meets Deep Learning: Low-Latency, Accurate, & Energy-Efficient Spiking Neural Networks from Artificial Neural Networks

Authors: Gourav Datta, Zeyu Liu, James Diffenderfer, Bhavya Kailkhura, Peter A. Beerel

Abstract: Bio-inspired Spiking Neural Networks (SNN) are now demonstrating comparable accuracy to intricate convolutional neural networks (CNN), all while delivering remarkable energy and latency efficiency when deployed on neuromorphic hardware. In particular, ANN-to-SNN conversion has recently gained significant traction in develo** deep SNNs with close to state-of-the-art (SOTA) test accuracy on comple… ▽ More Bio-inspired Spiking Neural Networks (SNN) are now demonstrating comparable accuracy to intricate convolutional neural networks (CNN), all while delivering remarkable energy and latency efficiency when deployed on neuromorphic hardware. In particular, ANN-to-SNN conversion has recently gained significant traction in develo** deep SNNs with close to state-of-the-art (SOTA) test accuracy on complex image recognition tasks. However, advanced ANN-to-SNN conversion approaches demonstrate that for lossless conversion, the number of SNN time steps must equal the number of quantization steps in the ANN activation function. Reducing the number of time steps significantly increases the conversion error. Moreover, the spiking activity of the SNN, which dominates the compute energy in neuromorphic chips, does not reduce proportionally with the number of time steps. To mitigate the accuracy concern, we propose a novel ANN-to-SNN conversion framework, that incurs an exponentially lower number of time steps compared to that required in the SOTA conversion approaches. Our framework modifies the SNN integrate-and-fire (IF) neuron model with identical complexity and shifts the bias term of each batch normalization (BN) layer in the trained ANN. To mitigate the spiking activity concern, we propose training the source ANN with a fine-grained L1 regularizer with surrogate gradients that encourages high spike sparsity in the converted SNN. Our proposed framework thus yields lossless SNNs with ultra-low latency, ultra-low compute energy, thanks to the ultra-low timesteps and high spike sparsity, and ultra-high test accuracy, for example, 73.30% with only 4 time steps on the ImageNet dataset. △ Less

Submitted 11 December, 2023; originally announced December 2023.

Comments: Under review

arXiv:2312.01213 [pdf, other]

Recent Advances in Scalable Energy-Efficient and Trustworthy Spiking Neural networks: from Algorithms to Technology

Authors: Souvik Kundu, Rui-Jie Zhu, Akhilesh Jaiswal, Peter A. Beerel

Abstract: Neuromorphic computing and, in particular, spiking neural networks (SNNs) have become an attractive alternative to deep neural networks for a broad range of signal processing applications, processing static and/or temporal inputs from different sensory modalities, including audio and vision sensors. In this paper, we start with a description of recent advances in algorithmic and optimization innov… ▽ More Neuromorphic computing and, in particular, spiking neural networks (SNNs) have become an attractive alternative to deep neural networks for a broad range of signal processing applications, processing static and/or temporal inputs from different sensory modalities, including audio and vision sensors. In this paper, we start with a description of recent advances in algorithmic and optimization innovations to efficiently train and scale low-latency, and energy-efficient spiking neural networks (SNNs) for complex machine learning applications. We then discuss the recent efforts in algorithm-architecture co-design that explores the inherent trade-offs between achieving high energy-efficiency and low latency while still providing high accuracy and trustworthiness. We then describe the underlying hardware that has been developed to leverage such algorithmic innovations in an efficient way. In particular, we describe a hybrid method to integrate significant portions of the model's computation within both memory components as well as the sensor itself. Finally, we discuss the potential path forward for research in building deployable SNN systems identifying key challenges in the algorithm-hardware-application co-design space with an emphasis on trustworthiness. △ Less

Submitted 2 December, 2023; originally announced December 2023.

Comments: 5 pages, 3 figures

arXiv:2311.16456 [pdf, other]

Spiking Neural Networks with Dynamic Time Steps for Vision Transformers

Authors: Gourav Datta, Zeyu Liu, Anni Li, Peter A. Beerel

Abstract: Spiking Neural Networks (SNNs) have emerged as a popular spatio-temporal computing paradigm for complex vision tasks. Recently proposed SNN training algorithms have significantly reduced the number of time steps (down to 1) for improved latency and energy efficiency, however, they target only convolutional neural networks (CNN). These algorithms, when applied on the recently spotlighted vision tra… ▽ More Spiking Neural Networks (SNNs) have emerged as a popular spatio-temporal computing paradigm for complex vision tasks. Recently proposed SNN training algorithms have significantly reduced the number of time steps (down to 1) for improved latency and energy efficiency, however, they target only convolutional neural networks (CNN). These algorithms, when applied on the recently spotlighted vision transformers (ViT), either require a large number of time steps or fail to converge. Based on analysis of the histograms of the ANN and SNN activation maps, we hypothesize that each ViT block has a different sensitivity to the number of time steps. We propose a novel training framework that dynamically allocates the number of time steps to each ViT module depending on a trainable score assigned to each timestep. In particular, we generate a scalar binary time step mask that filters spikes emitted by each neuron in a leaky-integrate-and-fire (LIF) layer. The resulting SNNs have high activation sparsity and require only accumulate operations (AC), except for the input embedding layer, in contrast to expensive multiply-and-accumulates (MAC) needed in traditional ViTs. This yields significant improvements in energy efficiency. We evaluate our training framework and resulting SNNs on image recognition tasks including CIFAR10, CIFAR100, and ImageNet with different ViT architectures. We obtain a test accuracy of 95.97% with 4.97 time steps with direct encoding on CIFAR10. △ Less

Submitted 27 November, 2023; originally announced November 2023.

Comments: Under review

arXiv:2310.11637 [pdf, other]

FixPix: Fixing Bad Pixels using Deep Learning

Authors: Sreetama Sarkar, Xinan Ye, Gourav Datta, Peter A. Beerel

Abstract: Efficient and effective on-line detection and correction of bad pixels can improve yield and increase the expected lifetime of image sensors. This paper presents a comprehensive Deep Learning (DL) based on-line detection-correction approach, suitable for a wide range of pixel corruption rates. A confidence calibrated segmentation approach is introduced, which achieves nearly perfect bad pixel dete… ▽ More Efficient and effective on-line detection and correction of bad pixels can improve yield and increase the expected lifetime of image sensors. This paper presents a comprehensive Deep Learning (DL) based on-line detection-correction approach, suitable for a wide range of pixel corruption rates. A confidence calibrated segmentation approach is introduced, which achieves nearly perfect bad pixel detection, even with few training samples. A computationally light-weight correction algorithm is proposed for low rates of pixel corruption, that surpasses the accuracy of traditional interpolation-based techniques. We also propose an autoencoder based image reconstruction approach which alleviates the need for prior bad pixel detection and yields promising results for high rates of pixel corruption. Unlike previous methods, which use proprietary images, we demonstrate the efficacy of the proposed methods on the open-source Samsung S7 ISP and MIT-Adobe FiveK datasets. Our approaches yield up to 99.6% detection accuracy with <0.6% false positives and corrected images within 1.5% average pixel error from 70% corrupted images. △ Less

Submitted 17 October, 2023; originally announced October 2023.

arXiv:2309.07254 [pdf, other]

Mitigate Replication and Copying in Diffusion Models with Generalized Caption and Dual Fusion Enhancement

Authors: Chenghao Li, Dake Chen, Yuke Zhang, Peter A. Beerel

Abstract: While diffusion models demonstrate a remarkable capability for generating high-quality images, their tendency to `replicate' training data raises privacy concerns. Although recent research suggests that this replication may stem from the insufficient generalization of training data captions and duplication of training images, effective mitigation strategies remain elusive. To address this gap, our… ▽ More While diffusion models demonstrate a remarkable capability for generating high-quality images, their tendency to `replicate' training data raises privacy concerns. Although recent research suggests that this replication may stem from the insufficient generalization of training data captions and duplication of training images, effective mitigation strategies remain elusive. To address this gap, our paper first introduces a generality score that measures the caption generality and employ large language model (LLM) to generalize training captions. Subsequently, we leverage generalized captions and propose a novel dual fusion enhancement approach to mitigate the replication of diffusion models. Our empirical results demonstrate that our proposed methods can significantly reduce replication by 43.5% compared to the original diffusion model while maintaining the diversity and quality of generations. Code is available at https://github.com/HowardLi0816/dual-fusion-diffusion. △ Less

Submitted 23 January, 2024; v1 submitted 13 September, 2023; originally announced September 2023.

Comments: This paper has been accepted for presentation at 2024 IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2024)

arXiv:2306.04859 [pdf, other]

doi 10.1145/3583781.3590266

Island-based Random Dynamic Voltage Scaling vs ML-Enhanced Power Side-Channel Attacks

Authors: Dake Chen, Christine Goins, Maxwell Waugaman, Georgios D. Dimou, Peter A. Beerel

Abstract: In this paper, we describe and analyze an island-based random dynamic voltage scaling (iRDVS) approach to thwart power side-channel attacks. We first analyze the impact of the number of independent voltage islands on the resulting signal-to-noise ratio and trace misalignment. As part of our analysis of misalignment, we propose a novel unsupervised machine learning (ML) based attack that is effecti… ▽ More In this paper, we describe and analyze an island-based random dynamic voltage scaling (iRDVS) approach to thwart power side-channel attacks. We first analyze the impact of the number of independent voltage islands on the resulting signal-to-noise ratio and trace misalignment. As part of our analysis of misalignment, we propose a novel unsupervised machine learning (ML) based attack that is effective on systems with three or fewer independent voltages. Our results show that iRDVS with four voltage islands, however, cannot be broken with 200k encryption traces, suggesting that iRDVS can be effective. We finish the talk by describing an iRDVS test chip in a 12nm FinFet process that incorporates three variants of an AES-256 accelerator, all originating from the same RTL. This included a synchronous core, an asynchronous core with no protection, and a core employing the iRDVS technique using asynchronous logic. Lab measurements from the chips indicated that both unprotected variants failed the test vector leakage assessment (TVLA) security metric test, while the iRDVS was proven secure in a variety of configurations. △ Less

Submitted 13 June, 2023; v1 submitted 7 June, 2023; originally announced June 2023.

Journal ref: Proceedings of the Great Lakes Symposium on VLSI 2023

arXiv:2305.00107 [pdf, other]

Unraveling Latch Locking Using Machine Learning, Boolean Analysis, and ILP

Authors: Dake Chen, Xuan Zhou, Yinghua Hu, Yuke Zhang, Kaixin Yang, Andrew Rittenbach, Pierluigi Nuzzo, Peter A. Beerel

Abstract: Logic locking has become a promising approach to provide hardware security in the face of a possibly insecure fabrication supply chain. While many techniques have focused on locking combinational logic (CL), an alternative latch-locking approach in which the sequential elements are locked has also gained significant attention. Latch (LAT) locking duplicates a subset of the flip-flops (FF) of a des… ▽ More Logic locking has become a promising approach to provide hardware security in the face of a possibly insecure fabrication supply chain. While many techniques have focused on locking combinational logic (CL), an alternative latch-locking approach in which the sequential elements are locked has also gained significant attention. Latch (LAT) locking duplicates a subset of the flip-flops (FF) of a design, retimes these FFs and replaces them with latches, and adds two types of decoy latches to obfuscate the netlist. It then adds control circuitry (CC) such that all latches must be correctly keyed for the circuit to function correctly. This paper presents a two-phase attack on latch-locked circuits that uses a novel combination of deep learning, Boolean analysis, and integer linear programming (ILP). The attack requires access to the reverse-engineered netlist but, unlike SAT attacks, is oracle-less, not needing access to the unlocked circuit or correct input/output pairs. We trained and evaluated the attack using the ISCAS'89 and ITC'99 benchmark circuits. The attack successfully identifies a key that is, on average, 96.9% accurate and fully discloses the correct functionality in 8 of the tested 19 circuits and leads to low function corruptibility (less than 4%) in 3 additional circuits. The attack run-times are manageable. △ Less

Submitted 28 April, 2023; originally announced May 2023.

Comments: 8 pages, 7 figures, accepted by ISQED 2023

ACM Class: B.m; C.m

arXiv:2304.13274 [pdf, other]

Making Models Shallow Again: Jointly Learning to Reduce Non-Linearity and Depth for Latency-Efficient Private Inference

Authors: Souvik Kundu, Yuke Zhang, Dake Chen, Peter A. Beerel

Abstract: Large number of ReLU and MAC operations of Deep neural networks make them ill-suited for latency and compute-efficient private inference. In this paper, we present a model optimization method that allows a model to learn to be shallow. In particular, we leverage the ReLU sensitivity of a convolutional block to remove a ReLU layer and merge its succeeding and preceding convolution layers to a shall… ▽ More Large number of ReLU and MAC operations of Deep neural networks make them ill-suited for latency and compute-efficient private inference. In this paper, we present a model optimization method that allows a model to learn to be shallow. In particular, we leverage the ReLU sensitivity of a convolutional block to remove a ReLU layer and merge its succeeding and preceding convolution layers to a shallow block. Unlike existing ReLU reduction methods, our joint reduction method can yield models with improved reduction of both ReLUs and linear operations by up to 1.73x and 1.47x, respectively, evaluated with ResNet18 on CIFAR-100 without any significant accuracy-drop. △ Less

Submitted 26 April, 2023; originally announced April 2023.

arXiv:2304.13266 [pdf, other]

C2PI: An Efficient Crypto-Clear Two-Party Neural Network Private Inference

Authors: Yuke Zhang, Dake Chen, Souvik Kundu, Haomei Liu, Ruiheng Peng, Peter A. Beerel

Abstract: Recently, private inference (PI) has addressed the rising concern over data and model privacy in machine learning inference as a service. However, existing PI frameworks suffer from high computational and communication costs due to the expensive multi-party computation (MPC) protocols. Existing literature has developed lighter MPC protocols to yield more efficient PI schemes. We, in contrast, prop… ▽ More Recently, private inference (PI) has addressed the rising concern over data and model privacy in machine learning inference as a service. However, existing PI frameworks suffer from high computational and communication costs due to the expensive multi-party computation (MPC) protocols. Existing literature has developed lighter MPC protocols to yield more efficient PI schemes. We, in contrast, propose to lighten them by introducing an empirically-defined privacy evaluation. To that end, we reformulate the threat model of PI and use inference data privacy attacks (IDPAs) to evaluate data privacy. We then present an enhanced IDPA, named distillation-based inverse-network attack (DINA), for improved privacy evaluation. Finally, we leverage the findings from DINA and propose C2PI, a two-party PI framework presenting an efficient partitioning of the neural network model and requiring only the initial few layers to be performed with MPC protocols. Based on our experimental evaluations, relaxing the formal data privacy guarantees C2PI can speed up existing PI frameworks, including Delphi [1] and Cheetah [2], up to 2.89x and 3.88x under LAN and WAN settings, respectively, and save up to 2.75x communication costs. △ Less

Submitted 25 April, 2023; originally announced April 2023.

arXiv:2304.02968 [pdf, other]

doi 10.1145/3583781.3590235

Technology-Circuit-Algorithm Tri-Design for Processing-in-Pixel-in-Memory (P2M)

Authors: Md Abdullah-Al Kaiser, Gourav Datta, Sreetama Sarkar, Souvik Kundu, Zihan Yin, Manas Garg, Ajey P. Jacob, Peter A. Beerel, Akhilesh R. Jaiswal

Abstract: The massive amounts of data generated by camera sensors motivate data processing inside pixel arrays, i.e., at the extreme-edge. Several critical developments have fueled recent interest in the processing-in-pixel-in-memory paradigm for a wide range of visual machine intelligence tasks, including (1) advances in 3D integration technology to enable complex processing inside each pixel in a 3D integ… ▽ More The massive amounts of data generated by camera sensors motivate data processing inside pixel arrays, i.e., at the extreme-edge. Several critical developments have fueled recent interest in the processing-in-pixel-in-memory paradigm for a wide range of visual machine intelligence tasks, including (1) advances in 3D integration technology to enable complex processing inside each pixel in a 3D integrated manner while maintaining pixel density, (2) analog processing circuit techniques for massively parallel low-energy in-pixel computations, and (3) algorithmic techniques to mitigate non-idealities associated with analog processing through hardware-aware training schemes. This article presents a comprehensive technology-circuit-algorithm landscape that connects technology capabilities, circuit design strategies, and algorithmic optimizations to power, performance, area, bandwidth reduction, and application-level accuracy metrics. We present our results using a comprehensive co-design framework incorporating hardware and algorithmic optimizations for various complex real-life visual intelligence tasks mapped onto our P2M paradigm. △ Less

Submitted 6 April, 2023; originally announced April 2023.

Journal ref: GLSVLSI '23: Great Lakes Symposium on VLSI 2023 Proceedings

arXiv:2302.09108 [pdf, other]

doi 10.1109/ISCAS46773.2023.10181988

ViTA: A Vision Transformer Inference Accelerator for Edge Applications

Authors: Shashank Nag, Gourav Datta, Souvik Kundu, Nitin Chandrachoodan, Peter A. Beerel

Abstract: Vision Transformer models, such as ViT, Swin Transformer, and Transformer-in-Transformer, have recently gained significant traction in computer vision tasks due to their ability to capture the global relation between features which leads to superior performance. However, they are compute-heavy and difficult to deploy in resource-constrained edge devices. Existing hardware accelerators, including t… ▽ More Vision Transformer models, such as ViT, Swin Transformer, and Transformer-in-Transformer, have recently gained significant traction in computer vision tasks due to their ability to capture the global relation between features which leads to superior performance. However, they are compute-heavy and difficult to deploy in resource-constrained edge devices. Existing hardware accelerators, including those for the closely-related BERT transformer models, do not target highly resource-constrained environments. In this paper, we address this gap and propose ViTA - a configurable hardware accelerator for inference of vision transformer models, targeting resource-constrained edge computing devices and avoiding repeated off-chip memory accesses. We employ a head-level pipeline and inter-layer MLP optimizations, and can support several commonly used vision transformer models with changes solely in our control logic. We achieve nearly 90% hardware utilization efficiency on most vision transformer models, report a power of 0.88W when synthesised with a clock of 150 MHz, and get reasonable frame rates - all of which makes ViTA suitable for edge applications. △ Less

Submitted 17 February, 2023; originally announced February 2023.

Comments: Accepted at ISCAS 2023

Journal ref: 2023 IEEE International Symposium on Circuits and Systems (ISCAS), Monterey, CA, USA, 2023, pp. 1-5

arXiv:2302.03523 [pdf, other]

Sparse Mixture Once-for-all Adversarial Training for Efficient In-Situ Trade-Off Between Accuracy and Robustness of DNNs

Authors: Souvik Kundu, Sairam Sundaresan, Sharath Nittur Sridhar, Shunlin Lu, Han Tang, Peter A. Beerel

Abstract: Existing deep neural networks (DNNs) that achieve state-of-the-art (SOTA) performance on both clean and adversarially-perturbed images rely on either activation or weight conditioned convolution operations. However, such conditional learning costs additional multiply-accumulate (MAC) or addition operations, increasing inference memory and compute costs. To that end, we present a sparse mixture onc… ▽ More Existing deep neural networks (DNNs) that achieve state-of-the-art (SOTA) performance on both clean and adversarially-perturbed images rely on either activation or weight conditioned convolution operations. However, such conditional learning costs additional multiply-accumulate (MAC) or addition operations, increasing inference memory and compute costs. To that end, we present a sparse mixture once for all adversarial training (SMART), that allows a model to train once and then in-situ trade-off between accuracy and robustness, that too at a reduced compute and parameter overhead. In particular, SMART develops two expert paths, for clean and adversarial images, respectively, that are then conditionally trained via respective dedicated sets of binary sparsity masks. Extensive evaluations on multiple image classification datasets across different models show SMART to have up to 2.72x fewer non-zero parameters costing proportional reduction in compute overhead, while yielding SOTA accuracy-robustness trade-off. Additionally, we present insightful observations in designing sparse masks to successfully condition on both clean and perturbed images. △ Less

Submitted 27 December, 2022; originally announced February 2023.

Comments: 5 pages, 5 figures, 2 tables

arXiv:2301.09254 [pdf, other]

Learning to Linearize Deep Neural Networks for Secure and Efficient Private Inference

Authors: Souvik Kundu, Shunlin Lu, Yuke Zhang, Jacqueline Liu, Peter A. Beerel

Abstract: The large number of ReLU non-linearity operations in existing deep neural networks makes them ill-suited for latency-efficient private inference (PI). Existing techniques to reduce ReLU operations often involve manual effort and sacrifice significant accuracy. In this paper, we first present a novel measure of non-linearity layers' ReLU sensitivity, enabling mitigation of the time-consuming manual… ▽ More The large number of ReLU non-linearity operations in existing deep neural networks makes them ill-suited for latency-efficient private inference (PI). Existing techniques to reduce ReLU operations often involve manual effort and sacrifice significant accuracy. In this paper, we first present a novel measure of non-linearity layers' ReLU sensitivity, enabling mitigation of the time-consuming manual efforts in identifying the same. Based on this sensitivity, we then present SENet, a three-stage training method that for a given ReLU budget, automatically assigns per-layer ReLU counts, decides the ReLU locations for each layer's activation map, and trains a model with significantly fewer ReLUs to potentially yield latency and communication efficient PI. Experimental evaluations with multiple models on various datasets show SENet's superior performance both in terms of reduced ReLUs and improved classification accuracy compared to existing alternatives. In particular, SENet can yield models that require up to ~2x fewer ReLUs while yielding similar accuracy. For a similar ReLU budget SENet can yield models with ~2.32% improved classification accuracy, evaluated on CIFAR-100. △ Less

Submitted 22 January, 2023; originally announced January 2023.

Comments: 15 pages, 10 figures, 11 tables. Accepted as a conference paper at ICLR 2023

arXiv:2301.09111 [pdf, other]

Neuromorphic-P2M: Processing-in-Pixel-in-Memory Paradigm for Neuromorphic Image Sensors

Authors: Md Abdullah-Al Kaiser, Gourav Datta, Zixu Wang, Ajey P. Jacob, Peter A. Beerel, Akhilesh R. Jaiswal

Abstract: Edge devices equipped with computer vision must deal with vast amounts of sensory data with limited computing resources. Hence, researchers have been exploring different energy-efficient solutions such as near-sensor processing, in-sensor processing, and in-pixel processing, bringing the computation closer to the sensor. In particular, in-pixel processing embeds the computation capabilities inside… ▽ More Edge devices equipped with computer vision must deal with vast amounts of sensory data with limited computing resources. Hence, researchers have been exploring different energy-efficient solutions such as near-sensor processing, in-sensor processing, and in-pixel processing, bringing the computation closer to the sensor. In particular, in-pixel processing embeds the computation capabilities inside the pixel array and achieves high energy efficiency by generating low-level features instead of the raw data stream from CMOS image sensors. Many different in-pixel processing techniques and approaches have been demonstrated on conventional frame-based CMOS imagers, however, the processing-in-pixel approach for neuromorphic vision sensors has not been explored so far. In this work, we for the first time, propose an asynchronous non-von-Neumann analog processing-in-pixel paradigm to perform convolution operations by integrating in-situ multi-bit multi-channel convolution inside the pixel array performing analog multiply and accumulate (MAC) operations that consume significantly less energy than their digital MAC alternative. To make this approach viable, we incorporate the circuit's non-ideality, leakage, and process variations into a novel hardware-algorithm co-design framework that leverages extensive HSpice simulations of our proposed circuit using the GF22nm FD-SOI technology node. We verified our framework on state-of-the-art neuromorphic vision sensor datasets and show that our solution consumes ~2x lower backend-processor energy while maintaining almost similar front-end (sensor) energy on the IBM DVS128-Gesture dataset than the state-of-the-art while maintaining a high test accuracy of 88.36%. △ Less

Submitted 22 January, 2023; originally announced January 2023.

Comments: 17 pages, 11 figures, 2 tables

arXiv:2212.10881 [pdf, other]

In-Sensor & Neuromorphic Computing are all you need for Energy Efficient Computer Vision

Authors: Gourav Datta, Zeyu Liu, Md Abdullah-Al Kaiser, Souvik Kundu, Joe Mathai, Zihan Yin, Ajey P. Jacob, Akhilesh R. Jaiswal, Peter A. Beerel

Abstract: Due to the high activation sparsity and use of accumulates (AC) instead of expensive multiply-and-accumulates (MAC), neuromorphic spiking neural networks (SNNs) have emerged as a promising low-power alternative to traditional DNNs for several computer vision (CV) applications. However, most existing SNNs require multiple time steps for acceptable inference accuracy, hindering real-time deployment… ▽ More Due to the high activation sparsity and use of accumulates (AC) instead of expensive multiply-and-accumulates (MAC), neuromorphic spiking neural networks (SNNs) have emerged as a promising low-power alternative to traditional DNNs for several computer vision (CV) applications. However, most existing SNNs require multiple time steps for acceptable inference accuracy, hindering real-time deployment and increasing spiking activity and, consequently, energy consumption. Recent works proposed direct encoding that directly feeds the analog pixel values in the first layer of the SNN in order to significantly reduce the number of time steps. Although the overhead for the first layer MACs with direct encoding is negligible for deep SNNs and the CV processing is efficient using SNNs, the data transfer between the image sensors and the downstream processing costs significant bandwidth and may dominate the total energy. To mitigate this concern, we propose an in-sensor computing hardware-software co-design framework for SNNs targeting image recognition tasks. Our approach reduces the bandwidth between sensing and processing by 12-96x and the resulting total energy by 2.32x compared to traditional CV processing, with a 3.8% reduction in accuracy on ImageNet. △ Less

Submitted 21 December, 2022; originally announced December 2022.

arXiv:2212.10170 [pdf, other]

Hoyer regularizer is all you need for ultra low-latency spiking neural networks

Authors: Gourav Datta, Zeyu Liu, Peter A. Beerel

Abstract: Spiking Neural networks (SNN) have emerged as an attractive spatio-temporal computing paradigm for a wide range of low-power vision tasks. However, state-of-the-art (SOTA) SNN models either incur multiple time steps which hinder their deployment in real-time use cases or increase the training complexity significantly. To mitigate this concern, we present a training framework (from scratch) for one… ▽ More Spiking Neural networks (SNN) have emerged as an attractive spatio-temporal computing paradigm for a wide range of low-power vision tasks. However, state-of-the-art (SOTA) SNN models either incur multiple time steps which hinder their deployment in real-time use cases or increase the training complexity significantly. To mitigate this concern, we present a training framework (from scratch) for one-time-step SNNs that uses a novel variant of the recently proposed Hoyer regularizer. We estimate the threshold of each SNN layer as the Hoyer extremum of a clipped version of its activation map, where the clip** threshold is trained using gradient descent with our Hoyer regularizer. This approach not only downscales the value of the trainable threshold, thereby emitting a large number of spikes for weight update with a limited number of iterations (due to only one time step) but also shifts the membrane potential values away from the threshold, thereby mitigating the effect of noise that can degrade the SNN accuracy. Our approach outperforms existing spiking, binary, and adder neural networks in terms of the accuracy-FLOPs trade-off for complex image recognition tasks. Downstream experiments on object detection also demonstrate the efficacy of our approach. △ Less

Submitted 20 December, 2022; originally announced December 2022.

Comments: 16 pages, 4 figures

arXiv:2211.04515 [pdf, other]

QuantPipe: Applying Adaptive Post-Training Quantization for Distributed Transformer Pipelines in Dynamic Edge Environments

Authors: Haonan Wang, Connor Imes, Souvik Kundu, Peter A. Beerel, Stephen P. Crago, John Paul Walters

Abstract: Pipeline parallelism has achieved great success in deploying large-scale transformer models in cloud environments, but has received less attention in edge environments. Unlike in cloud scenarios with high-speed and stable network interconnects, dynamic bandwidth in edge systems can degrade distributed pipeline performance. We address this issue with QuantPipe, a communication-efficient distributed… ▽ More Pipeline parallelism has achieved great success in deploying large-scale transformer models in cloud environments, but has received less attention in edge environments. Unlike in cloud scenarios with high-speed and stable network interconnects, dynamic bandwidth in edge systems can degrade distributed pipeline performance. We address this issue with QuantPipe, a communication-efficient distributed edge system that introduces post-training quantization (PTQ) to compress the communicated tensors. QuantPipe uses adaptive PTQ to change bitwidths in response to bandwidth dynamics, maintaining transformer pipeline performance while incurring limited inference accuracy loss. We further improve the accuracy with a directed-search analytical clip** for integer quantization method (DS-ACIQ), which bridges the gap between estimated and real data distributions. Experimental results show that QuantPipe adapts to dynamic bandwidth to maintain pipeline performance while achieving a practical model accuracy using a wide range of quantization bitwidths, e.g., improving accuracy under 2-bit quantization by 15.85\% on ImageNet compared to naive quantization. △ Less

Submitted 8 November, 2022; originally announced November 2022.

arXiv:2210.12613 [pdf, other]

Towards Energy-Efficient, Low-Latency and Accurate Spiking LSTMs

Authors: Gourav Datta, Haoqin Deng, Robert Aviles, Peter A. Beerel

Abstract: Spiking Neural Networks (SNNs) have emerged as an attractive spatio-temporal computing paradigm for complex vision tasks. However, most existing works yield models that require many time steps and do not leverage the inherent temporal dynamics of spiking neural networks, even for sequential tasks. Motivated by this observation, we propose an \rev{optimized spiking long short-term memory networks (… ▽ More Spiking Neural Networks (SNNs) have emerged as an attractive spatio-temporal computing paradigm for complex vision tasks. However, most existing works yield models that require many time steps and do not leverage the inherent temporal dynamics of spiking neural networks, even for sequential tasks. Motivated by this observation, we propose an \rev{optimized spiking long short-term memory networks (LSTM) training framework that involves a novel ANN-to-SNN conversion framework, followed by SNN training}. In particular, we propose novel activation functions in the source LSTM architecture and judiciously select a subset of them for conversion to integrate-and-fire (IF) activations with optimal bias shifts. Additionally, we derive the leaky-integrate-and-fire (LIF) activation functions converted from their non-spiking LSTM counterparts which justifies the need to jointly optimize the weights, threshold, and leak parameter. We also propose a pipelined parallel processing scheme which hides the SNN time steps, significantly improving system latency, especially for long sequences. The resulting SNNs have high activation sparsity and require only accumulate operations (AC), in contrast to expensive multiply-and-accumulates (MAC) needed for ANNs, except for the input layer when using direct encoding, yielding significant improvements in energy efficiency. We evaluate our framework on sequential learning tasks including temporal MNIST, Google Speech Commands (GSC), and UCI Smartphone datasets on different LSTM architectures. We obtain test accuracy of 94.75% with only 2 time steps with direct encoding on the GSC dataset with 4.1x lower energy than an iso-architecture standard LSTM. △ Less

Submitted 23 October, 2022; originally announced October 2022.

arXiv:2210.05451 [pdf, other]

Enabling ISP-less Low-Power Computer Vision

Authors: Gourav Datta, Zeyu Liu, Zihan Yin, Linyu Sun, Akhilesh R. Jaiswal, Peter A. Beerel

Abstract: In order to deploy current computer vision (CV) models on resource-constrained low-power devices, recent works have proposed in-sensor and in-pixel computing approaches that try to partly/fully bypass the image signal processor (ISP) and yield significant bandwidth reduction between the image sensor and the CV processing unit by downsampling the activation maps in the initial convolutional neural… ▽ More In order to deploy current computer vision (CV) models on resource-constrained low-power devices, recent works have proposed in-sensor and in-pixel computing approaches that try to partly/fully bypass the image signal processor (ISP) and yield significant bandwidth reduction between the image sensor and the CV processing unit by downsampling the activation maps in the initial convolutional neural network (CNN) layers. However, direct inference on the raw images degrades the test accuracy due to the difference in covariance of the raw images captured by the image sensors compared to the ISP-processed images used for training. Moreover, it is difficult to train deep CV models on raw images, because most (if not all) large-scale open-source datasets consist of RGB images. To mitigate this concern, we propose to invert the ISP pipeline, which can convert the RGB images of any dataset to its raw counterparts, and enable model training on raw images. We release the raw version of the COCO dataset, a large-scale benchmark for generic high-level vision tasks. For ISP-less CV systems, training on these raw images result in a 7.1% increase in test accuracy on the visual wake works (VWW) dataset compared to relying on training with traditional ISP-processed RGB datasets. To further improve the accuracy of ISP-less CV models and to increase the energy and bandwidth benefits obtained by in-sensor/in-pixel computing, we propose an energy-efficient form of analog in-pixel demosaicing that may be coupled with in-pixel CNN computations. When evaluated on raw images captured by real sensors from the PASCALRAW dataset, our approach results in a 8.1% increase in mAP. Lastly, we demonstrate a further 20.5% increase in mAP by using a novel application of few-shot learning with thirty shots each for the novel PASCALRAW dataset, constituting 3 classes. △ Less

Submitted 11 October, 2022; originally announced October 2022.

Comments: Accepted to WACV 2023

arXiv:2205.14285 [pdf, other]

P2M-DeTrack: Processing-in-Pixel-in-Memory for Energy-efficient and Real-Time Multi-Object Detection and Tracking

Authors: Gourav Datta, Souvik Kundu, Zihan Yin, Joe Mathai, Zeyu Liu, Zixu Wang, Mulin Tian, Shunlin Lu, Ravi T. Lakkireddy, Andrew Schmidt, Wael Abd-Almageed, Ajey P. Jacob, Akhilesh R. Jaiswal, Peter A. Beerel

Abstract: Today's high resolution, high frame rate cameras in autonomous vehicles generate a large volume of data that needs to be transferred and processed by a downstream processor or machine learning (ML) accelerator to enable intelligent computing tasks, such as multi-object detection and tracking. The massive amount of data transfer incurs significant energy, latency, and bandwidth bottlenecks, which h… ▽ More Today's high resolution, high frame rate cameras in autonomous vehicles generate a large volume of data that needs to be transferred and processed by a downstream processor or machine learning (ML) accelerator to enable intelligent computing tasks, such as multi-object detection and tracking. The massive amount of data transfer incurs significant energy, latency, and bandwidth bottlenecks, which hinders real-time processing. To mitigate this problem, we propose an algorithm-hardware co-design framework called Processing-in-Pixel-in-Memory-based object Detection and Tracking (P2M-DeTrack). P2M-DeTrack is based on a custom faster R-CNN-based model that is distributed partly inside the pixel array (front-end) and partly in a separate FPGA/ASIC (back-end). The proposed front-end in-pixel processing down-samples the input feature maps significantly with judiciously optimized strided convolution and pooling. Compared to a conventional baseline design that transfers frames of RGB pixels to the back-end, the resulting P2M-DeTrack designs reduce the data bandwidth between sensor and back-end by up to 24x. The designs also reduce the sensor and total energy (obtained from in-house circuit simulations at Globalfoundries 22nm technology node) per frame by 5.7x and 1.14x, respectively. Lastly, they reduce the sensing and total frame latency by an estimated 1.7x and 3x, respectively. We evaluate our approach on the multi-object object detection (tracking) task of the large-scale BDD100K dataset and observe only a 0.5% reduction in the mean average precision (0.8% reduction in the identification F1 score) compared to the state-of-the-art. △ Less

Submitted 27 May, 2022; originally announced May 2022.

Comments: 6 pages, 4 figures, 4 tables

arXiv:2204.00426 [pdf, other]

A Fast and Efficient Conditional Learning for Tunable Trade-Off between Accuracy and Robustness

Authors: Souvik Kundu, Sairam Sundaresan, Massoud Pedram, Peter A. Beerel

Abstract: Existing models that achieve state-of-the-art (SOTA) performance on both clean and adversarially-perturbed images rely on convolution operations conditioned with feature-wise linear modulation (FiLM) layers. These layers require many new parameters and are hyperparameter sensitive. They significantly increase training time, memory cost, and potential latency which can prove costly for resource-lim… ▽ More Existing models that achieve state-of-the-art (SOTA) performance on both clean and adversarially-perturbed images rely on convolution operations conditioned with feature-wise linear modulation (FiLM) layers. These layers require many new parameters and are hyperparameter sensitive. They significantly increase training time, memory cost, and potential latency which can prove costly for resource-limited or real-time applications. In this paper, we present a fast learnable once-for-all adversarial training (FLOAT) algorithm, which instead of the existing FiLM-based conditioning, presents a unique weight conditioned learning that requires no additional layer, thereby incurring no significant increase in parameter count, training time, or network latency compared to standard adversarial training. In particular, we add configurable scaled noise to the weight tensors that enables a trade-off between clean and adversarial performance. Extensive experiments show that FLOAT can yield SOTA performance improving both clean and perturbed image classification by up to ~6% and ~10%, respectively. Moreover, real hardware measurement shows that FLOAT can reduce the training time by up to 1.43x with fewer model parameters of up to 1.47x on iso-hyperparameter settings compared to the FiLM-based alternatives. Additionally, to further improve memory efficiency we introduce FLOAT sparse (FLOATS), a form of non-iterative model pruning and provide detailed empirical analysis to provide a three way accuracy-robustness-complexity trade-off for these new class of pruned conditionally trained models. △ Less

Submitted 28 March, 2022; originally announced April 2022.

Comments: 14 pages, 10 figures, 1 table

arXiv:2203.05696 [pdf, other]

Toward Efficient Hyperspectral Image Processing inside Camera Pixels

Authors: Gourav Datta, Zihan Yin, Ajey Jacob, Akhilesh R. Jaiswal, Peter A. Beerel

Abstract: Hyperspectral cameras generate a large amount of data due to the presence of hundreds of spectral bands as opposed to only three channels (red, green, and blue) in traditional cameras. This requires a significant amount of data transmission between the hyperspectral image sensor and a processor used to classify/detect/track the images, frame by frame, expending high energy and causing bandwidth an… ▽ More Hyperspectral cameras generate a large amount of data due to the presence of hundreds of spectral bands as opposed to only three channels (red, green, and blue) in traditional cameras. This requires a significant amount of data transmission between the hyperspectral image sensor and a processor used to classify/detect/track the images, frame by frame, expending high energy and causing bandwidth and security bottlenecks. To mitigate this problem, we propose a form of processing-in-pixel (PIP) that leverages advanced CMOS technologies to enable the pixel array to perform a wide range of complex operations required by the modern convolutional neural networks (CNN) for hyperspectral image recognition (HSI). Consequently, our PIP-optimized custom CNN layers effectively compress the input data, significantly reducing the bandwidth required to transmit the data downstream to the HSI processing unit. This reduces the average energy consumption associated with pixel array of cameras and the CNN processing unit by 25.06x and 3.90x respectively, compared to existing hardware implementations. Our custom models yield average test accuracies within 0.56% of the baseline models for the standard HSI benchmarks. △ Less

Submitted 10 March, 2022; originally announced March 2022.

Comments: 6 pages, 3 figures

arXiv:2203.04737 [pdf, other]

P2M: A Processing-in-Pixel-in-Memory Paradigm for Resource-Constrained TinyML Applications

Authors: Gourav Datta, Souvik Kundu, Zihan Yin, Ravi Teja Lakkireddy, Joe Mathai, Ajey Jacob, Peter A. Beerel, Akhilesh R. Jaiswal

Abstract: The demand to process vast amounts of data generated from state-of-the-art high resolution cameras has motivated novel energy-efficient on-device AI solutions. Visual data in such cameras are usually captured in the form of analog voltages by a sensor pixel array, and then converted to the digital domain for subsequent AI processing using analog-to-digital converters (ADC). Recent research has tri… ▽ More The demand to process vast amounts of data generated from state-of-the-art high resolution cameras has motivated novel energy-efficient on-device AI solutions. Visual data in such cameras are usually captured in the form of analog voltages by a sensor pixel array, and then converted to the digital domain for subsequent AI processing using analog-to-digital converters (ADC). Recent research has tried to take advantage of massively parallel low-power analog/digital computing in the form of near- and in-sensor processing, in which the AI computation is performed partly in the periphery of the pixel array and partly in a separate on-board CPU/accelerator. Unfortunately, high-resolution input images still need to be streamed between the camera and the AI processing unit, frame by frame, causing energy, bandwidth, and security bottlenecks. To mitigate this problem, we propose a novel Processing-in-Pixel-in-memory (P2M) paradigm, that customizes the pixel array by adding support for analog multi-channel, multi-bit convolution, batch normalization, and ReLU (Rectified Linear Units). Our solution includes a holistic algorithm-circuit co-design approach and the resulting P2M paradigm can be used as a drop-in replacement for embedding memory-intensive first few layers of convolutional neural network (CNN) models within foundry-manufacturable CMOS image sensor platforms. Our experimental results indicate that P2M reduces data transfer bandwidth from sensors and analog to digital conversions by ~21x, and the energy-delay product (EDP) incurred in processing a MobileNetV2 model on a TinyML use case for visual wake words dataset (VWW) by up to ~11x compared to standard near-processing or in-sensor implementations, without any significant drop in test accuracy. △ Less

Submitted 16 March, 2022; v1 submitted 6 March, 2022; originally announced March 2022.

Comments: 15 pages, 8 figures

arXiv:2201.05943 [pdf, other]

TriLock: IC Protection with Tunable Corruptibility and Resilience to SAT and Removal Attacks

Authors: Yuke Zhang, Yinghua Hu, Pierluigi Nuzzo, Peter A. Beerel

Abstract: Sequential logic locking has been studied over the last decade as a method to protect sequential circuits from reverse engineering. However, most of the existing sequential logic locking techniques are threatened by increasingly more sophisticated SAT-based attacks, efficiently using input queries to a SAT solver to rule out incorrect keys, as well as removal attacks based on structural analysis.… ▽ More Sequential logic locking has been studied over the last decade as a method to protect sequential circuits from reverse engineering. However, most of the existing sequential logic locking techniques are threatened by increasingly more sophisticated SAT-based attacks, efficiently using input queries to a SAT solver to rule out incorrect keys, as well as removal attacks based on structural analysis. In this paper, we propose TriLock, a sequential logic locking method that simultaneously addresses these vulnerabilities. TriLock can achieve high, tunable functional corruptibility while still guaranteeing exponential queries to the SAT solver in a SAT-based attack. Further, it adopts a state re-encoding method to obscure the boundary between the original state registers and those inserted by the locking method, thus making it more difficult to detect and remove the locking-related components. △ Less

Submitted 15 January, 2022; originally announced January 2022.

Comments: Accepted at Design, Automation and Test in Europe Conference (DATE), 2022

arXiv:2112.13843 [pdf, other]

BMPQ: Bit-Gradient Sensitivity Driven Mixed-Precision Quantization of DNNs from Scratch

Authors: Souvik Kundu, Shikai Wang, Qirui Sun, Peter A. Beerel, Massoud Pedram

Abstract: Large DNNs with mixed-precision quantization can achieve ultra-high compression while retaining high classification performance. However, because of the challenges in finding an accurate metric that can guide the optimization process, these methods either sacrifice significant performance compared to the 32-bit floating-point (FP-32) baseline or rely on a compute-expensive, iterative training poli… ▽ More Large DNNs with mixed-precision quantization can achieve ultra-high compression while retaining high classification performance. However, because of the challenges in finding an accurate metric that can guide the optimization process, these methods either sacrifice significant performance compared to the 32-bit floating-point (FP-32) baseline or rely on a compute-expensive, iterative training policy that requires the availability of a pre-trained baseline. To address this issue, this paper presents BMPQ, a training method that uses bit gradients to analyze layer sensitivities and yield mixed-precision quantized models. BMPQ requires a single training iteration but does not need a pre-trained baseline. It uses an integer linear program (ILP) to dynamically adjust the precision of layers during training, subject to a fixed hardware budget. To evaluate the efficacy of BMPQ, we conduct extensive experiments with VGG16 and ResNet18 on CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets. Compared to the baseline FP-32 models, BMPQ can yield models that have 15.4x fewer parameter bits with a negligible drop in accuracy. Compared to the SOTA "during training", mixed-precision training scheme, our models are 2.1x, 2.2x, and 2.9x smaller, on CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively, with an improved accuracy of up to 14.54%. △ Less

Submitted 23 December, 2021; originally announced December 2021.

Comments: 4 pages, 2 figures, 2 tables

arXiv:2112.12133 [pdf, other]

Can Deep Neural Networks be Converted to Ultra Low-Latency Spiking Neural Networks?

Authors: Gourav Datta, Peter A. Beerel

Abstract: Spiking neural networks (SNNs), that operate via binary spikes distributed over time, have emerged as a promising energy efficient ML paradigm for resource-constrained devices. However, the current state-of-the-art (SOTA) SNNs require multiple time steps for acceptable inference accuracy, increasing spiking activity and, consequently, energy consumption. SOTA training strategies for SNNs involve c… ▽ More Spiking neural networks (SNNs), that operate via binary spikes distributed over time, have emerged as a promising energy efficient ML paradigm for resource-constrained devices. However, the current state-of-the-art (SOTA) SNNs require multiple time steps for acceptable inference accuracy, increasing spiking activity and, consequently, energy consumption. SOTA training strategies for SNNs involve conversion from a non-spiking deep neural network (DNN). In this paper, we determine that SOTA conversion strategies cannot yield ultra low latency because they incorrectly assume that the DNN and SNN pre-activation values are uniformly distributed. We propose a new training algorithm that accurately captures these distributions, minimizing the error between the DNN and converted SNN. The resulting SNNs have ultra low latency and high activation sparsity, yielding significant improvements in compute efficiency. In particular, we evaluate our framework on image recognition tasks from CIFAR-10 and CIFAR-100 datasets on several VGG and ResNet architectures. We obtain top-1 accuracy of 64.19% with only 2 time steps on the CIFAR-100 dataset with ~159.2x lower compute energy compared to an iso-architecture standard DNN. Compared to other SOTA SNN models, our models perform inference 2.5-8x faster (i.e., with fewer time steps). △ Less

Submitted 22 December, 2021; originally announced December 2021.

Comments: Accepted to DATE 2022

arXiv:2110.14895 [pdf, other]

Pipeline Parallelism for Inference on Heterogeneous Edge Computing

Authors: Yang Hu, Connor Imes, Xuanang Zhao, Souvik Kundu, Peter A. Beerel, Stephen P. Crago, John Paul N. Walters

Abstract: Deep neural networks with large model sizes achieve state-of-the-art results for tasks in computer vision (CV) and natural language processing (NLP). However, these large-scale models are too compute- or memory-intensive for resource-constrained edge devices. Prior works on parallel and distributed execution primarily focus on training -- rather than inference -- using homogeneous accelerators in… ▽ More Deep neural networks with large model sizes achieve state-of-the-art results for tasks in computer vision (CV) and natural language processing (NLP). However, these large-scale models are too compute- or memory-intensive for resource-constrained edge devices. Prior works on parallel and distributed execution primarily focus on training -- rather than inference -- using homogeneous accelerators in data centers. We propose EdgePipe, a distributed framework for edge systems that uses pipeline parallelism to both speed up inference and enable running larger (and more accurate) models that otherwise cannot fit on single edge devices. EdgePipe achieves these results by using an optimal partition strategy that considers heterogeneity in compute, memory, and network bandwidth. Our empirical evaluation demonstrates that EdgePipe achieves $10.59\times$ and $11.88\times$ speedup using 16 edge devices for the ViT-Large and ViT-Huge models, respectively, with no accuracy loss. Similarly, EdgePipe improves ViT-Huge throughput by $3.93\times$ over a 4-node baseline using 16 edge devices, which independently cannot fit the model in memory. Finally, we show up to $4.16\times$ throughput improvement over the state-of-the-art PipeDream when using a heterogeneous set of devices. △ Less

Submitted 28 October, 2021; originally announced October 2021.

arXiv:2110.11417 [pdf, other]

HIRE-SNN: Harnessing the Inherent Robustness of Energy-Efficient Deep Spiking Neural Networks by Training with Crafted Input Noise

Authors: Souvik Kundu, Massoud Pedram, Peter A. Beerel

Abstract: Low-latency deep spiking neural networks (SNNs) have become a promising alternative to conventional artificial neural networks (ANNs) because of their potential for increased energy efficiency on event-driven neuromorphic hardware. Neural networks, including SNNs, however, are subject to various adversarial attacks and must be trained to remain resilient against such attacks for many applications.… ▽ More Low-latency deep spiking neural networks (SNNs) have become a promising alternative to conventional artificial neural networks (ANNs) because of their potential for increased energy efficiency on event-driven neuromorphic hardware. Neural networks, including SNNs, however, are subject to various adversarial attacks and must be trained to remain resilient against such attacks for many applications. Nevertheless, due to prohibitively high training costs associated with SNNs, analysis, and optimization of deep SNNs under various adversarial attacks have been largely overlooked. In this paper, we first present a detailed analysis of the inherent robustness of low-latency SNNs against popular gradient-based attacks, namely fast gradient sign method (FGSM) and projected gradient descent (PGD). Motivated by this analysis, to harness the model robustness against these attacks we present an SNN training algorithm that uses crafted input noise and incurs no additional training time. To evaluate the merits of our algorithm, we conducted extensive experiments with variants of VGG and ResNet on both CIFAR-10 and CIFAR-100 datasets. Compared to standard trained direct input SNNs, our trained models yield improved classification accuracy of up to 13.7% and 10.1% on FGSM and PGD attack-generated images, respectively, with negligible loss in clean image accuracy. Our models also outperform inherently robust SNNs trained on rate-coded inputs with improved or similar classification performance on attack-generated images while having up to 25x and 4.6x lower latency and computation energy, respectively. △ Less

Submitted 6 October, 2021; originally announced October 2021.

Comments: 10 pages, 11 figures, 7 tables, International Conference on Computer Vision

arXiv:2108.04892 [pdf, other]

doi 10.1109/HOST49136.2021.9702267

Fun-SAT: Functional Corruptibility-Guided SAT-Based Attack on Sequential Logic Encryption

Authors: Yinghua Hu, Yuke Zhang, Kaixin Yang, Dake Chen, Peter A. Beerel, Pierluigi Nuzzo

Abstract: The SAT attack has shown to be efficient against most combinational logic encryption methods. It can be extended to attack sequential logic encryption techniques by leveraging circuit unrolling and model checking methods. However, with no guidance on the number of times that a circuit needs to be unrolled to find the correct key, the attack tends to solve many time-consuming Boolean satisfiability… ▽ More The SAT attack has shown to be efficient against most combinational logic encryption methods. It can be extended to attack sequential logic encryption techniques by leveraging circuit unrolling and model checking methods. However, with no guidance on the number of times that a circuit needs to be unrolled to find the correct key, the attack tends to solve many time-consuming Boolean satisfiability (SAT) and model checking problems, which can significantly hamper its efficiency. In this paper, we introduce Fun-SAT, a functional corruptibility-guided SAT-based attack that can significantly decrease the SAT solving and model checking time of a SAT-based attack on sequential encryption by efficiently estimating the minimum required number of circuit unrollings. Fun-SAT relies on a notion of functional corruptibility for encrypted sequential circuits and its relationship with the required number of circuit unrollings in a SAT-based attack. Numerical results show that Fun-SAT can be, on average, 90x faster than previous attacks against state-of-the-art encryption methods, when both attacks successfully complete before a one-day time-out. Moreover, Fun-SAT completes before the time-out on many more circuits. △ Less

Submitted 10 August, 2021; originally announced August 2021.

Comments: Accepted at IEEE International Symposium on Hardware Oriented Security and Trust (HOST) 2021

arXiv:2108.03719 [pdf, other]

A High Performance and Robust FIFO Synchronizer-Interface for Crossing Clock Domains in SFQ Logic

Authors: Gourav Datta, Shidie Lin, Peter A. Beerel

Abstract: Digital single-flux quantum (SFQ) technology promises to meet the demands of ultra low power and high speed computing needed for future exascale supercomputing platforms. However, clocking SFQ logic circuits remains a challenge due to the presence of a large number of on-chip clock sinks, and hence, decomposing large designs into multiple independent clock domains similar to CMOS, have been propos… ▽ More Digital single-flux quantum (SFQ) technology promises to meet the demands of ultra low power and high speed computing needed for future exascale supercomputing platforms. However, clocking SFQ logic circuits remains a challenge due to the presence of a large number of on-chip clock sinks, and hence, decomposing large designs into multiple independent clock domains similar to CMOS, have been proposed. However, such clock domains demand efficient synchronizing First-in-first-out (FIFO) buffers and robust interfaces to safely transfer data from one clock domain to another. In this brief, we propose such a FIFO synchronizer and clock-domain crossing interface for both uni and bi-directional data transfer without any significant degradation of the clock frequency. Our proposal scales to complex gate-level pipelined SFQ logic cores while demonstrating extremely low Bit Error Rate (BER) and is unaffected by noise, given current SFQ lithography feature sizes. △ Less

Submitted 8 August, 2021; originally announced August 2021.

Comments: 5 pages, 5 figures, under review in IEEE Transactions on Circuits and Systems-II

arXiv:2107.12445 [pdf, other]

Towards Low-Latency Energy-Efficient Deep SNNs via Attention-Guided Compression

Authors: Souvik Kundu, Gourav Datta, Massoud Pedram, Peter A. Beerel

Abstract: Deep spiking neural networks (SNNs) have emerged as a potential alternative to traditional deep learning frameworks, due to their promise to provide increased compute efficiency on event-driven neuromorphic hardware. However, to perform well on complex vision applications, most SNN training frameworks yield large inference latency which translates to increased spike activity and reduced energy eff… ▽ More Deep spiking neural networks (SNNs) have emerged as a potential alternative to traditional deep learning frameworks, due to their promise to provide increased compute efficiency on event-driven neuromorphic hardware. However, to perform well on complex vision applications, most SNN training frameworks yield large inference latency which translates to increased spike activity and reduced energy efficiency. Hence,minimizing average spike activity while preserving accuracy indeep SNNs remains a significant challenge and opportunity.This paper presents a non-iterative SNN training technique thatachieves ultra-high compression with reduced spiking activitywhile maintaining high inference accuracy. In particular, our framework first uses the attention-maps of an un compressed meta-model to yield compressed ANNs. This step can be tuned to support both irregular and structured channel pruning to leverage computational benefits over a broad range of platforms. The framework then performs sparse-learning-based supervised SNN training using direct inputs. During the training, it jointly optimizes the SNN weight, threshold, and leak parameters to drastically minimize the number of time steps required while retaining compression. To evaluate the merits of our approach, we performed experiments with variants of VGG and ResNet, on both CIFAR-10 and CIFAR-100, and VGG16 on Tiny-ImageNet.The SNN models generated through the proposed technique yield SOTA compression ratios of up to 33.4x with no significant drops in accuracy compared to baseline unpruned counterparts. Compared to existing SNN pruning methods, we achieve up to 8.3x higher compression with improved accuracy. △ Less

Submitted 16 July, 2021; originally announced July 2021.

Comments: 10 Pages, 8 Figures, 5 Tables

arXiv:2107.12374 [pdf, other]

Training Energy-Efficient Deep Spiking Neural Networks with Single-Spike Hybrid Input Encoding

Authors: Gourav Datta, Souvik Kundu, Peter A. Beerel

Abstract: Spiking Neural Networks (SNNs) have emerged as an attractive alternative to traditional deep learning frameworks, since they provide higher computational efficiency in event driven neuromorphic hardware. However, the state-of-the-art (SOTA) SNNs suffer from high inference latency, resulting from inefficient input encoding and training techniques. The most widely used input coding schemes, such as… ▽ More Spiking Neural Networks (SNNs) have emerged as an attractive alternative to traditional deep learning frameworks, since they provide higher computational efficiency in event driven neuromorphic hardware. However, the state-of-the-art (SOTA) SNNs suffer from high inference latency, resulting from inefficient input encoding and training techniques. The most widely used input coding schemes, such as Poisson based rate-coding, do not leverage the temporal learning capabilities of SNNs. This paper presents a training framework for low-latency energy-efficient SNNs that uses a hybrid encoding scheme at the input layer in which the analog pixel values of an image are directly applied during the first timestep and a novel variant of spike temporal coding is used during subsequent timesteps. In particular, neurons in every hidden layer are restricted to fire at most once per image which increases activation sparsity. To train these hybrid-encoded SNNs, we propose a variant of the gradient descent based spike timing dependent back propagation (STDB) mechanism using a novel cross entropy loss function based on both the output neurons' spike time and membrane potential. The resulting SNNs have reduced latency and high activation sparsity, yielding significant improvements in computational efficiency. In particular, we evaluate our proposed training scheme on image classification tasks from CIFAR-10 and CIFAR-100 datasets on several VGG architectures. We achieve top-1 accuracy of $66.46$\% with $5$ timesteps on the CIFAR-100 dataset with ${\sim}125\times$ less compute energy than an equivalent standard ANN. Additionally, our proposed SNN performs $5$-$300\times$ faster inference compared to other state-of-the-art rate or temporally coded SNN models. △ Less

Submitted 26 July, 2021; originally announced July 2021.

Comments: Extended version of arXiv:2107.11979

arXiv:2107.11979 [pdf, other]

HYPER-SNN: Towards Energy-efficient Quantized Deep Spiking Neural Networks for Hyperspectral Image Classification

Authors: Gourav Datta, Souvik Kundu, Akhilesh R. Jaiswal, Peter A. Beerel

Abstract: Hyper spectral images (HSI) provide rich spectral and spatial information across a series of contiguous spectral bands. However, the accurate processing of the spectral and spatial correlation between the bands requires the use of energy-expensive 3-D Convolutional Neural Networks (CNNs). To address this challenge, we propose the use of Spiking Neural Networks (SNNs) that are generated from iso-ar… ▽ More Hyper spectral images (HSI) provide rich spectral and spatial information across a series of contiguous spectral bands. However, the accurate processing of the spectral and spatial correlation between the bands requires the use of energy-expensive 3-D Convolutional Neural Networks (CNNs). To address this challenge, we propose the use of Spiking Neural Networks (SNNs) that are generated from iso-architecture CNNs and trained with quantization-aware gradient descent to optimize their weights, membrane leak, and firing thresholds. During both training and inference, the analog pixel values of a HSI are directly applied to the input layer of the SNN without the need to convert to a spike-train. The reduced latency of our training technique combined with high activation sparsity yields significant improvements in computational efficiency. We evaluate our proposal using three HSI datasets on a 3-D and a 3-D/2-D hybrid convolutional architecture. We achieve overall accuracy, average accuracy, and kappa coefficient of 98.68%, 98.34%, and 98.20% respectively with 5 time steps (inference latency) and 6-bit weight quantization on the Indian Pines dataset. In particular, our models achieved accuracies similar to state-of-the-art (SOTA) with 560.6 and 44.8 times less compute energy on average over three HSI datasets than an iso-architecture full-precision and 6-bit quantized CNN, respectively. △ Less

Submitted 28 July, 2021; v1 submitted 26 July, 2021; originally announced July 2021.

arXiv:2101.12279 [pdf, other]

GF-Flush: A GF(2) Algebraic Attack on Secure Scan Chains

Authors: Dake Chen, Chunxiao Lin, Peter A. Beerel

Abstract: Scan chains provide increased controllability and observability for testing digital circuits. The increased testability, however, can also be a source of information leakage for sensitive designs. The state-of-the-art defenses to secure scan chains apply dynamic keys to pseudo-randomly invert the scan vectors. In this paper, we pinpoint an algebraic vulnerability of these dynamic defenses that inv… ▽ More Scan chains provide increased controllability and observability for testing digital circuits. The increased testability, however, can also be a source of information leakage for sensitive designs. The state-of-the-art defenses to secure scan chains apply dynamic keys to pseudo-randomly invert the scan vectors. In this paper, we pinpoint an algebraic vulnerability of these dynamic defenses that involves creating and solving a system of linear equations over the finite field GF(2). In particular, we propose a novel GF(2)-based flush attack that breaks even the most rigorous version of state-of-the-art dynamic defenses. Our experimental results demonstrate that our attack recovers the key as long as 500 bits in less than 7 seconds, the attack times are about one hundredth of state-of-the-art SAT based attacks on the same defenses. We then demonstrate how our attacks can be extended to scan chains compressed with Multiple-Input Signature Registers (MISRs). △ Less

Submitted 28 January, 2021; originally announced January 2021.

Comments: Submitted to IEEE Transactions on Computer-Aided Design of Integrated Circuits And Systems

arXiv:2011.03083 [pdf, other]

A Tunable Robust Pruning Framework Through Dynamic Network Rewiring of DNNs

Authors: Souvik Kundu, Mahdi Nazemi, Peter A. Beerel, Massoud Pedram

Abstract: This paper presents a dynamic network rewiring (DNR) method to generate pruned deep neural network (DNN) models that are robust against adversarial attacks yet maintain high accuracy on clean images. In particular, the disclosed DNR method is based on a unified constrained optimization formulation using a hybrid loss function that merges ultra-high model compression with robust adversarial trainin… ▽ More This paper presents a dynamic network rewiring (DNR) method to generate pruned deep neural network (DNN) models that are robust against adversarial attacks yet maintain high accuracy on clean images. In particular, the disclosed DNR method is based on a unified constrained optimization formulation using a hybrid loss function that merges ultra-high model compression with robust adversarial training. This training strategy dynamically adjusts inter-layer connectivity based on per-layer normalized momentum computed from the hybrid loss function. In contrast to existing robust pruning frameworks that require multiple training iterations, the proposed learning strategy achieves an overall target pruning ratio with only a single training iteration and can be tuned to support both irregular and structured channel pruning. To evaluate the merits of DNR, experiments were performed with two widely accepted models, namely VGG16 and ResNet-18, on CIFAR-10, CIFAR-100 as well as with VGG16 on Tiny-ImageNet. Compared to the baseline uncompressed models, DNR provides over20x compression on all the datasets with no significant drop in either clean or adversarial classification accuracy. Moreover, our experiments show that DNR consistently finds compressed models with better clean and adversarial image classification performance than what is achievable through state-of-the-art alternatives. △ Less

Submitted 24 November, 2020; v1 submitted 3 November, 2020; originally announced November 2020.

Comments: 8 pages, 4 figures, conference paper

arXiv:2004.10655 [pdf, other]

Formal Verification of Flow Equivalence in Desynchronized Designs

Authors: Jennifer Paykin, Brian Huffman, Daniel M. Zimmerman, Peter A. Beerel

Abstract: Seminal work by Cortadella, Kondratyev, Lavagno, and Sotiriou includes a hand-written proof that a particular handshaking protocol preserves flow equivalence, a notion of equivalence between synchronous latch-based specifications and their desynchronized bundled-data asynchronous implementations. In this work we identify a counterexample to Cortadella et al.'s proof illustrating how their protocol… ▽ More Seminal work by Cortadella, Kondratyev, Lavagno, and Sotiriou includes a hand-written proof that a particular handshaking protocol preserves flow equivalence, a notion of equivalence between synchronous latch-based specifications and their desynchronized bundled-data asynchronous implementations. In this work we identify a counterexample to Cortadella et al.'s proof illustrating how their protocol can in fact lead to a violation of flow equivalence. However, two of the less concurrent protocols identified in their paper do preserve flow equivalence. To verify this fact, we formalize flow equivalence in the Coq proof assistant and provide mechanized, machine-checkable proofs of our results. △ Less

Submitted 6 April, 2020; originally announced April 2020.

Comments: To appear in ASYNC 2020

arXiv:2004.04804 [pdf, other]

Modeling and Characterization of Metastability in Single Flux Quantum (SFQ) Synchronizers

Authors: Gourav Datta, Peter A. Beerel

Abstract: Despite the promises of low-power and high-frequency of single-flux quantum (SFQ) technology, scaling these circuits remains a serious challenge that motivates the support of multiple SFQ clock domains. Towards this end, this paper analyzes the impact of setup time violations and metastability in SFQ circuits comparing the derived analytical models to their CMOS counterparts. It then extends this… ▽ More Despite the promises of low-power and high-frequency of single-flux quantum (SFQ) technology, scaling these circuits remains a serious challenge that motivates the support of multiple SFQ clock domains. Towards this end, this paper analyzes the impact of setup time violations and metastability in SFQ circuits comparing the derived analytical models to their CMOS counterparts. It then extends this model to estimate the Mean Time Between Failure (MTBF) of flip-flop-based synchronizers and curve fits this model to simulations in the state-of-the-art SFQ5ee process. Interestingly, we find a two-flop SFQ synchronizer has an estimated MTBF of 10^6 years. △ Less

Submitted 9 April, 2020; originally announced April 2020.

arXiv:2004.00974 [pdf, other]

Deep-n-Cheap: An Automated Search Framework for Low Complexity Deep Learning

Authors: Sourya Dey, Saikrishna C. Kanala, Keith M. Chugg, Peter A. Beerel

Abstract: We present Deep-n-Cheap -- an open-source AutoML framework to search for deep learning models. This search includes both architecture and training hyperparameters, and supports convolutional neural networks and multi-layer perceptrons. Our framework is targeted for deployment on both benchmark and custom datasets, and as a result, offers a greater degree of search space customizability as compared… ▽ More We present Deep-n-Cheap -- an open-source AutoML framework to search for deep learning models. This search includes both architecture and training hyperparameters, and supports convolutional neural networks and multi-layer perceptrons. Our framework is targeted for deployment on both benchmark and custom datasets, and as a result, offers a greater degree of search space customizability as compared to a more limited search over only pre-existing models from literature. We also introduce the technique of 'search transfer', which demonstrates the generalization capabilities of the models found by our framework to multiple datasets. Deep-n-Cheap includes a user-customizable complexity penalty which trades off performance with training time or number of parameters. Specifically, our framework results in models offering performance comparable to state-of-the-art while taking 1-2 orders of magnitude less time to train than models from other AutoML and model search frameworks. Additionally, this work investigates and develops various insights regarding the search process. In particular, we show the superiority of a greedy strategy and justify our choice of Bayesian optimization as the primary search methodology over random / grid search. △ Less

Submitted 5 September, 2020; v1 submitted 27 March, 2020; originally announced April 2020.

Comments: Accepted as a conference paper at ACML 2020

arXiv:2001.10715 [pdf, other]

qBSA: Logic Design of a 32-bit Block-Skewed RSFQ Arithmetic Logic Unit

Authors: Souvik Kundu, Gourav datta, Peter A. Beerel, Massoud Pedram

Abstract: Single flux quantum (SFQ) circuits are an attractive beyond-CMOS technology because they promise two orders of magnitude lower power at clock frequencies exceeding 25 GHz.However, every SFQ gate is clocked creating very deep gate-level pipelines that are difficult to keep full, particularly for sequences that include data-dependent operations. This paper proposes to increase the throughput of SFQ… ▽ More Single flux quantum (SFQ) circuits are an attractive beyond-CMOS technology because they promise two orders of magnitude lower power at clock frequencies exceeding 25 GHz.However, every SFQ gate is clocked creating very deep gate-level pipelines that are difficult to keep full, particularly for sequences that include data-dependent operations. This paper proposes to increase the throughput of SFQ pipelines by re-designing the datapath to accept and operate on least-significant bits (LSBs) clock cycles earlier than more significant bits. This skewed datapath approach reduces the latency of the LSB side which can be feedback earlier for use in subsequent data-dependent operations increasing their throughput. In particular,we propose to group the bits into 4-bit blocks that are operatedon concurrently and create block-skewed datapath units for 32-bit operation. This skewed approach allows a subsequent data-dependent operation to start evaluating as soon as the first 4-bit block completes. Using this general approach, we developa block-skewed MIPS-compatible 32-bit ALU. Our gate-level Verilog design improves the throughput of 32-bit data dependent operations by 2x and 1.5x compared to previously proposed 4-bit bit-slice and 32-bit Ladner-Fischer ALUs respectively. △ Less

Submitted 29 January, 2020; originally announced January 2020.

Comments: 3 pages, 3 figures

arXiv:2001.10710 [pdf, other]

Pre-defined Sparsity for Low-Complexity Convolutional Neural Networks

Authors: Souvik Kundu, Mahdi Nazemi, Massoud Pedram, Keith M. Chugg, Peter A. Beerel

Abstract: The high energy cost of processing deep convolutional neural networks impedes their ubiquitous deployment in energy-constrained platforms such as embedded systems and IoT devices. This work introduces convolutional layers with pre-defined sparse 2D kernels that have support sets that repeat periodically within and across filters. Due to the efficient storage of our periodic sparse kernels, the par… ▽ More The high energy cost of processing deep convolutional neural networks impedes their ubiquitous deployment in energy-constrained platforms such as embedded systems and IoT devices. This work introduces convolutional layers with pre-defined sparse 2D kernels that have support sets that repeat periodically within and across filters. Due to the efficient storage of our periodic sparse kernels, the parameter savings can translate into considerable improvements in energy efficiency due to reduced DRAM accesses, thus promising significant improvements in the trade-off between energy consumption and accuracy for both training and inference. To evaluate this approach, we performed experiments with two widely accepted datasets, CIFAR-10 and Tiny ImageNet in sparse variants of the ResNet18 and VGG16 architectures. Compared to baseline models, our proposed sparse variants require up to 82% fewer model parameters with 5.6times fewer FLOPs with negligible loss in accuracy for ResNet18 on CIFAR-10. For VGG16 trained on Tiny ImageNet, our approach requires 5.8times fewer FLOPs and up to 83.3% fewer model parameters with a drop in top-5 (top-1) accuracy of only 1.2% (2.1%). We also compared the performance of our proposed architectures with that of ShuffleNet andMobileNetV2. Using similar hyperparameters and FLOPs, our ResNet18 variants yield an average accuracy improvement of 2.8%. △ Less

Submitted 4 February, 2020; v1 submitted 29 January, 2020; originally announced January 2020.

Comments: 14 pages, 13 figures

arXiv:2001.04039 [pdf, other]

doi 10.1109/TCSI.2020.2965073

SERAD: Soft Error Resilient Asynchronous Design using a Bundled Data Protocol

Authors: Sai Aparna Aketi, Smriti Gupta, Huimei Cheng, Joycee Mekie, Peter A. Beerel

Abstract: The risk of soft errors due to radiation continues to be a significant challenge for engineers trying to build systems that can handle harsh environments. Building systems that are Radiation Hardened by Design (RHBD) is the preferred approach, but existing techniques are expensive in terms of performance, power, and/or area. This paper introduces a novel soft-error resilient asynchronous bundled-d… ▽ More The risk of soft errors due to radiation continues to be a significant challenge for engineers trying to build systems that can handle harsh environments. Building systems that are Radiation Hardened by Design (RHBD) is the preferred approach, but existing techniques are expensive in terms of performance, power, and/or area. This paper introduces a novel soft-error resilient asynchronous bundled-data design template, SERAD, which uses a combination of temporal and spatial redundancy to mitigate Single Event Transients (SETs) and upsets (SEUs). SERAD uses Error Detecting Logic (EDL) to detect SETs at the inputs of sequential elements and correct them via re-sampling. Because SERAD only pays the delay penalty in the presence of an SET, which rarely occurs, its average performance is comparable to the baseline synchronous design. We tested the SERAD design using a combination of Spice and Verilog simulations and evaluated its impact on area, frequency, and power on an open-core MIPS-like processor using a NCSU 45nm cell library. Our post-synthesis results show that the SERAD design consumes less than half of the area of the Triple Modular Redundancy (TMR), exhibits significantly less performance degradation than Glitch Filtering (GF), and consumes no more total power than the baseline unhardened design. △ Less

Submitted 12 January, 2020; originally announced January 2020.

arXiv:1910.09876 [pdf, other]

Neural Network Training with Approximate Logarithmic Computations

Authors: Arnab Sanyal, Peter A. Beerel, Keith M. Chugg

Abstract: The high computational complexity associated with training deep neural networks limits online and real-time training on edge devices. This paper proposed an end-to-end training and inference scheme that eliminates multiplications by approximate operations in the log-domain which has the potential to significantly reduce implementation complexity. We implement the entire training procedure in the l… ▽ More The high computational complexity associated with training deep neural networks limits online and real-time training on edge devices. This paper proposed an end-to-end training and inference scheme that eliminates multiplications by approximate operations in the log-domain which has the potential to significantly reduce implementation complexity. We implement the entire training procedure in the log-domain, with fixed-point data representations. This training procedure is inspired by hardware-friendly approximations of log-domain addition which are based on look-up tables and bit-shifts. We show that our 16-bit log-based training can achieve classification accuracy within approximately 1% of the equivalent floating-point baselines for a number of commonly used datasets. △ Less

Submitted 22 October, 2019; originally announced October 2019.

arXiv:1910.04907 [pdf, other]

Metastability-Resilient Synchronization FIFO for SFQ Logic

Authors: Gourav Datta, Haolin Cong, Souvik Kundu, Peter A. Beerel

Abstract: Digital single-flux quantum (SFQ) technology promises to meet the demands of ultra low power and high speed computing needed for future exascale supercomputing systems. The combination of ultra high clock frequencies, gate-level pipelines, and numerous sources of variability in SFQ circuits, however, make low-skew global clock distribution a challenge. This motivates the support of multiple indepe… ▽ More Digital single-flux quantum (SFQ) technology promises to meet the demands of ultra low power and high speed computing needed for future exascale supercomputing systems. The combination of ultra high clock frequencies, gate-level pipelines, and numerous sources of variability in SFQ circuits, however, make low-skew global clock distribution a challenge. This motivates the support of multiple independent clock domains and related clock domain crossing circuits that enable reliable communication across domains. Existing J-SIM simulation models indicate that setup violations can cause clock-to-Q increases of up to 100%. This paper first shows that naive SFQ clock domain crossing (CDC) first-in-first-out buffers (FIFOs) are vulnerable to these delay increases, motivating the need for more robust CDC FIFOs. Inspired by CMOS multi-flip-flop asynchronous FIFO synchronizers, we then propose a novel 1-bit metastability-resilient SFQ CDC FIFO that simulations show delivers over a 1000 reduction in logical error rate at 30 GHz. Moreover, for a 10-stage FIFO, the Josephson junction (JJ) area of our proposed design is only 7.5% larger than the non-resilient counterpart. Finally, we propose design guidelines that define the minimal FIFO depth subject to both throughput and burstiness constraints. △ Less

Submitted 23 October, 2019; v1 submitted 10 October, 2019; originally announced October 2019.

Comments: Accepted in ISEC 2019

Showing 1–50 of 59 results for author: Beerel, P A