Search | arXiv e-print repository

SpikePipe: Accelerated Training of Spiking Neural Networks via Inter-Layer Pipelining and Multiprocessor Scheduling

Authors: Sai Sanjeet, Bibhu Datta Sahoo, Keshab K. Parhi

Abstract: Spiking Neural Networks (SNNs) have gained popularity due to their high energy efficiency. Prior works have proposed various methods for training SNNs, including backpropagation-based methods. Training SNNs is computationally expensive compared to their conventional counterparts and would benefit from multiprocessor hardware acceleration. This is the first paper to propose inter-layer pipelining to accelerate training in SNNs using systolic array-based processors and multiprocessor scheduling. The impact of training using delayed gradients is observed using three networks trained on different datasets, showing no degradation for small networks and < 10% degradation for large networks. The mapping of various training tasks of the SNN onto systolic arrays is formulated, and the proposed scheduling method is evaluated on the three networks. The results are compared against standard pipelining algorithms. The results show that the proposed method achieves an average speedup of 1.6X compared to standard pipelining algorithms, with upwards of 2X improvement in some cases. The incurred communication overhead due to the proposed method is less than 0.5% of the total required communication of training.

Submitted 10 June, 2024; originally announced June 2024.

arXiv:2312.02407 [pdf, other]

doi 10.1109/OJCAS.2024.3381508

Robust Clustering using Hyperdimensional Computing

Authors: Lulu Ge, Keshab K. Parhi

Abstract: This paper addresses the clustering of data in the hyperdimensional computing (HDC) domain. In prior work, an HDC-based clustering framework, referred to as HDCluster, has been proposed. However, the performance of the existing HDCluster is not robust. The performance of HDCluster is degraded as the hypervectors for the clusters are chosen at random during the initialization step. To overcome this bottleneck, we assign the initial cluster hypervectors by exploring the similarity of the encoded data, referred to as query hypervectors. Intra-cluster hypervectors have a higher similarity than inter-cluster hypervectors. Harnessing the similarity results among query hypervectors, this paper proposes four HDC-based clustering algorithms: similarity-based k-means, equal bin-width histogram, equal bin-height histogram, and similarity-based affinity propagation. Experimental results illustrate that: (i) Compared to the existing HDCluster, our proposed HDC-based clustering algorithms can achieve better accuracy, more robust performance, fewer iterations, and less execution time. Similarity-based affinity propagation outperforms the other three HDC-based clustering algorithms on eight datasets by 2% to 38% in clustering accuracy. (ii) Even for one-pass clustering, i.e., without any iterative update of the cluster hypervectors, our proposed algorithms can provide more robust clustering accuracy than HDCluster. (iii) Over eight datasets, five out of eight can achieve higher or comparable accuracy when projected onto the hyperdimensional space. Traditional clustering is more desirable than HDC when the number of clusters, $k$, is large.

Submitted 4 December, 2023; originally announced December 2023.

Journal ref: IEEE Open Journal of Circuits and Systems, Vol. 5, pp. 102-116, 2024

arXiv:2310.04618 [pdf, other]

doi 10.1109/ICCAD57390.2023.10323839

KyberMat: Efficient Accelerator for Matrix-Vector Polynomial Multiplication in CRYSTALS-Kyber Scheme via NTT and Polyphase Decomposition

Authors: Weihang Tan, Yingjie Lao, Keshab K. Parhi

Abstract: CRYSTALS-Kyber (Kyber) is one of the post-quantum cryptography (PQC) key-encapsulation mechanism (KEM) schemes selected during the standardization process. This paper addresses optimization for Kyber architecture with respect to latency and throughput constraints. Specifically, matrix-vector multiplication and number theoretic transform (NTT)-based polynomial multiplication are critical operations and bottlenecks that require optimization. To address this challenge, we propose an algorithm and hardware co-design approach to systematically optimize matrix-vector multiplication and NTT-based polynomial multiplication by employing a novel sub-structure sharing technique in order to reduce computational complexity, i.e., the number of modular multiplications and modular additions/subtractions consumed. The sub-structure sharing approach is inspired by prior fast parallel approaches based on polyphase decomposition. The proposed efficient feed-forward architecture achieves high speed, low latency, and full utilization of all hardware components, which can significantly enhance the overall efficiency of the Kyber scheme. The FPGA implementation results show that our proposed design, using the fast two-parallel structure, leads to an approximate reduction of 90% in execution time, along with a 66 times improvement in throughput performance.

Submitted 6 October, 2023; originally announced October 2023.

Comments: Proc. 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD), San Francisco, CA, Oct. 29 - Nov. 2, 2023

Journal ref: 2023 IEEE/ACM International Conference on Computer Aided Design (ICCAD)

arXiv:2309.09035 [pdf, other]

doi 10.1109/ICASSP48485.2024.10447370

A Low-Latency FFT-IFFT Cascade Architecture

Authors: Keshab K. Parhi

Abstract: This paper addresses the design of a partly-parallel cascaded FFT-IFFT architecture that does not require any intermediate buffer. Folding can be used to design partly-parallel architectures for FFT and IFFT. While many cascaded FFT-IFFT architectures can be designed using various folding sets for the FFT and the IFFT, for a specified folded FFT architecture, there exists a unique folding set to design the IFFT architecture that does not require an intermediate buffer. Such a folding set is designed by processing the output of the FFT as soon as possible (ASAP) in the folded IFFT. Elimination of the intermediate buffer reduces latency and saves area. The proposed approach is also extended to interleaved processing of multi-channel time-series. The proposed FFT-IFFT cascade architecture saves about N/2 memory elements and N/4 clock cycles of latency compared to a design with identical folding sets. For the 2-interleaved FFT-IFFT cascade, the memory and latency savings are, respectively, N/2 units and N/2 clock cycles, compared to a design with identical folding sets.

Submitted 16 September, 2023; originally announced September 2023.

Journal ref: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, April 2024

arXiv:2306.12519 [pdf, other]

doi 10.1109/MSP.2024.3368239

Long Polynomial Modular Multiplication using Low-Complexity Number Theoretic Transform

Authors: Sin-Wei Chiu, Keshab K. Parhi

Abstract: This tutorial aims to connect polynomial modular multiplication over a ring to circular convolution and the discrete Fourier transform (DFT). The main goal is to extend the well-known theory of DFT in signal processing (SP) to other applications involving polynomials in a ring such as homomorphic encryption (HE). HE allows any third party to operate on the encrypted data without decrypting it in advance. Since most HE schemes are constructed from the ring-learning with errors (R-LWE) problem, efficient polynomial modular multiplication implementation becomes critical. Any improvement in the execution of these building blocks would have significant consequences for the global performance of HE. This lecture note describes three approaches to implementing long polynomial modular multiplication using the number theoretic transform (NTT): zero-padded convolution; convolution without zero-padding, also referred to as negative wrapped convolution (NWC); and low-complexity NWC (LC-NWC).

Submitted 22 December, 2023; v1 submitted 21 June, 2023; originally announced June 2023.

Comments: 10 pages

Journal ref: IEEE Signal Processing Magazine, 41(1), pp. 92-102, Jan. 2024
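
As an illustrative sketch of the negative wrapped convolution (NWC) approach named in the abstract above, the following Python computes a polynomial product modulo (x^n + 1, q) by weighting the inputs with powers of a primitive 2n-th root of unity psi, multiplying pointwise in the NTT domain, and removing the weights afterwards. The toy parameters (n = 4, q = 17, psi = 2), the function names, and the O(n^2) transform are assumptions chosen for readability; they are not taken from the tutorial.

```python
# Toy negative wrapped convolution (NWC) via a naive NTT.
# Assumed parameters: n = 4, q = 17, psi = 2 (psi**n == -1 mod q).

def ntt(a, omega, q):
    """Naive O(n^2) number theoretic transform with n-th root of unity omega."""
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, q) for j in range(n)) % q
            for i in range(n)]

def intt(A, omega, q):
    """Inverse of ntt(): uses omega^-1 and scales by n^-1 mod q."""
    n = len(A)
    n_inv = pow(n, -1, q)          # modular inverse (Python 3.8+)
    omega_inv = pow(omega, -1, q)
    return [(n_inv * sum(A[j] * pow(omega_inv, i * j, q) for j in range(n))) % q
            for i in range(n)]

def nwc_multiply(a, b, psi, q):
    """Multiply a(x) * b(x) mod (x^n + 1, q) without zero-padding."""
    n = len(a)
    omega = (psi * psi) % q
    # Pre-weight coefficient i by psi^i.
    a_w = [(a[i] * pow(psi, i, q)) % q for i in range(n)]
    b_w = [(b[i] * pow(psi, i, q)) % q for i in range(n)]
    # Pointwise product in the NTT domain is a circular convolution.
    C = [(x * y) % q for x, y in zip(ntt(a_w, omega, q), ntt(b_w, omega, q))]
    c_w = intt(C, omega, q)
    # Post-weight coefficient i by psi^-i to undo the weighting.
    psi_inv = pow(psi, -1, q)
    return [(c_w[i] * pow(psi_inv, i, q)) % q for i in range(n)]

def schoolbook_negacyclic(a, b, q):
    """Reference O(n^2) product mod (x^n + 1, q): wrapped terms change sign."""
    n = len(a)
    c = [0] * n
    for i in range(n):
        for j in range(n):
            k = i + j
            if k < n:
                c[k] = (c[k] + a[i] * b[j]) % q
            else:
                c[k - n] = (c[k - n] - a[i] * b[j]) % q
    return c
```

Calling `nwc_multiply([1, 2, 3, 4], [5, 6, 7, 8], 2, 17)` and `schoolbook_negacyclic([1, 2, 3, 4], [5, 6, 7, 8], 17)` should give the same coefficient list, which is a quick way to check the weighting step.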

arXiv:2304.13539 [pdf, other]

doi 10.1109/MCAS.2023.3267921

Tensor Decomposition for Model Reduction in Neural Networks: A Review

Authors: Xingyi Liu, Keshab K. Parhi

Abstract: Modern neural networks have revolutionized the fields of computer vision (CV) and Natural Language Processing (NLP). They are widely used for solving complex CV tasks and NLP tasks such as image classification, image generation, and machine translation. Most state-of-the-art neural networks are over-parameterized and require a high computational cost. One straightforward solution is to replace the layers of the networks with their low-rank tensor approximations using different tensor decomposition methods. This paper reviews six tensor decomposition methods and illustrates their ability to compress model parameters of convolutional neural networks (CNNs), recurrent neural networks (RNNs) and Transformers. The accuracy of some compressed models can be higher than the original versions. Evaluations indicate that tensor decompositions can achieve significant reductions in model size, run-time and energy consumption, and are well suited for implementing neural networks on edge devices.

Submitted 26 April, 2023; originally announced April 2023.

Comments: IEEE Circuits and Systems Magazine, 2023

Journal ref: IEEE Circuits and Systems Magazine, pp. 8-28, Second Quarter, 2023

arXiv:2304.13532 [pdf, other]

doi 10.1109/TCAD.2023.3291672

SCV-GNN: Sparse Compressed Vector-based Graph Neural Network Aggregation

Authors: Nanda K. Unnikrishnan, Joe Gould, Keshab K. Parhi

Abstract: Graph neural networks (GNNs) have emerged as a powerful tool to process graph-based data in fields like communication networks, molecular interactions, chemistry, social networks, and neuroscience. GNNs are characterized by the ultra-sparse nature of their adjacency matrix that necessitates the development of dedicated hardware beyond general-purpose sparse matrix multipliers. While there has been extensive research on designing dedicated hardware accelerators for GNNs, few have extensively explored the impact of the sparse storage format on the efficiency of the GNN accelerators. This paper proposes SCV-GNN with the novel sparse compressed vectors (SCV) format optimized for the aggregation operation. We use Z-Morton ordering to derive a data-locality-based computation ordering and partitioning scheme. The paper also presents how the proposed SCV-GNN is scalable on a vector processing system. Experimental results over various datasets show that the proposed method achieves a geometric mean speedup of $7.96\times$ and $7.04\times$ over compressed sparse column (CSC) and compressed sparse row (CSR) aggregation operations, respectively. The proposed method also reduces the memory traffic by a factor of $3.29\times$ and $4.37\times$ over CSC and CSR, respectively. Thus, the proposed novel aggregation format reduces the latency and memory access for GNN inference.

Submitted 26 April, 2023; originally announced April 2023.

Journal ref: IEEE Transactions on Computer Aided Design (TCAD), 2023
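
The Z-Morton ordering mentioned in the abstract above can be sketched in a few lines of Python. This is only a generic illustration of Morton (Z-order) indexing of matrix coordinates, not the SCV format or the paper's partitioning scheme; the function names, the 16-bit coordinate width, and the choice to place column bits in the even positions are all assumptions.

```python
# Generic Z-Morton (Z-order) indexing of (row, col) coordinates.
# Assumed convention: column bits go to even positions, row bits to odd ones.

def morton_index(row, col, bits=16):
    """Interleave the low `bits` bits of row and col into one integer."""
    code = 0
    for i in range(bits):
        code |= ((col >> i) & 1) << (2 * i)
        code |= ((row >> i) & 1) << (2 * i + 1)
    return code

def z_order(coords, bits=16):
    """Sort nonzero coordinates so that spatially close entries stay close."""
    return sorted(coords, key=lambda rc: morton_index(rc[0], rc[1], bits))
```

For a 2x2 block, `z_order([(1, 1), (0, 1), (1, 0), (0, 0)])` visits the entries in the familiar "Z" pattern: top-left, top-right, bottom-left, bottom-right.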

arXiv:2303.02237 [pdf, other]

doi 10.1109/TIFS.2023.3338553

PaReNTT: Low-Latency Parallel Residue Number System and NTT-Based Long Polynomial Modular Multiplication for Homomorphic Encryption

Authors: Weihang Tan, Sin-Wei Chiu, Antian Wang, Yingjie Lao, Keshab K. Parhi

Abstract: High-speed long polynomial multiplication is important for applications in homomorphic encryption (HE) and lattice-based cryptosystems. This paper addresses low-latency hardware architectures for long polynomial modular multiplication using the number-theoretic transform (NTT) and inverse NTT (iNTT). The Chinese remainder theorem (CRT) is used to decompose the modulus into multiple smaller moduli. Our proposed architecture, namely PaReNTT, makes four novel contributions. First, parallel NTT and iNTT architectures are proposed to reduce the number of clock cycles to process the polynomials. This can enable real-time processing for HE applications, as the number of clock cycles to process the polynomial is inversely proportional to the level of parallelism. Second, the proposed architecture eliminates the need for permuting the NTT outputs before their product is input to the iNTT. This reduces latency by n/4 clock cycles, where n is the length of the polynomial, and reduces buffer requirement by one delay-switch-delay circuit of size n. Third, an approach to select special moduli is presented where the moduli can be expressed in terms of a few signed power-of-two terms. Fourth, novel architectures for pre-processing for computing residual polynomials using the CRT and post-processing for combining the residual polynomials are proposed. These architectures significantly reduce the area consumption of the pre-processing and post-processing steps. The proposed long modular polynomial multiplications are ideal for applications that require low latency and high sample rate as these feed-forward architectures can be pipelined at arbitrary levels.

Submitted 6 July, 2023; v1 submitted 3 March, 2023; originally announced March 2023.

Journal ref: IEEE Transactions on Information Forensics and Security, Vol. 19, pp. 1646-1659, 2024
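
To illustrate the CRT step described in the abstract above, here is a minimal Python sketch that splits an integer into residues modulo several small coprime moduli, multiplies residue by residue, and recombines the result. It works on plain integers rather than polynomial coefficients, and the moduli (13, 17, 19) and function names are hypothetical choices for the example, not the special moduli selected in the paper.

```python
from math import prod

# Hypothetical pairwise-coprime moduli; their product is the large modulus M.
MODULI = (13, 17, 19)

def to_residues(x, moduli=MODULI):
    """Pre-processing: reduce x modulo each small modulus."""
    return [x % m for m in moduli]

def from_residues(residues, moduli=MODULI):
    """Post-processing: recombine residues into a value modulo M via the CRT."""
    M = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        M_i = M // m                       # product of the other moduli
        total += r * M_i * pow(M_i, -1, m) # pow(..., -1, m): modular inverse
    return total % M

def multiply_via_residues(x, y, moduli=MODULI):
    """Multiply modulo M using only independent small-modulus products."""
    rx, ry = to_residues(x, moduli), to_residues(y, moduli)
    rz = [(a * b) % m for a, b, m in zip(rx, ry, moduli)]
    return from_residues(rz, moduli)
```

Each small-modulus product is independent of the others, which is what makes this decomposition attractive for parallel hardware.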

arXiv:2208.14270 [pdf, other]

Integral Sampler and Polynomial Multiplication Architecture for Lattice-based Cryptography

Authors: Antian Wang, Weihang Tan, Keshab K. Parhi, Yingjie Lao

Abstract: With the rise of powerful quantum computers, lattice-based cryptography has proliferated in recent cryptographic hardware implementations due to its resistance against quantum computers. Among the computational blocks of lattice-based cryptography, the random errors produced by the sampler play a key role in ensuring the security of these schemes. This paper proposes an integral architecture for the sampler, which can reduce the overall resource consumption by reusing the multipliers and adders within the modular polynomial computation. For instance, our experimental results show that the proposed design can effectively reduce the DSP usage of the discrete Ziggurat sampling method.

Submitted 30 August, 2022; originally announced August 2022.

Comments: 6 pages, accepted by 35th IEEE Int. Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems

arXiv:2202.09623 [pdf, other]

doi 10.1109/ISCAS48785.2022.9937347

Multi-Channel FFT Architectures Designed via Folding and Interleaving

Authors: Nanda K. Unnikrishnan, Keshab K. Parhi

Abstract: Computing the FFT of a single channel is well understood in the literature. However, computing the FFT of multiple channels in a systematic manner has not been fully addressed. This paper presents a framework to design a family of multi-channel FFT architectures using folding and interleaving. Three distinct multi-channel FFT architectures are presented in this paper. These architectures differ in the input and output preprocessing steps and are based on different folding sets, i.e., different orders of execution.

Submitted 19 February, 2022; originally announced February 2022.

Comments: Proc. 2022 IEEE International Symposium on Circuits and Systems (ISCAS)

Journal ref: Proc. 2022 IEEE International Symposium on Circuits and Systems (ISCAS), pp. 142-146

arXiv:2110.12127 [pdf, other]

doi 10.1109/TC.2023.3251847

High-Speed VLSI Architectures for Modular Polynomial Multiplication via Fast Filtering and Applications to Lattice-Based Cryptography

Authors: Weihang Tan, Antian Wang, Yingjie Lao, Xinmiao Zhang, Keshab K. Parhi

Abstract: This paper presents a low-latency hardware accelerator for modular polynomial multiplication for lattice-based post-quantum cryptography and homomorphic encryption applications. The proposed novel modular polynomial multiplier exploits the fast finite impulse response (FIR) filter architecture to reduce the computational complexity of the schoolbook modular polynomial multiplication. We also extend this structure to fast $M$-parallel architectures while achieving low latency, high speed, and full hardware utilization. We comprehensively evaluate the performance of the proposed architectures under various polynomial settings as well as in the Saber scheme for post-quantum cryptography as a case study. The experimental results show that our proposed modular polynomial multiplier reduces both the computation time and the area-time product compared to the state-of-the-art designs.

Submitted 24 February, 2023; v1 submitted 22 October, 2021; originally announced October 2021.

Journal ref: IEEE Trans. on Computers, 72(9), pp. 2454-2466, Sept. 2023

arXiv:2108.06629 [pdf, other]

doi 10.1109/ICCAD51958.2021.9643567

LayerPipe: Accelerating Deep Neural Network Training by Intra-Layer and Inter-Layer Gradient Pipelining and Multiprocessor Scheduling

Authors: Nanda K. Unnikrishnan, Keshab K. Parhi

Abstract: The time required for training neural networks increases with size, complexity, and depth. Training model parameters by backpropagation inherently creates feedback loops. These loops hinder efficient pipelining and scheduling of the tasks within the layer and between consecutive layers. Prior approaches, such as PipeDream, have exploited the use of delayed gradient to achieve inter-layer pipelining. However, these approaches treat the entire backpropagation as a single task; this leads to an increase in computation time and processor underutilization. This paper presents novel optimization approaches where the gradient computations with respect to the weights and the activation functions are considered independently; therefore, these can be computed in parallel. This is referred to as intra-layer optimization. Additionally, the gradient computation with respect to the activation function is further divided into two parts and distributed to two consecutive layers. This leads to balanced scheduling where the computation time of each layer is the same. This is referred to as inter-layer optimization. The proposed system, referred to as LayerPipe, reduces the number of clock cycles required for training while maximizing processor utilization with minimal inter-processor communication overhead. LayerPipe achieves an average speedup of 25% and upwards of 80% with 7 to 9 processors with less communication overhead when compared to PipeDream.

Submitted 14 August, 2021; originally announced August 2021.

Comments: Proc. of the 2021 IEEE International Conference on Computer Aided Design (ICCAD)

Journal ref: 2021 IEEE/ACM Conference on Computer Aided Design (ICCAD)

arXiv:2005.13610 [pdf, other]

doi 10.1109/ISVLSI49217.2020.00-10

Molecular MUX-Based Physical Unclonable Functions

Authors: Lulu Ge, Keshab K. Parhi

Abstract: Physical unclonable functions (PUFs) are small circuits that are widely used as hardware security primitives for authentication. These circuits can generate unique signatures because of the inherent randomness in manufacturing and process variations. This paper introduces molecular PUFs based on multiplexer (MUX) PUFs using dual-rail representation. It may be noted that molecular PUFs have not been presented before. Each molecular multiplexer is synthesized using 16 molecular reactions. The intrinsic variations of the rate constants of the molecular reactions are assumed to provide inherent randomness necessary for uniqueness of PUFs. Based on Gaussian distribution of the rate constants of the reactions, this paper simulates intra-chip and inter-chip variations of linear molecular MUX PUFs containing 8, 16, 32 and 64 stages. These variations are, respectively, used to compute reliability and uniqueness. It is shown that, for the rate constants used in this paper, although 8-stage molecular MUX PUFs are not useful as PUFs, PUFs containing 16 or more stages are useful as molecular PUFs. Like electronic PUFs, increasing the number of stages increases uniqueness and reliability of the PUFs.

Submitted 27 May, 2020; originally announced May 2020.

Comments: Proc. of 2020 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), July 2020

arXiv:2004.11204 [pdf, other]

doi 10.1109/MCAS.2020.2988388

Classification using Hyperdimensional Computing: A Review

Authors: Lulu Ge, Keshab K. Parhi

Abstract: Hyperdimensional (HD) computing is built upon its unique data type referred to as hypervectors. The dimension of these hypervectors is typically in the range of tens of thousands. Proposed to solve cognitive tasks, HD computing aims at calculating similarity among its data. Data transformation is realized by three operations, including addition, multiplication and permutation. Its ultra-wide data representation introduces redundancy against noise. Since information is evenly distributed over every bit of the hypervectors, HD computing is inherently robust. Additionally, due to the nature of those three operations, HD computing leads to fast learning ability, high energy efficiency and acceptable accuracy in learning and classification tasks. This paper introduces the background of HD computing, and reviews the data representation, data transformation, and similarity measurement. The orthogonality in high dimensions presents opportunities for flexible computing. To balance the tradeoff between accuracy and efficiency, strategies include but are not limited to encoding, retraining, binarization and hardware acceleration. Evaluations indicate that HD computing shows great potential in addressing problems using data in the form of letters, signals and images. HD computing especially shows significant promise to replace machine learning algorithms as a light-weight classifier in the field of the Internet of Things (IoT).

Submitted 19 April, 2020; originally announced April 2020.

Comments: IEEE Circuits and Systems Magazine (2020)

Journal ref: IEEE Circuits and Systems Magazine, 20(2), pp. 30-47, June 2020
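
The three hypervector operations and the similarity measurement named in the abstract above can be sketched as follows. This is a minimal illustration assuming bipolar (+1/-1) hypervectors of dimension 10,000, a majority-sign rule for addition, elementwise products for multiplication, and a cyclic shift for permutation; these are common choices, not necessarily the exact ones covered in the review, and all function names are hypothetical.

```python
import random

D = 10_000  # assumed hypervector dimension ("tens of thousands" in the abstract)

def random_hv(rng, d=D):
    """Draw a random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(d)]

def bundle(hvs):
    """Addition: elementwise sum followed by sign (ties resolved to +1)."""
    return [1 if sum(col) >= 0 else -1 for col in zip(*hvs)]

def bind(a, b):
    """Multiplication: elementwise product; binding twice with b recovers a."""
    return [x * y for x, y in zip(a, b)]

def permute(a, shift=1):
    """Permutation: cyclic shift of the coordinates by `shift` positions."""
    shift %= len(a)
    return a[-shift:] + a[:-shift]

def similarity(a, b):
    """Cosine similarity, which for bipolar vectors is the dot product over D."""
    return sum(x * y for x, y in zip(a, b)) / len(a)
```

With `rng = random.Random(0)` and three random hypervectors `a`, `b`, `c`, one would expect `similarity(a, b)` to be close to zero (near-orthogonality in high dimensions), while `similarity(bundle([a, b, c]), a)` stays clearly positive because the bundle retains information about each of its inputs.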

arXiv:2004.10936 [pdf, other]

PERMDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices

Authors: Chunhua Deng, Siyu Liao, Yi Xie, Keshab K. Parhi, Xuehai Qian, Bo Yuan

Abstract: Deep neural networks (DNNs) have emerged as the most important and popular artificial intelligence (AI) technique. The growth of model size poses a key energy efficiency challenge for the underlying computing platform. Thus, model compression becomes a crucial problem. However, the current approaches are limited by various drawbacks. Specifically, network sparsification approach suffers from irregularity, heuristic nature and large indexing overhead. On the other hand, the recent structured matrix-based approach (i.e., CirCNN) is limited by the relatively complex arithmetic computation (i.e., FFT), less flexible compression ratio, and its inability to fully utilize input sparsity. To address these drawbacks, this paper proposes PermDNN, a novel approach to generate and execute hardware-friendly structured sparse DNN models using permuted diagonal matrices. Compared with unstructured sparsification approach, PermDNN eliminates the drawbacks of indexing overhead, heuristic compression effects and time-consuming retraining. Compared with circulant structure-imposing approach, PermDNN enjoys the benefits of higher reduction in computational complexity, flexible compression ratio, simple arithmetic computation and full utilization of input sparsity. We propose PermDNN architecture, a multi-processing element (PE) fully-connected (FC) layer-targeted computing engine. The entire architecture is highly scalable and flexible, and hence it can support the needs of different applications with different model configurations. We implement a 32-PE design using CMOS 28nm technology. Compared with EIE, PermDNN achieves 3.3x to 4.8x higher throughput, 5.9x to 8.5x better area efficiency and 2.8x to 4.0x better energy efficiency on different workloads. Compared with CirCNN, PermDNN achieves 11.51x higher throughput and 3.89x better energy efficiency.

Submitted 22 April, 2020; originally announced April 2020.
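
As a rough sketch of why a permuted diagonal block is cheap to store and apply, the Python below assumes one common reading of the term: a p-by-p block whose only nonzeros sit on a cyclically shifted diagonal, at positions (i, (i + shift) mod p). Under that assumption the block needs p stored values plus one shift instead of p*p weights and per-entry indices. The exact definition used in the paper may differ, and the function names are hypothetical.

```python
# Assumed structure: block[i][(i + shift) % p] = values[i], all else zero.

def permdiag_matvec(values, shift, x):
    """Multiply a permuted diagonal block by x using p multiplications."""
    p = len(values)
    return [values[i] * x[(i + shift) % p] for i in range(p)]

def permdiag_to_dense(values, shift):
    """Expand the compact form into a dense p-by-p block (for checking only)."""
    p = len(values)
    dense = [[0] * p for _ in range(p)]
    for i, v in enumerate(values):
        dense[i][(i + shift) % p] = v
    return dense

def dense_matvec(W, x):
    """Reference dense matrix-vector product."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]
```

Comparing `permdiag_matvec(values, shift, x)` against `dense_matvec(permdiag_to_dense(values, shift), x)` on a small block is a simple way to confirm that the compact form computes the same product.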

arXiv:1911.07110 [pdf, other]

doi 10.1109/IEEECONF44664.2019.9048931

Training DNA Perceptrons via Fractional Coding

Authors: Xingyi Liu, Keshab K. Parhi

Abstract: This paper describes a novel approach to synthesize molecular reactions to train a perceptron, i.e., a single-layered neural network, with sigmoidal activation function. The approach is based on fractional coding where a variable is represented by two molecules. The synergy between fractional coding in molecular computing and stochastic logic implementations in electronic computing is key to translating known stochastic logic circuits to molecular computing. In prior work, a DNA perceptron with bipolar inputs and unipolar output was proposed for inference. The focus of this paper is on synthesis of molecular reactions for training of the DNA perceptron. A new molecular scaler that performs multiplication by a factor greater than 1 is proposed based on fractional coding. The training of the perceptron proposed in this paper is based on a modified backpropagation equation as the exact equation cannot be easily mapped to molecular reactions using fractional coding.

Submitted 8 January, 2020; v1 submitted 16 November, 2019; originally announced November 2019.

Comments: Proc. 2019 Asilomar Conference on Signals, Systems and Computers
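
The abstract above says that fractional coding represents a variable by two molecules. As a rough sketch, assuming the convention usually paired with stochastic logic (the abstract itself does not give the formulas), a unipolar value is the share of one molecular type in the pair and a bipolar value is the normalized difference. The names `x0` and `x1` stand for hypothetical concentrations of the two types.

```python
def unipolar_value(x0: float, x1: float) -> float:
    """Unipolar value in [0, 1]: fraction of the pair that is type 1.

    Assumed encoding: x = [X1] / ([X0] + [X1]).
    """
    total = x0 + x1
    if total <= 0:
        raise ValueError("total concentration must be positive")
    return x1 / total


def bipolar_value(x0: float, x1: float) -> float:
    """Bipolar value in [-1, 1]: normalized difference of the pair.

    Assumed encoding: x = ([X1] - [X0]) / ([X0] + [X1]).
    """
    total = x0 + x1
    if total <= 0:
        raise ValueError("total concentration must be positive")
    return (x1 - x0) / total


if __name__ == "__main__":
    # The same pair of concentrations read under both encodings.
    print(unipolar_value(1.0, 3.0))
    print(bipolar_value(1.0, 3.0))
```

Under these assumed definitions the two readings are related by bipolar = 2 * unipolar - 1, the same mapping used for bit streams in stochastic logic.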

arXiv:1910.05643 [pdf, other]

doi 10.1109/TBCAS.2020.2979485

Molecular and DNA Artificial Neural Networks via Fractional Coding

Authors: Xingyi Liu, Keshab K. Parhi

Abstract: This paper considers implementation of artificial neural networks (ANNs) using molecular computing and DNA based on fractional coding. Prior work had addressed molecular two-layer ANNs with binary inputs and arbitrary weights. In prior work using fractional coding, a simple molecular perceptron that computes sigmoid of scaled weighted sum of the inputs was presented where the inputs and the weights lie between [-1, 1]. Even for computing the perceptron, the prior approach suffers from two major limitations. First, it cannot compute the sigmoid of the weighted sum, but only the sigmoid of the scaled weighted sum. Second, many machine learning applications require the coefficients to be arbitrarily positive and negative numbers that are not bounded between [-1, 1]; such numbers cannot be handled by the prior perceptron using fractional coding. This paper makes four contributions. First, molecular perceptrons that can handle arbitrary weights and can compute sigmoid of the weighted sums are presented. Thus, these molecular perceptrons are ideal for regression applications and multi-layer ANNs. A new molecular divider is introduced and is used to compute sigmoid(ax) where a > 1. Second, based on fractional coding, a molecular artificial neural network (ANN) with one hidden layer is presented. Third, a trained ANN classifier with one hidden layer from a seizure prediction application using electroencephalogram is mapped to molecular reactions and DNA, and their performances are presented. Fourth, molecular activation functions for rectified linear unit (ReLU) and softmax are also presented.

Submitted 7 March, 2020; v1 submitted 12 October, 2019; originally announced October 2019.

Comments: IEEE Transactions on Biomedical Circuits and Systems, 2020

arXiv:1610.07560 [pdf, other]

Automated OCT Segmentation for Images with DME

Authors: Sohini Roychowdhury, Dara D. Koozekanani, Michael Reinsbach, Keshab K. Parhi

Abstract: This paper presents a novel automated system that segments six sub-retinal layers from optical coherence tomography (OCT) image stacks of healthy patients and patients with diabetic macular edema (DME). First, each image in the OCT stack is denoised using a Wiener deconvolution algorithm that estimates the additive speckle noise variance using a novel Fourier-domain based structural error. This denoising method enhances the image SNR by an average of 12 dB. Next, the denoised images are subjected to an iterative multi-resolution high-pass filtering algorithm that detects seven sub-retinal surfaces in six iterative steps. The thicknesses of each sub-retinal layer for all scans from a particular OCT stack are then compared to the manually marked ground truth. The proposed system uses adaptive thresholds for denoising and segmenting each image and hence it is robust to disruptions in the retinal micro-structure due to DME. The proposed denoising and segmentation system has an average error of 1.2-5.8 μm and 3.5-26 μm for segmenting sub-retinal surfaces in normal and abnormal images with DME, respectively. For estimating the sub-retinal layer thicknesses, the proposed system has an average error of 0.2-2.5 μm and 1.8-18 μm in normal and abnormal images, respectively. Additionally, the average inner sub-retinal layer thickness in abnormal images is estimated as 275 μm (r=0.92) with an average error of 9.3 μm, while the average thickness of the outer layers in abnormal images is estimated as 57.4 μm (r=0.74) with an average error of 3.5 μm. The proposed system can be useful for tracking the disease progression for DME over a period of time.

Submitted 24 October, 2016; originally announced October 2016.

Comments: 31 pages, 7 figures, CRC Press Book Chapter, 2016

arXiv:1603.07055 [pdf]

doi 10.1109/TCSII.2016.2546904

LLR-based Successive-Cancellation List Decoder for Polar Codes with Multi-bit Decision

Authors: Bo Yuan, Keshab K. Parhi

Abstract: Due to their capacity-achieving property, polar codes have become one of the most attractive channel codes. To date, the successive cancellation list (SCL) decoding algorithm is the primary approach that can guarantee outstanding error-correcting performance of polar codes. However, the hardware designs of the original SCL decoder have large silicon area and long decoding latency. Although some recent efforts can reduce either the area or latency of SCL decoders, these two metrics still cannot be optimized at the same time. This paper, for the first time, proposes a general log-likelihood-ratio (LLR)-based SCL decoding algorithm with multi-bit decision. This new algorithm, referred to as LLR-$2^K$b-SCL, can determine $2^K$ bits simultaneously for arbitrary K with the use of LLR messages. In addition, a reduced-data-width scheme is presented to reduce the critical path of the sorting block. Then, based on the proposed algorithm, a VLSI architecture of the new SCL decoder is developed. Synthesis results show that for an example (1024, 512) polar code with list size 4, the proposed LLR-$2^K$b-SCL decoders achieve significant reduction in both area and latency as compared to prior works. As a result, the hardware efficiency of the proposed designs with K=2 and 3 is 2.33 times and 3.32 times that of the state-of-the-art works, respectively.

Submitted 22 March, 2016; originally announced March 2016.

Comments: accepted by IEEE Trans. Circuits and Systems II

arXiv:1501.03235 [pdf]

Successive Cancellation Decoding of Polar Codes using Stochastic Computing

Authors: Bo Yuan, Keshab K. Parhi

Abstract: Polar codes have emerged as the most favorable channel codes for their unique capacity-achieving property. To date, numerous works have been reported for efficient design of polar code decoders. However, these prior efforts focused on design of polar decoders via deterministic computation, while the behavior of stochastic polar decoders, which can have potential advantages such as low complexity and strong error-resilience, has not been studied in the existing literature. This paper, for the first time, investigates polar decoding using stochastic logic. Specifically, the commonly-used successive cancellation (SC) algorithm is reformulated into the stochastic form. Several methods that can potentially improve decoding performance are discussed and analyzed. Simulation results show that a stochastic SC decoder can achieve error-correcting performance similar to that of its deterministic counterpart. This work can pave the way for future hardware design of stochastic polar code decoders.

Submitted 13 January, 2015; originally announced January 2015.

Comments: accepted by International Symposium on Circuits and Systems (ISCAS) 2015

arXiv:1411.7286 [pdf]

Algorithm and Architecture for Hybrid Decoding of Polar Codes

Authors: Bo Yuan, Keshab K. Parhi

Abstract: Polar codes are the first provable capacity-achieving forward error correction (FEC) codes. In general, polar codes can be decoded via either the successive cancellation (SC) or the belief propagation (BP) decoding algorithm. However, to date practical applications of polar codes have been hindered by the long decoding latency and limited error-correcting performance problems. In this paper, based on our recently proposed early stopping criteria for the BP algorithm, we propose a hybrid BP-SC decoding scheme to improve the decoding performance of polar codes with relatively short latency. Simulation results show that, for (1024, 512) polar codes, the proposed approach leads to at least 0.2dB gain over the BP algorithm with the same maximum number of iterations for the entire SNR region, and also achieves 0.2dB decoding gain over the BP algorithm with the same worst-case latency in the high SNR region. Besides, compared to the SC algorithm, the proposed scheme leads to 0.2dB gain in the medium SNR region with much less average decoding latency. In addition, we also propose a low-complexity unified hardware architecture for the hybrid decoding scheme, which is able to implement SC and BP algorithms using the same hardware.

Submitted 26 November, 2014; originally announced November 2014.

Comments: accepted by 2014 Asilomar Conference on Signals, Systems, and Computers

arXiv:1411.7282 [pdf]

Successive Cancellation List Polar Decoder using Log-likelihood Ratios

Authors: Bo Yuan, Keshab K. Parhi

Abstract: The successive cancellation list (SCL) decoding algorithm is a powerful method that can help polar codes achieve excellent error-correcting performance. However, the current SCL algorithm and decoders are based on likelihood or log-likelihood forms, which render high hardware complexity. In this paper, we propose a log-likelihood-ratio (LLR)-based SCL (LLR-SCL) decoding algorithm, which needs only half the computation and storage complexity of the conventional one. Then, based on the proposed algorithm, we develop low-complexity VLSI architectures for LLR-SCL decoders. Analysis results show that the proposed LLR-SCL decoder achieves 50% reduction in hardware and 98% improvement in hardware efficiency.

Submitted 12 December, 2014; v1 submitted 26 November, 2014; originally announced November 2014.

Comments: accepted by 2014 Asilomar Conference on Signals, Systems, and Computers

arXiv:1406.7036 [pdf]

Low-Latency Successive-Cancellation List Decoders for Polar Codes with Multi-bit Decision

Authors: Bo Yuan, Keshab K. Parhi

Abstract: Polar codes, as the first provable capacity-achieving error-correcting codes, have received much attention in recent years. However, the decoding performance of polar codes with the traditional successive-cancellation (SC) algorithm cannot match that of the low-density parity-check (LDPC) or turbo codes. Because the SC list (SCL) decoding algorithm can significantly improve the error-correcting performance of polar codes, design of SCL decoders is important for polar codes to be deployed in practical applications. However, because the prior latency reduction approaches for SC decoders are not applicable for SCL decoders, these list decoders suffer from the long latency bottleneck. In this paper, we propose a multi-bit-decision approach that can significantly reduce latency of SCL decoders. First, we present a reformulated SCL algorithm that can perform intermediate decoding of 2 bits together. The proposed approach, referred to as the 2-bit reformulated SCL (2b-rSCL) algorithm, can reduce the latency of the SCL decoder from (3n-2) to (2n-2) clock cycles without any performance loss. Then, we extend the idea of 2-bit-decision to the general case, and propose a general decoding scheme that can perform intermediate decoding of any $2^K$ bits simultaneously. This general approach, referred to as the $2^K$-bit reformulated SCL ($2^K$b-rSCL) algorithm, can reduce the overall decoding latency to as short as $n/2^{K-2}-2$ cycles. Furthermore, based on the proposed algorithms, VLSI architectures for 2b-rSCL and 4b-rSCL decoders are synthesized. Compared with a prior SCL decoder, the proposed (1024, 512) 2b-rSCL and 4b-rSCL decoders can achieve 21% and 60% reduction in latency, 1.66 times and 2.77 times increase in coded throughput with list size 2, and 2.11 times and 3.23 times increase in coded throughput with list size 4, respectively.

Submitted 20 September, 2014; v1 submitted 26 June, 2014; originally announced June 2014.

Comments: submitted to IEEE TVLSI in Feb 2014, accepted in Sep. 2014
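
The cycle counts quoted in the abstract above can be checked with a few lines of arithmetic. This is a minimal sketch that reads the general expression as n/2^(K-2) - 2, an interpretation chosen because it reduces to the stated (2n-2) cycles for the 2-bit case (K=1). The function names are hypothetical, and the numbers are cycle counts only: they say nothing about clock period, so they are not the same quantity as the 21% and 60% latency reductions reported from synthesis.

```python
def conventional_scl_cycles(n: int) -> int:
    """Cycle count of the baseline SCL decoder stated in the abstract: 3n - 2."""
    return 3 * n - 2


def multibit_scl_cycles(n: int, k: int) -> int:
    """Cycle count when 2**k bits are decided together.

    Reads the abstract's bound as n / 2**(k - 2) - 2, written here as
    4n / 2**k - 2 so that k = 1 stays in integer arithmetic.
    """
    if n <= 0 or k < 1:
        raise ValueError("n must be positive and k must be at least 1")
    return (4 * n) // (2 ** k) - 2


if __name__ == "__main__":
    n = 1024  # code length of the (1024, 512) example in the abstract
    print("baseline:", conventional_scl_cycles(n))
    for k in (1, 2, 3):
        print(f"{2 ** k}-bit decision:", multibit_scl_cycles(n, k))
```

For n = 1024 this gives 3070 cycles for the baseline, 2046 for the 2-bit decision and 1022 for the 4-bit decision.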

arXiv:1111.0705 [pdf]

Low-Latency SC Decoder Architectures for Polar Codes

Authors: Chuan Zhang, Bo Yuan, Keshab K. Parhi

Abstract: Nowadays polar codes are becoming one of the most favorable capacity-achieving error correction codes for their low encoding and decoding complexity. However, due to the large code length required by practical applications, the few existing successive cancellation (SC) decoder implementations still suffer from not only high hardware cost but also long decoding latency. This paper presents several novel approaches to design low-latency decoders for polar codes based on look-ahead techniques. Look-ahead techniques can be employed to reschedule the decoding process of a polar decoder in numerous ways. However, among those approaches, only well-arranged ones can achieve good performance in terms of both latency and hardware complexity. By revealing the recurrence property of the SC decoding chart, the authors succeed in reducing the decoding latency by 50% with look-ahead techniques. With the help of VLSI-DSP design techniques such as pipelining, folding, unfolding, and parallel processing, methodologies for four different polar decoder architectures have been proposed to meet various application demands. A sub-structure sharing scheme has been adopted to design the merged processing element (PE) for further hardware reduction. In addition, systematic methods for constructing the refined pipelining decoder (2nd design) and the input generating circuits (ICG) block have been given. Detailed gate-level analysis has demonstrated that the proposed designs show latency advantages over conventional ones with similar hardware cost.

Submitted 2 November, 2011; originally announced November 2011.

arXiv:1111.0704 [pdf]

Reduced-Latency SC Polar Decoder Architectures

Authors: Chuan Zhang, Bo Yuan, Keshab K. Parhi

Abstract: Polar codes have become one of the most favorable capacity-achieving error correction codes (ECC) along with their simple encoding method. However, among the very few prior successive cancellation (SC) polar decoder designs, the required long code length makes the decoding latency high. In this paper, the conventional decoding algorithm is transformed with look-ahead techniques. This reduces the decoding latency by 50%. With pipelining and parallel processing schemes, a parallel SC polar decoder is proposed. A sub-structure sharing approach is employed to design the merged processing element (PE). Moreover, inspired by the real FFT architecture, this paper presents a novel input generating circuit (ICG) block that can generate additional input signals for merged PEs on-the-fly. Gate-level analysis has demonstrated that the proposed design achieves 50% lower decoding latency and twice the throughput of the conventional one with similar hardware cost.

Submitted 2 November, 2011; originally announced November 2011.

arXiv:1111.0703 [pdf]

Efficient Network for Non-Binary QC-LDPC Decoder

Authors: Chuan Zhang, Keshab K. Parhi

Abstract: This paper presents approaches to develop efficient networks for non-binary quasi-cyclic LDPC (QC-LDPC) decoders. By exploiting the intrinsic shifting and symmetry properties of the check matrices, significant reduction of memory size and routing complexity can be achieved. Two different efficient network architectures have been proposed for Class-I and Class-II non-binary QC-LDPC decoders, respectively. Comparison results have shown that for the 64-ary (1260, 630) rate-0.5 Class-I code, the proposed scheme can save more than 70.6% of the hardware required by the shuffle network compared with the state-of-the-art designs. The proposed decoder example for the 32-ary (992, 496) rate-0.5 Class-II code can achieve a 93.8% shuffle network reduction compared with the conventional ones. Meanwhile, based on the similarity of Class-I and Class-II codes, a similar shuffle network is further developed to incorporate both classes of codes at a very low cost.

Submitted 2 November, 2011; originally announced November 2011.

Showing 1–26 of 26 results for author: Parhi, K K