Faster Inference of Integer SWIN Transformer by Removing the GELU Activation

Authors: Mohammadreza Tayaranian, Seyyed Hasan Mozafari, James J. Clark, Brett Meyer, Warren Gross

Abstract: SWIN transformer is a prominent vision transformer model that has state-of-the-art accuracy in image classification tasks. Despite this success, its unique architecture causes slower inference compared with similar deep neural networks. Integer quantization of the model is one of the methods used to improve its inference latency. However, state-of-the-art has not been able to fully quantize the model. In this work, we improve upon the inference latency of the state-of-the-art methods by removing the floating-point operations, which are associated with the GELU activation in Swin Transformer. While previous work proposed to replace the non-integer operations with linear approximation functions, we propose to replace GELU with ReLU activation. The advantage of ReLU over previous methods is its low memory and computation complexity. We use iterative knowledge distillation to compensate for the lost accuracy due to replacing GELU with ReLU. We quantize our GELU-less SWIN transformer and show that on an RTX 4090 NVIDIA GPU we can improve the inference latency of the quantized SWIN transformer by at least $11\%$ while maintaining an accuracy drop of under $0.5\%$ on the ImageNet evaluation dataset.

Submitted 2 February, 2024; originally announced February 2024.

Comments: 5 pages, 1 figure. Submitted to Edge Intelligence Workshop III, an AAAI 2024 workshop
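
To make the motivation in the abstract above concrete, here is a minimal sketch (not the authors' implementation) contrasting exact GELU, which needs a floating-point `erf`, with ReLU, which is a single comparison on an integer. The function names `gelu` and `relu_int` and the sample values are illustrative assumptions. ReLU also commutes with any positive quantization scale s, since relu(s*q) = s*relu(q), which is why it can be applied to the integer values directly.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF (floating point)."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def relu_int(q: int) -> int:
    """ReLU on an integer activation: one comparison, no floating point."""
    return q if q > 0 else 0


# Hypothetical quantized activations (integers), for illustration only.
quantized = [-3, -1, 0, 2, 5]

print([relu_int(q) for q in quantized])         # stays in the integer domain
print([gelu(float(q)) for q in quantized])      # requires float arithmetic
```

The sketch only shows why the activation itself stays integer-only; the accuracy recovery by knowledge distillation described in the abstract is not modelled here.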

arXiv:2401.13212 [pdf, other]

AdCorDA: Classifier Refinement via Adversarial Correction and Domain Adaptation

Authors: Lulan Shen, Ali Edalati, Brett Meyer, Warren Gross, James J. Clark

Abstract: This paper describes a simple yet effective technique for refining a pretrained classifier network. The proposed AdCorDA method is based on modification of the training set and making use of the duality between network weights and layer inputs. We call this input space training. The method consists of two stages - adversarial correction followed by domain adaptation. Adversarial correction uses adversarial attacks to correct incorrect training-set classifications. The incorrectly classified samples of the training set are removed and replaced with the adversarially corrected samples to form a new training set, and then, in the second stage, domain adaptation is performed back to the original training set. Extensive experimental validations show significant accuracy boosts of over 5% on the CIFAR-100 dataset. The technique can be straightforwardly applied to refinement of weight-quantized neural networks, where experiments show substantial enhancement in performance over the baseline. The adversarial correction technique also results in enhanced robustness to adversarial attacks.

Submitted 23 January, 2024; originally announced January 2024.

arXiv:2401.12014 [pdf, other]

Robustness to distribution shifts of compressed networks for edge devices

Authors: Lulan Shen, Ali Edalati, Brett Meyer, Warren Gross, James J. Clark

Abstract: It is necessary to develop efficient DNNs deployed on edge devices with limited computation resources. However, the compressed networks often execute new tasks in the target domain, which is different from the source domain where the original network is trained. It is important to investigate the robustness of compressed networks in two types of data distribution shifts: domain shifts and adversarial perturbations. In this study, we discover that compressed models are less robust to distribution shifts than their original networks. Interestingly, larger networks are more vulnerable to losing robustness than smaller ones, even when they are compressed to a similar size as the smaller networks. Furthermore, compact networks obtained by knowledge distillation are much more robust to distribution shifts than pruned networks. Finally, post-training quantization is a reliable method for achieving significant robustness to distribution shifts, and it outperforms both pruned and distilled models in terms of robustness.

Submitted 22 January, 2024; originally announced January 2024.

arXiv:2311.14762 [pdf, other]

The 2nd Workshop on Maritime Computer Vision (MaCVi) 2024

Authors: Benjamin Kiefer, Lojze Žust, Matej Kristan, Janez Perš, Matija Teršek, Arnold Wiliem, Martin Messmer, Cheng-Yen Yang, Hsiang-Wei Huang, Zhongyu Jiang, Heng-Cheng Kuo, Jie Mei, Jenq-Neng Hwang, Daniel Stadler, Lars Sommer, Kaer Huang, Aiguo Zheng, Weitu Chong, Kanokphan Lertniphonphan, Jun Xie, Feng Chen, Jian Li, Zhepeng Wang, Luca Zedda, Andrea Loddo, et al. (24 additional authors not shown)

Abstract: The 2nd Workshop on Maritime Computer Vision (MaCVi) 2024 addresses maritime computer vision for Unmanned Aerial Vehicles (UAV) and Unmanned Surface Vehicles (USV). Three challenge categories are considered: (i) UAV-based Maritime Object Tracking with Re-identification, (ii) USV-based Maritime Obstacle Segmentation and Detection, (iii) USV-based Maritime Boat Tracking. The USV-based Maritime Obstacle Segmentation and Detection features three sub-challenges, including a new embedded challenge addressing efficient inference on real-world embedded devices. This report offers a comprehensive overview of the findings from the challenges. We provide both statistical and qualitative analyses, evaluating trends from over 195 submissions. All datasets, evaluation code, and the leaderboard are available to the public at https://macvi.org/workshop/macvi24.

Submitted 23 November, 2023; originally announced November 2023.

Comments: Part of 2nd Workshop on Maritime Computer Vision (MaCVi) 2024 IEEE Xplore submission as part of WACV 2024

arXiv:2307.07133 [pdf, other]

Step-GRAND: A Low Latency Universal Soft-input Decoder

Authors: Syed Mohsin Abbas, Marwan Jalaleddine, Chi-Ying Tsui, Warren J. Gross

Abstract: GRAND features both soft-input and hard-input variants that are well suited to efficient hardware implementations that can be characterized with achievable average and worst-case decoding latency. This paper introduces step-GRAND, a soft-input variant of GRAND that, in addition to achieving appealing average decoding latency, also reduces the worst-case decoding latency of the corresponding hardware implementation. The hardware implementation results demonstrate that the proposed step-GRAND can decode CA-polar code $(128,105+11)$ with an average information throughput of $47.7$ Gbps at the target FER of $\leq10^{-7}$. Furthermore, the proposed step-GRAND hardware is $10\times$ more area efficient than the previous soft-input ORBGRAND hardware implementation, and its worst-case latency is $\frac{1}{6.8}\times$ that of the previous ORBGRAND hardware.

Submitted 26 July, 2023; v1 submitted 13 July, 2023; originally announced July 2023.

Comments: Submitted to 2023 IEEE Globecom Workshops

arXiv:2304.11207 [pdf, other]

SSS3D: Fast Neural Architecture Search For Efficient Three-Dimensional Semantic Segmentation

Authors: Olivier Therrien, Marihan Amein, Zhuoran Xiong, Warren J. Gross, Brett H. Meyer

Abstract: We present SSS3D, a fast multi-objective NAS framework designed to find computationally efficient 3D semantic scene segmentation networks. It uses RandLA-Net, an off-the-shelf point-based network, as a super-network to enable weight sharing and reduce search time by 99.67% for single-stage searches. SSS3D has a complex search space composed of sampling and architectural parameters that can form $2.88 \times 10^{17}$ possible networks. To further reduce search time, SSS3D splits the complete search space and introduces a two-stage search that finds optimal subnetworks in 54% of the time required by single-stage searches.

Submitted 21 April, 2023; originally announced April 2023.

Comments: Accepted as a full paper by the TinyML Research Symposium 2023

arXiv:2303.16322 [pdf, other]

FMAS: Fast Multi-Objective SuperNet Architecture Search for Semantic Segmentation

Authors: Zhuoran Xiong, Marihan Amein, Olivier Therrien, Warren J. Gross, Brett H. Meyer

Abstract: We present FMAS, a fast multi-objective neural architecture search framework for semantic segmentation. FMAS subsamples the structure and pre-trained parameters of DeepLabV3+, without fine-tuning, dramatically reducing training time during search. To further reduce candidate evaluation time, we use a subset of the validation dataset during the search. Only the final, Pareto non-dominated, candidates are ultimately fine-tuned using the complete training set. We evaluate FMAS by searching for models that effectively trade accuracy and computational cost on the PASCAL VOC 2012 dataset. FMAS finds competitive designs quickly, e.g., taking just 0.5 GPU days to discover a DeepLabV3+ variant that reduces FLOPs and parameters by 10$\%$ and 20$\%$ respectively, for less than 3$\%$ increased error. We also search on an edge device called GAP8 and use its latency as the metric. FMAS is capable of finding a 2.2$\times$ faster network with 7.61$\%$ MIoU loss.

Submitted 28 March, 2023; originally announced March 2023.

Comments: Accepted as a full paper by the TinyML Research Symposium 2023
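
The "Pareto non-dominated" selection mentioned in the FMAS abstract can be sketched as follows. This is a generic illustration of Pareto filtering under the assumption that every objective is minimized; the helper names `dominates` and `pareto_front` and the candidate values are hypothetical and are not taken from the paper.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in one.

    All objectives are assumed to be minimized (e.g. error and FLOPs).
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Hypothetical (error, GFLOPs) pairs for four candidate networks.
candidates = [(0.20, 10.0), (0.25, 8.0), (0.30, 9.0), (0.22, 12.0)]
print(pareto_front(candidates))  # [(0.2, 10.0), (0.25, 8.0)]
```

In a search of this kind, only the surviving candidates would then go on to the expensive fine-tuning step.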

arXiv:2302.12454 [pdf, ps, other]

Stochastic Simulated Quantum Annealing for Fast Solution of Combinatorial Optimization Problems

Authors: Naoya Onizawa, Ryoma Sasaki, Duckgyu Shin, Warren J. Gross, Takahiro Hanyu

Abstract: In this paper, we introduce stochastic simulated quantum annealing (SSQA) for large-scale combinatorial optimization problems. SSQA is designed based on stochastic computing and quantum Monte Carlo, which can simulate quantum annealing (QA) by using multiple replicas of spins (probabilistic bits) in classical computing. The use of stochastic computing leads to an efficient parallel spin-state update algorithm, enabling quick search for a solution around the global minimum energy. Therefore, SSQA realizes quantum-like annealing for large-scale problems and can handle fully connected models in combinatorial optimization, unlike QA. The proposed method is evaluated in MATLAB on graph isomorphism problems, which are typical combinatorial optimization problems. The proposed method achieves a convergence speed an order of magnitude faster than a conventional stochastic simulated annealing method. Additionally, it can handle a 100-times larger problem size compared to QA and a 25-times larger problem size compared to a traditional SA method, respectively, for similar convergence probabilities.

Submitted 28 June, 2023; v1 submitted 24 February, 2023; originally announced February 2023.

Comments: 14 pages, 8 figures

arXiv:2212.12965 [pdf, other]

BD-KD: Balancing the Divergences for Online Knowledge Distillation

Authors: Ibtihel Amara, Nazanin Sepahvand, Brett H. Meyer, Warren J. Gross, James J. Clark

Abstract: Knowledge distillation (KD) has gained a lot of attention in the field of model compression for edge devices thanks to its effectiveness in compressing large powerful networks into smaller lower-capacity models. Online distillation, in which both the teacher and the student are learning collaboratively, has also gained much interest due to its ability to improve on the performance of the networks involved. The Kullback-Leibler (KL) divergence ensures the proper knowledge transfer between the teacher and student. However, most online KD techniques present some bottlenecks under the network capacity gap. By cooperatively and simultaneously training the models, the KL distance becomes incapable of properly minimizing the teacher's and student's distributions. Alongside accuracy, critical edge device applications are in need of well-calibrated compact networks. Confidence calibration provides a sensible way of getting trustworthy predictions. We propose BD-KD: Balancing of Divergences for online Knowledge Distillation. We show that adaptively balancing between the reverse and forward divergences shifts the focus of the training strategy to the compact student network without limiting the teacher network's learning process. We demonstrate that, by performing this balancing design at the level of the student distillation loss, we improve upon both performance accuracy and calibration of the compact student network. We conducted extensive experiments using a variety of network architectures and show improvements on multiple datasets including CIFAR-10, CIFAR-100, Tiny-ImageNet, and ImageNet. We illustrate the effectiveness of our approach through comprehensive comparisons and ablations with current state-of-the-art online and offline KD techniques.

Submitted 25 December, 2022; originally announced December 2022.
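
The forward and reverse divergences that the BD-KD abstract refers to can be illustrated with a small sketch. It is not the paper's loss: the abstract describes an adaptive balance, whereas the fixed weight `alpha`, the `temperature` value, and the function names below are assumptions made purely for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a distillation temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def balanced_kd_loss(teacher_logits, student_logits, alpha=0.5, temperature=4.0):
    """Weighted mix of forward KL(teacher || student) and reverse KL(student || teacher).

    alpha is a fixed, hypothetical weight; an adaptive scheme would set it per sample.
    """
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    return alpha * kl(t, s) + (1.0 - alpha) * kl(s, t)


# KL is asymmetric, which is why the two directions can be traded off.
p, q = [0.9, 0.1], [0.5, 0.5]
print(kl(p, q), kl(q, p))
```

Forward KL penalizes the student for missing probability mass where the teacher has it, while reverse KL penalizes the student for placing mass where the teacher has little; a weighted sum lets a training scheme shift emphasis between the two.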

arXiv:2209.07606 [pdf, other]

CES-KD: Curriculum-based Expert Selection for Guided Knowledge Distillation

Authors: Ibtihel Amara, Maryam Ziaeefard, Brett H. Meyer, Warren Gross, James J. Clark

Abstract: Knowledge distillation (KD) is an effective tool for compressing deep classification models for edge devices. However, the performance of KD is affected by the large capacity gap between the teacher and student networks. Recent methods have resorted to a multiple teacher assistant (TA) setting for KD, which sequentially decreases the size of the teacher model to relatively bridge the size gap between these models. This paper proposes a new technique called Curriculum Expert Selection for Knowledge Distillation (CES-KD) to efficiently enhance the learning of a compact student under the capacity gap problem. This technique is built upon the hypothesis that a student network should be guided gradually using a stratified teaching curriculum as it learns easy (hard) data samples better and faster from a lower (higher) capacity teacher network. Specifically, our method is a gradual TA-based KD technique that selects a single teacher per input image based on a curriculum driven by the difficulty in classifying the image. In this work, we empirically verify our hypothesis and rigorously experiment with CIFAR-10, CIFAR-100, CINIC-10, and ImageNet datasets and show improved accuracy on VGG-like models, ResNets, and WideResNet architectures.

Submitted 15 September, 2022; originally announced September 2022.

Comments: ICPR2022

arXiv:2208.02070 [pdf, other]

Efficient Fine-Tuning of Compressed Language Models with Learners

Authors: Danilo Vucetic, Mohammadreza Tayaranian, Maryam Ziaeefard, James J. Clark, Brett H. Meyer, Warren J. Gross

Abstract: Fine-tuning BERT-based models is resource-intensive in memory, computation, and time. While many prior works aim to improve inference efficiency via compression techniques, e.g., pruning, these works do not explicitly address the computational challenges of training to downstream tasks. We introduce Learner modules and priming, novel methods for fine-tuning that exploit the overparameterization of pre-trained language models to gain benefits in convergence speed and resource utilization. Learner modules navigate the double bind of 1) training efficiently by fine-tuning a subset of parameters, and 2) training effectively by ensuring quick convergence and high metric scores. Our results on DistilBERT demonstrate that learners perform on par with or surpass the baselines. Learners train 7x fewer parameters than state-of-the-art methods on GLUE. On CoLA, learners fine-tune 20% faster, and have significantly lower resource utilization.

Submitted 3 August, 2022; originally announced August 2022.

Comments: 8 pages, 9 figures, 2 tables, presented at ICML 2022 workshop on Hardware-Aware Efficient Training (HAET 2022)

arXiv:2205.01541 [pdf, other]

doi 10.1109/ISCAS48785.2022.9937567

Efficient Fine-Tuning of BERT Models on the Edge

Authors: Danilo Vucetic, Mohammadreza Tayaranian, Maryam Ziaeefard, James J. Clark, Brett H. Meyer, Warren J. Gross

Abstract: Resource-constrained devices are increasingly the deployment targets of machine learning applications. Static models, however, do not always suffice for dynamic environments. On-device training of models allows for quick adaptability to new scenarios. With the increasing size of deep neural networks, as noted with the likes of BERT and other natural language processing models, comes increased resource requirements, namely memory, computation, energy, and time. Furthermore, training is far more resource intensive than inference. Resource-constrained on-device learning is thus doubly difficult, especially with large BERT-like models. By reducing the memory usage of fine-tuning, pre-trained BERT models can become efficient enough to fine-tune on resource-constrained devices. We propose Freeze And Reconfigure (FAR), a memory-efficient training regime for BERT-like models that reduces the memory usage of activation maps during fine-tuning by avoiding unnecessary parameter updates. FAR reduces fine-tuning time on the DistilBERT model and CoLA dataset by 30%, and time spent on memory operations by 47%. More broadly, reductions in metric performance on the GLUE and SQuAD datasets are around 1% on average.

Submitted 3 May, 2022; originally announced May 2022.

Comments: 4 pages, 2 figures, 3 tables. To be published in ISCAS 2022 and made available on IEEE Xplore

arXiv:2205.00030 [pdf, other]

GRAND for Rayleigh Fading Channels

Authors: Syed Mohsin Abbas, Marwan Jalaleddine, Warren J. Gross

Abstract: Guessing Random Additive Noise Decoding (GRAND) is a code-agnostic decoding technique for short-length and high-rate channel codes. GRAND tries to guess the channel noise by generating test error patterns (TEPs), and the sequence of the TEPs is the main difference between different GRAND variants. In this work, we extend the application of GRAND to multipath frequency non-selective Rayleigh fading communication channels, and we refer to this GRAND variant as Fading-GRAND. The proposed Fading-GRAND adapts its TEP generation to the fading conditions of the underlying communication channel, outperforming traditional channel code decoders in scenarios with $L$ spatial diversity branches as well as scenarios with no diversity. Numerical simulation results show that the Fading-GRAND outperforms the traditional Berlekamp-Massey (B-M) decoder for decoding BCH code $(127,106)$ and BCH code $(127,113)$ by $\mathbf{0.5\sim6.5}$ dB at a target FER of $10^{-7}$. Similarly, Fading-GRAND outperforms GRANDAB, the hard-input variation of GRAND, by $0.2\sim8$ dB at a target FER of $10^{-7}$ with CRC $(128,104)$ code and RLC $(128,104)$. Furthermore, the average complexity of Fading-GRAND, at $\frac{E_b}{N_0}$ corresponding to target FER of $10^{-7}$, is $\frac{1}{2}\times\sim \frac{1}{46}\times$ the complexity of GRANDAB.

Submitted 30 November, 2022; v1 submitted 29 April, 2022; originally announced May 2022.

Comments: To appear in IEEE Global Communications Conference (GLOBECOM) 2022 Workshops

Journal ref: GLOBECOM 2022 Workshops
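
The noise-guessing idea described in the GRAND abstract above can be sketched with a toy hard-input decoder. This is not Fading-GRAND: it simply tries test error patterns (TEPs) in order of increasing Hamming weight and gives up beyond a weight limit. The (7,4) Hamming parity-check matrix `H`, the function names, and the `max_weight` limit are all assumptions chosen to keep the example small.

```python
from itertools import combinations

# Parity-check matrix of the (7,4) Hamming code, used purely as a toy example.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]


def is_codeword(word):
    """A word is a codeword when every parity check evaluates to zero (mod 2)."""
    return all(sum(h * b for h, b in zip(row, word)) % 2 == 0 for row in H)


def error_patterns(n, max_weight):
    """Yield test error patterns of length n in increasing Hamming weight."""
    for weight in range(max_weight + 1):
        for positions in combinations(range(n), weight):
            pattern = [0] * n
            for p in positions:
                pattern[p] = 1
            yield pattern


def grand_decode(received, max_weight=3):
    """Guess the noise: remove each TEP and stop at the first valid codeword."""
    for tep in error_patterns(len(received), max_weight):
        candidate = [r ^ e for r, e in zip(received, tep)]
        if is_codeword(candidate):
            return candidate, tep
    return None, None  # abandon: no codeword found within the weight limit


# Codeword 1110000 with a single bit flipped in the last position.
decoded, noise = grand_decode([1, 1, 1, 0, 0, 0, 1])
print(decoded, noise)  # [1, 1, 1, 0, 0, 0, 0] [0, 0, 0, 0, 0, 0, 1]
```

Because the decoder only needs a codebook membership test, the same loop works for any linear code once `H` is swapped; changing the order in which `error_patterns` emits TEPs is what distinguishes one GRAND variant from another.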

arXiv:2202.12422 [pdf, other]

Standard Deviation-Based Quantization for Deep Neural Networks

Authors: Amir Ardakani, Arash Ardakani, Brett Meyer, James J. Clark, Warren J. Gross

Abstract: Quantization of deep neural networks is a promising approach that reduces the inference cost, making it feasible to run deep networks on resource-restricted devices. Inspired by existing methods, we propose a new framework to learn the quantization intervals (discrete values) using the knowledge of the network's weight and activation distributions, i.e., standard deviation. Furthermore, we propose a novel base-2 logarithmic quantization scheme to quantize weights to power-of-two discrete values. Our proposed scheme allows us to replace resource-hungry high-precision multipliers with simple shift-add operations. According to our evaluations, our method outperforms existing work on CIFAR10 and ImageNet datasets and even achieves better accuracy performance with 3-bit weights and activations when compared to the full-precision models. Moreover, our scheme simultaneously prunes the network's parameters and allows us to flexibly adjust the pruning ratio during the quantization process.

Submitted 24 February, 2022; originally announced February 2022.
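
Why power-of-two weights remove the need for multipliers can be shown with a short sketch. It does not reproduce the paper's learned, standard-deviation-based intervals: the rounding rule, the exponent range (`min_exp`, `max_exp`), and the choice to prune weights below the smallest exponent are assumptions made for illustration, as are the function names.

```python
import math


def quantize_pow2(w: float, min_exp: int = -6, max_exp: int = 0):
    """Round |w| to the nearest power of two in the log domain.

    Returns (sign, exponent). A sign of 0 means the weight was pruned to zero,
    which here happens when |w| falls below 2**min_exp (an assumed rule).
    """
    if w == 0.0:
        return 0, 0
    exp = round(math.log2(abs(w)))
    if exp < min_exp:
        return 0, 0
    exp = min(exp, max_exp)
    return (1 if w > 0 else -1), exp


def shift_multiply(x: int, sign: int, exp: int) -> int:
    """Compute x * sign * 2**exp on an integer activation using shifts only."""
    if sign == 0:
        return 0
    shifted = x << exp if exp >= 0 else x >> -exp
    return sign * shifted


sign, exp = quantize_pow2(0.24)        # 0.24 is approximated by 2**-2 = 0.25
print(shift_multiply(64, sign, exp))   # 64 * 0.25 computed as 64 >> 2 = 16
```

A dot product over such weights reduces to shifting each activation and accumulating the results, which is the shift-add structure the abstract refers to.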

arXiv:2110.13776 [pdf, other]

doi 10.1109/TVLSI.2022.3153605

High-Throughput and Energy-Efficient VLSI Architecture for Ordered Reliability Bits GRAND

Authors: Syed Mohsin Abbas, Thibaud Tonnellier, Furkan Ercan, Marwan Jalaleddine, Warren J. Gross

Abstract: Ultra-reliable low-latency communication (URLLC), a major 5G New-Radio use case, is the key enabler for applications with strict reliability and latency requirements. These applications necessitate the use of short-length and high-rate codes. Guessing Random Additive Noise Decoding (GRAND) is a recently proposed Maximum Likelihood (ML) decoding technique for these short-length and high-rate codes. Rather than decoding the received vector, GRAND tries to infer the noise that corrupted the transmitted codeword during transmission through the communication channel. As a result, GRAND can decode any code, structured or unstructured. GRAND has hard-input as well as soft-input variants. Among these variants, Ordered Reliability Bits GRAND (ORBGRAND) is a soft-input variant that outperforms hard-input GRAND and is suitable for parallel hardware implementation. This work reports the first hardware architecture for ORBGRAND, which achieves an average throughput of up to $42.5$ Gbps for a code length of $128$ at a target FER of $10^{-7}$. Furthermore, the proposed hardware can be used to decode any code as long as the length and rate constraints are met. In comparison to GRANDAB, a hard-input variant of GRAND, the proposed architecture enhances decoding performance by at least $2$ dB. When compared to the state-of-the-art fast dynamic successive cancellation flip decoder (Fast-DSCF) using a 5G polar $(128,105)$ code, the proposed ORBGRAND VLSI implementation has $49\times$ higher average throughput, $32\times$ more energy efficiency, and $5\times$ more area efficiency while maintaining similar decoding performance.

Submitted 11 March, 2022; v1 submitted 26 October, 2021; originally announced October 2021.

Comments: Accepted for inclusion in IEEE Transactions on Very Large Scale Integration Systems (TVLSI), 2022. For the updated version, please see IEEE Xplore

Journal ref: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2022

arXiv:2109.12239 [pdf, other]

doi 10.1109/ACCESS.2021.3140151

Fast Successive-Cancellation List Flip Decoding of Polar Codes

Authors: Nghia Doan, Seyyed Ali Hashemi, Warren J. Gross

Abstract: This work presents a fast successive-cancellation list flip (Fast-SCLF) decoding algorithm for polar codes that addresses the high latency issue associated with the successive-cancellation list flip (SCLF) decoding algorithm. We first propose a bit-flipping strategy tailored to the state-of-the-art fast successive-cancellation list (FSCL) decoding that avoids tree-traversal in the binary tree representation of SCLF, thus reducing the latency of the decoding process. We then derive a parameterized path selection error model to accurately estimate the bit index at which the correct decoding path is eliminated from the initial FSCL decoding. The trainable parameter is optimized online based on an efficient supervised learning framework. Simulation results show that for a polar code of length 512 with 256 information bits, with similar error-correction performance and memory consumption, the proposed Fast-SCLF decoder reduces up to $73.4\%$ of the average decoding latency of the SCLF decoder with the same list size at the frame error rate of $10^{-4}$, while incurring a maximum computational complexity overhead of $27.6\%$. For the same polar code of length 512 with 256 information bits and at practical signal-to-noise ratios, the proposed decoder with list size 4 reduces $89.3\%$ and $43.7\%$ of the average complexity and decoding latency of the FSCL decoder with list size 32 (FSCL-32), respectively, while also reducing $83.2\%$ of the memory consumption of FSCL-32. The significant improvements of the proposed decoder come at the cost of $0.07$ dB error-correction performance degradation compared with FSCL-32.

Submitted 23 January, 2022; v1 submitted 24 September, 2021; originally announced September 2021.

Comments: Published in IEEE Access, Volume: 10, Page(s): 5568 - 5584, Date of Publication: 04 January 2022

arXiv:2109.12225 [pdf, other]

doi 10.1109/TVLSI.2022.3223692

List-GRAND: A practical way to achieve Maximum Likelihood Decoding

Authors: Syed Mohsin Abbas, Marwan Jalaleddine, Warren J. Gross

Abstract: Guessing Random Additive Noise Decoding (GRAND) is a recently proposed universal Maximum Likelihood (ML) decoder for short-length and high-rate linear block-codes. Soft-GRAND (SGRAND) is a prominent soft-input GRAND variant, outperforming the other GRAND variants in decoding performance; nevertheless, SGRAND is not suitable for parallel hardware implementation. Ordered Reliability Bits-GRAND (ORBGRAND) is another soft-input GRAND variant that is suitable for parallel hardware implementation; however, it has lower decoding performance than SGRAND. In this paper, we propose List-GRAND (LGRAND), a technique for enhancing the decoding performance of ORBGRAND to match the ML decoding performance of SGRAND. Numerical simulation results show that LGRAND enhances ORBGRAND's decoding performance by $0.5-0.75$ dB for channel-codes of various classes at a target FER of $10^{-7}$. For linear block codes of length $127/128$ and different code-rates, LGRAND's VLSI implementation can achieve an average information throughput of $47.27-51.36$ Gbps. In comparison to ORBGRAND's VLSI implementation, the proposed LGRAND hardware has a $4.84\%$ area overhead.

Submitted 2 December, 2022; v1 submitted 24 September, 2021; originally announced September 2021.

Comments: This article has been accepted for publication in IEEE Transactions on Very Large Scale Integration (VLSI) Systems. This is the author's version which has not been fully edited and content may change prior to final publication. Citation information: DOI 10.1109/TVLSI.2022.3223692

Journal ref: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 2022

arXiv:2109.02122 [pdf, other]

Decoding Reed-Muller Codes with Successive Codeword Permutations

Authors: Nghia Doan, Seyyed Ali Hashemi, Marco Mondelli, Warren J. Gross

Abstract: A novel recursive list decoding (RLD) algorithm for Reed-Muller (RM) codes based on successive permutations (SP) of the codeword is presented. A low-complexity SP scheme applied to a subset of the symmetry group of RM codes is first proposed to carefully select a good codeword permutation on the fly. Then, the proposed SP technique is integrated into an improved RLD algorithm that initializes different decoding paths with random codeword permutations, which are sampled from the full symmetry group of RM codes. Finally, efficient latency and complexity reduction schemes are introduced that virtually preserve the error-correction performance of the proposed decoder. Simulation results demonstrate that at the target frame error rate of $10^{-3}$ for the RM code of length $256$ with $163$ information bits, the proposed decoder reduces $6\%$ of the computational complexity and $22\%$ of the decoding latency of the state-of-the-art semi-parallel simplified successive-cancellation decoder with fast Hadamard transform (SSC-FHT) that uses $96$ permutations from the full symmetry group of RM codes, while relatively maintaining the error-correction performance and memory consumption of the semi-parallel permuted SSC-FHT decoder.

Submitted 20 September, 2022; v1 submitted 5 September, 2021; originally announced September 2021.

Comments: Accepted for publication in IEEE Transactions on Communications

arXiv:2108.12563 [pdf, other]

High-Throughput VLSI Architecture for GRAND Markov Order

Authors: Syed Mohsin Abbas, Marwan Jalaleddine, Warren J. Gross

Abstract: Guessing Random Additive Noise Decoding (GRAND) is a recently proposed Maximum Likelihood (ML) decoding technique. Irrespective of the structure of the error-correcting code, GRAND tries to guess the noise that corrupted the codeword in order to decode any linear error-correcting block code. GRAND Markov Order (GRAND-MO) is a variant of GRAND that is useful for decoding error-correcting codes transmitted over communication channels with memory, which are vulnerable to burst noise. Usually, interleavers and de-interleavers are used in communication systems to mitigate the effects of channel memory. Interleaving and de-interleaving introduce undesirable latency, which increases with channel memory. To prevent this added latency penalty, GRAND-MO can be directly used on the hard demodulated channel signals. This work reports the first GRAND-MO hardware architecture, which achieves an average throughput of up to $52$ Gbps and $64$ Gbps for a code length of $128$ and $79$, respectively. Compared to GRANDAB, a hard-input variant of GRAND, the proposed architecture achieves a $3$ dB gain in decoding performance for a target FER of $10^{-5}$. Similarly, comparing the GRAND-MO decoder with a decoder tailored for a $(79,64)$ BCH code showed that the proposed architecture achieves $33\%$ higher worst-case throughput and a $2$ dB gain in decoding performance.

Submitted 27 August, 2021; originally announced August 2021.

Comments: 6 pages; Accepted to be Published at 2021 International Workshop on Signal Processing Systems (SiPS)
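
The noise-guessing idea described in this abstract can be illustrated with a minimal hard-input sketch. It is not the GRAND-MO architecture from the paper: it tests error patterns in order of increasing Hamming weight and abandons above a weight limit (in the spirit of GRANDAB), whereas GRAND-MO orders its guesses for bursty channels. The (7,4) Hamming parity-check matrix and the names `H` and `grandab_decode` are assumptions chosen only for illustration.

```python
from itertools import combinations

import numpy as np

# Parity-check matrix of the (7,4) Hamming code (illustrative choice only).
H = np.array([
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
], dtype=np.uint8)


def grandab_decode(y, H, max_weight=3):
    """Guess error patterns of increasing Hamming weight until one yields a codeword.

    Returns (codeword, queries); codeword is None if the search is abandoned
    after all patterns of weight <= max_weight have been tried.
    """
    y = np.asarray(y, dtype=np.uint8)
    n = y.size
    queries = 0
    for weight in range(max_weight + 1):
        for positions in combinations(range(n), weight):
            # Build the test error pattern with ones at the guessed positions.
            e = np.zeros(n, dtype=np.uint8)
            for p in positions:
                e[p] = 1
            candidate = y ^ e
            queries += 1
            # A zero syndrome means the candidate is a valid codeword.
            if not np.any((H @ candidate) % 2):
                return candidate, queries
    return None, queries
```

The decoder only needs the parity-check matrix for the codebook membership test, which is why the same loop applies to any linear block code. Hardware implementations such as the one in the abstract test many patterns in parallel rather than one at a time.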

arXiv:2108.12550 [pdf, ps, other]

Successive-Cancellation Decoding of Reed-Muller Codes with Fast Hadamard Transform

Authors: Nghia Doan, Seyyed Ali Hashemi, Warren J. Gross

Abstract: A novel permuted fast successive-cancellation list decoding algorithm with fast Hadamard transform (FHT-FSCL) is presented. The proposed decoder initializes $L$ $(L\ge1)$ active decoding paths with $L$ random codeword permutations sampled from the full symmetry group of the codes. The path extension in the permutation domain is carried out until the first constituent RM code of order $1$ is visited. Conventional path extension of the successive-cancellation list decoder is then utilized in the information bit domain. The simulation results show that for an RM code of length $512$ with $46$ information bits, by running $20$ parallel permuted FHT-FSCL decoders with $L=4$, we reduce $72\%$ of the computational complexity, $22\%$ of the decoding latency, and $84\%$ of the memory consumption of the state-of-the-art simplified successive-cancellation decoder that uses $512$ permutations sampled from the full symmetry group of the code, with similar error-correction performance at the target frame error rate of $10^{-4}$.

Submitted 7 February, 2022; v1 submitted 27 August, 2021; originally announced August 2021.

Comments: Submitted to an IEEE journal for possible publication
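
As background for the fast Hadamard transform step mentioned in this abstract, the following is a minimal sketch of FHT-based maximum-likelihood decoding of a standalone first-order RM code. It is the textbook technique, not the FHT-FSCL decoder from the paper; the function names `fht`, `rm1_encode`, and `rm1_decode` and the bit-to-symbol mapping (bit 0 to +1, bit 1 to -1) are assumptions made for illustration.

```python
import numpy as np


def fht(x):
    """Fast Walsh-Hadamard transform (natural order) of a length-2^m vector."""
    x = np.array(x, dtype=float)
    n = x.size
    h = 1
    while h < n:
        # Butterfly stage: combine blocks of size h into blocks of size 2h.
        for start in range(0, n, 2 * h):
            a = x[start:start + h].copy()
            b = x[start + h:start + 2 * h].copy()
            x[start:start + h] = a + b
            x[start + h:start + 2 * h] = a - b
        h *= 2
    return x


def rm1_encode(u0, u, m):
    """Codeword bit j is u0 XOR the parity of (u AND j), for j = 0 .. 2^m - 1."""
    return np.array([u0 ^ (bin(u & j).count("1") & 1) for j in range(2 ** m)])


def rm1_decode(y):
    """ML decoding of RM(1, m) from bipolar soft values y (bit 0 -> +1, bit 1 -> -1)."""
    spectrum = fht(y)
    # The largest-magnitude coefficient identifies the linear part of the message.
    u = int(np.argmax(np.abs(spectrum)))
    # Its sign identifies the constant (all-ones) term.
    u0 = int(spectrum[u] < 0)
    return u0, u
```

For example, encoding `(u0, u) = (1, 5)` with `m = 3`, mapping the codeword to bipolar values with `1.0 - 2.0 * codeword`, and inverting the sign of one received value still decodes to `(1, 5)`, because the correct coefficient keeps the largest magnitude.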

arXiv:2107.08991 [pdf, ps, other]

A Tree Search Approach for Maximum-Likelihood Decoding of Reed-Muller Codes

Authors: Seyyed Ali Hashemi, Nghia Doan, Warren J. Gross, John Cioffi, Andrea Goldsmith

Abstract: A low-complexity tree search approach is presented that achieves the maximum-likelihood (ML) decoding performance of Reed-Muller (RM) codes. The proposed approach generates a bit-flipping tree that is traversed to find the ML decoding result by performing successive-cancellation decoding after each node visit. A depth-first search (DFS) and a breadth-first search (BFS) scheme are developed and a log-likelihood-ratio-based bit-flipping metric is utilized to avoid redundant node visits in the tree. Several enhancements to the proposed algorithm are presented to further reduce the number of node visits. Simulation results confirm that the BFS scheme provides a lower average number of node visits than the existing tree search approach to decode RM codes.

Submitted 19 July, 2021; originally announced July 2021.

arXiv:2105.07115 [pdf, other]

doi 10.1109/ICASSP39728.2021.9414908

High-Throughput VLSI architecture for Soft-Decision decoding with ORBGRAND

Authors: Syed Mohsin Abbas, Thibaud Tonnellier, Furkan Ercan, Marwan Jalaleddine, Warren J. Gross

Abstract: Guessing Random Additive Noise Decoding (GRAND) is a recently proposed approximate Maximum Likelihood (ML) decoding technique that can decode any linear error-correcting block code. Ordered Reliability Bits GRAND (ORBGRAND) is a powerful variant of GRAND, which outperforms the original GRAND technique by generating error patterns in a specific order. Moreover, their simplicity at the algorithm level renders the GRAND family a desirable candidate for applications that demand very high throughput. This work reports the first-ever hardware architecture for ORBGRAND, which achieves an average throughput of up to $42.5$ Gbps for a code length of $128$ at an SNR of $10$ dB. Moreover, the proposed hardware can be used to decode any code that satisfies the length and rate constraints. Compared to the state-of-the-art fast dynamic successive cancellation flip decoder (Fast-DSCF) using a 5G polar $(128,105)$ code, the proposed VLSI implementation has $49\times$ more average throughput while maintaining similar decoding performance.

Submitted 14 May, 2021; originally announced May 2021.

Comments: Please note that a mislabeling in Fig. 1 has occurred in the IEEE Xplore version of this paper. This error has been corrected in this version of the manuscript. (Accepted in ICASSP 2021)

arXiv:2011.03177 [pdf, other]

On Systematic Polarization-Adjusted Convolutional (PAC) Codes

Authors: Thibaud Tonnellier, Warren J. Gross

Abstract: Polarization-adjusted convolutional (PAC) codes were recently proposed and aroused the interest of the channel coding community because they were shown to approach theoretical bounds for the (128,64) code size. In this letter, we propose systematic PAC codes. Thanks to the systematic property, improvement in the bit-error rate of up to 0.2 dB is observed, while preserving the frame-error rate performance. Moreover, a genetic-algorithm-based construction method targeted to approach the theoretical bound is provided. It is then shown that, using the proposed construction method, systematic and non-systematic PAC codes can approach the theoretical bound even for higher code sizes such as (256,128).

Submitted 5 November, 2020; originally announced November 2020.

Comments: 5 pages, 5 figures

arXiv:2009.08547 [pdf, ps, other]

doi 10.1109/TSP.2020.3023582

Practical Dynamic SC-Flip Polar Decoders: Algorithm and Implementation

Authors: Furkan Ercan, Thibaud Tonnellier, Nghia Doan, Warren J. Gross

Abstract: SC-Flip (SCF) is a low-complexity polar code decoding algorithm with improved performance, and is an alternative to high-complexity (CRC)-aided SC-List (CA-SCL) decoding. However, the performance improvement of SCF is limited since it can correct up to only one channel error ($ω=1$). The Dynamic SCF (DSCF) algorithm addresses this problem by tackling multiple errors ($ω\geq 1$), but it requires logarithmic and exponential computations, which make it infeasible for practical applications. In this work, we propose simplifications and approximations to make DSCF practically feasible. First, we reduce the transcendental computations of DSCF decoding to a constant approximation. Then, we show how to incorporate special node decoding techniques into the DSCF algorithm, creating the Fast-DSCF decoding. Next, we reduce the search span within the special nodes to further reduce the computational complexity. Following this, we describe a hardware architecture for the Fast-DSCF decoder, in which we introduce additional simplifications such as metric normalization and sorter length reduction. All the simplifications and approximations are shown to have minimal impact on the error-correction performance, and the reported Fast-DSCF decoder is the only SCF-based architecture that can correct multiple errors. The Fast-DSCF decoders synthesized using TSMC $65$nm CMOS technology can achieve a $1.25$, $1.06$ and $0.93$ Gbps throughput for $ω\in \{1,2,3\}$, respectively. Compared to the state-of-the-art fast CA-SCL decoders with equivalent FER performance, the proposed decoders are up to $5.8\times$ more area-efficient. Finally, observations on energy dissipation indicate that the Fast-DSCF is more energy-efficient than its CA-SCL-based counterparts.

Submitted 21 September, 2020; v1 submitted 17 September, 2020; originally announced September 2020.

Comments: Accepted for publication in IEEE TSP

Journal ref: IEEE Transactions on Signal Processing, 2020

arXiv:2009.06796 [pdf, other]

Decoding Polar Codes with Reinforcement Learning

Authors: Nghia Doan, Seyyed Ali Hashemi, Warren Gross

Abstract: In this paper we address the problem of selecting factor-graph permutations of polar codes under belief propagation (BP) decoding to significantly improve the error-correction performance of the code. In particular, we formalize the factor-graph permutation selection as the multi-armed bandit problem in reinforcement learning and propose a decoder that acts like an online-learning agent that learns to select the good factor-graph permutations during the course of decoding. We use state-of-the-art algorithms for the multi-armed bandit problem and show that for a 5G polar code of length 128 with 64 information bits, the proposed decoder has an error-correction performance gain of around 0.125 dB at the target frame error rate of $10^{-4}$, when compared to the approach that randomly selects the factor-graph permutations.

Submitted 14 September, 2020; originally announced September 2020.

Comments: Accepted for presentation at IEEE GLOBECOM 2020
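
The bandit formulation in this abstract can be pictured with a small sketch in which each arm stands for one factor-graph permutation and the reward is 1 when decoding with that permutation succeeds. The abstract does not name the bandit algorithms used, so the epsilon-greedy rule below is a swapped-in textbook choice; the function name, the simulated per-arm success probabilities, and the default parameters are all hypothetical and stand in for a real BP decoder.

```python
import random


def epsilon_greedy_select(success_prob, rounds=2000, epsilon=0.1, seed=0):
    """Simulate epsilon-greedy arm selection.

    success_prob[a] is a made-up probability that decoding succeeds when
    permutation (arm) a is used; a real system would run the decoder instead.
    Returns (pulls, value): pull counts and running mean reward per arm.
    """
    rng = random.Random(seed)
    num_arms = len(success_prob)
    pulls = [0] * num_arms
    value = [0.0] * num_arms
    for _ in range(rounds):
        if rng.random() < epsilon:
            # Explore: try a uniformly random permutation.
            arm = rng.randrange(num_arms)
        else:
            # Exploit: use the permutation with the best estimate so far.
            arm = max(range(num_arms), key=lambda a: value[a])
        reward = 1.0 if rng.random() < success_prob[arm] else 0.0
        pulls[arm] += 1
        # Incremental update of the running mean reward for this arm.
        value[arm] += (reward - value[arm]) / pulls[arm]
    return pulls, value
```

Calling `epsilon_greedy_select([0.2, 0.5, 0.8])` returns how often each simulated permutation was chosen and its estimated success rate. The baseline in the abstract, random permutation selection, corresponds to `epsilon=1.0` in this sketch.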

arXiv:2007.15647 [pdf, ps, other]

Fast Thresholded SC-Flip Decoding of Polar Codes

Authors: Furkan Ercan, Warren J. Gross

Abstract: The SC-Flip (SCF) decoding algorithm shares the attention with the common polar code decoding approaches due to its low complexity and improved error-correction performance. However, the inefficient criterion for locating the correct bit-flipping position in SCF decoding limits its improvements. Due to its improved bit-flipping criterion, the Thresholded SCF (TSCF) decoding algorithm exhibits superior error-correction performance and lower computational complexity than SCF decoding. However, the parameters of TSCF decoding depend on multiple channel and code parameters, and are obtained via Monte-Carlo simulations. Our main goal is to realize TSCF decoding as a practical polar decoder implementation. To this end, we first realize an approximated threshold value that is independent of the code parameters and precomputations. The proposed approximation has negligible error-correction performance degradation on the TSCF decoding. Then, we validate an alternative approach for forming a critical set that does not require precomputations, which also paves the way to the implementation of the Fast-TSCF decoder. Compared to the existing fast SCF implementations, the proposed Fast-TSCF decoder has $0.24$ to $0.41$ dB performance gain at a frame error rate of $10^{-3}$, without any extra cost. Compared to the TSCF decoding, Fast-TSCF does not depend on precomputations and requires $87\%$ fewer decoding steps. Finally, implementation results in TSMC 65nm CMOS technology show that the Fast-TSCF decoder is $20\%$ and $82\%$ more area-efficient than the state-of-the-art fast SCF and fast SC-List decoder architectures, respectively.

Submitted 30 July, 2020; originally announced July 2020.

arXiv:2007.07328 [pdf, other]

High-Throughput VLSI Architecture for GRAND

Authors: Syed Mohsin Abbas, Thibaud Tonnellier, Furkan Ercan, Warren J. Gross

Abstract: Guessing Random Additive Noise Decoding (GRAND) is a recently proposed universal decoding algorithm for linear error correcting codes. Since GRAND does not depend on the structure of the code, it can be used for any code encountered in contemporary communication standards or may even be used for random linear network coding. This property makes this new algorithm particularly appealing. Instead of trying to decode the received vector, GRAND attempts to identify the noise that corrupted the codeword. To that end, GRAND relies on the generation of test error patterns that are successively applied to the received vector. In this paper, we propose the first hardware architecture for the GRAND algorithm. Considering GRAND with ABandonment (GRANDAB) that limits the number of test patterns, the proposed architecture only needs $2+\sum_{i=2}^{n} \left\lfloor\frac{i}{2}\right\rfloor$ time steps to perform the $\sum_{i=1}^3 \binom{n}{i}$ queries required when $\text{AB}=3$. For a code length of $128$, our proposed hardware architecture demonstrates only a fraction ($1.2\%$) of the total number of performed queries as time steps. Synthesis results using TSMC 65nm CMOS technology show that average throughputs of $32$ Gbps to $64$ Gbps can be achieved at an SNR of $10$ dB for a code length of $128$ and code rates higher than $0.75$, transmitted over an AWGN channel. Comparisons with a decoder tailored for a $(79,64)$ BCH code show that the proposed architecture can achieve a slightly higher average throughput at high SNRs, while obtaining the same decoding performance.

Submitted 14 July, 2020; originally announced July 2020.

Comments: 6 pages, 6 figures, submitted to SiPS 2020
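
The two counts quoted in this abstract can be checked numerically for $n = 128$. The short sketch below evaluates the time-step expression $2+\sum_{i=2}^{n} \lfloor i/2 \rfloor$ and the query count $\sum_{i=1}^{3} \binom{n}{i}$ exactly as written in the abstract; the function names are illustrative and not taken from the paper.

```python
from math import comb


def grandab_time_steps(n):
    """Time steps from the abstract's expression for AB = 3: 2 + sum_{i=2}^{n} floor(i / 2)."""
    return 2 + sum(i // 2 for i in range(2, n + 1))


def grandab_queries(n, ab=3):
    """Number of test error patterns of Hamming weight 1 .. ab for code length n."""
    return sum(comb(n, i) for i in range(1, ab + 1))


n = 128
steps = grandab_time_steps(n)      # 4098
queries = grandab_queries(n)       # 349632
print(f"{steps} time steps for {queries} queries: {100 * steps / queries:.1f}%")
# prints "4098 time steps for 349632 queries: 1.2%"
```

The ratio $4098 / 349632 \approx 1.17\%$ matches the rounded $1.2\%$ figure stated for a code length of $128$.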

arXiv:2006.02012 [pdf, other]

doi 10.1007/s11265-018-1413-4

Operation Merging for Hardware Implementations of Fast Polar Decoders

Authors: Furkan Ercan, Thibaud Tonnellier, Carlo Condo, Warren J. Gross

Abstract: Polar codes are a class of linear block codes that provably achieve channel capacity. They have been selected as a coding scheme for the control channel of the enhanced mobile broadband (eMBB) scenario for $5^{\text{th}}$ generation wireless communication networks (5G) and are being considered for additional use scenarios. As a result, fast decoding techniques for polar codes are essential. Previous works targeting improved throughput for successive-cancellation (SC) decoding of polar codes are semi-parallel implementations that exploit special maximum-likelihood (ML) nodes. In this work, we present a new fast simplified SC (Fast-SSC) decoder architecture. Compared to a baseline Fast-SSC decoder, our solution is able to reduce the memory requirements. We achieve this through a more efficient memory utilization, which also makes it possible to execute multiple operations in a single clock cycle. Finally, we propose new special node merging techniques that improve the throughput further, and detail a new Fast-SSC-based decoder architecture to support merged operations. The proposed decoder reduces the operation sequence requirement by up to $39\%$, which reduces the number of time steps needed to decode a codeword by $35\%$. ASIC implementation results with 65 nm TSMC technology show that the proposed decoder has a throughput improvement of up to $31\%$ compared to previous Fast-SSC decoder architectures.

Submitted 2 June, 2020; originally announced June 2020.

Comments: 13 figures, 8 tables, 11 pages, published on November 3, 2018 in Journal of Signal Processing Systems (JSPS), vol. 91, pp. 995-1007

arXiv:1912.01086 [pdf, other]

Deep-Learning-Aided Successive-Cancellation Decoding of Polar Codes

Authors: Seyyed Ali Hashemi, Nghia Doan, Thibaud Tonnellier, Warren J. Gross

Abstract: A deep-learning-aided successive-cancellation list (DL-SCL) decoding algorithm for polar codes is introduced with deep-learning-aided successive-cancellation (DL-SC) decoding being a specific case of it. The DL-SCL decoder works by allowing additional rounds of SCL decoding when the first SCL decoding attempt fails, using a novel bit-flipping metric. The proposed bit-flipping metric exploits the inherent relations between the information bits in polar codes that are represented by a correlation matrix. The correlation matrix is then optimized using emerging deep-learning techniques. Performance results on a polar code of length 128 with 64 information bits concatenated with a 24-bit cyclic redundancy check show that the proposed bit-flipping metric in the proposed DL-SCL decoder requires up to 66% fewer multiplications and up to 36% fewer additions, without any need to perform transcendental functions, and by providing almost the same error-correction performance in comparison with the state of the art.

Submitted 2 December, 2019; originally announced December 2019.

Comments: 2019 Asilomar Conference on Signals, Systems, and Computers

arXiv:1908.05798 [pdf, other]

Efficient Flicker-Free FEC Codes using Knuth's Balancing Algorithm for VLC

Authors: Elie Ngomseu Mambou, Thibaud Tonnellier, Seyyed Ali Hashemi, Warren J. Gross

Abstract: Visible light communication (VLC) provides a short-range optical wireless communication through light-emitting diode (LED) lighting. Light beam flickering and dimming are among the challenges to be addressed in VLC. Conventional methods for generating flicker-free codes in VLC are based on run-length limited codes that have poor error correction performance, use lookup tables which are memory consuming, and have low transmission rates. In this paper, we propose an efficient construction of flicker-free forward error correction codes to tackle the issue of flickering in VLC. Our simulation results show that by using polar codes and at a dimming ratio of 50%, the proposed system generates flicker-free codes without using lookup tables, while having lower complexity and higher transmission rates than the standard VLC methods. For an information block length of 256, the error correction performance of the proposed scheme is $1.8$ dB and $0.9$ dB better than that of the regular schemes at the bit error rate of $10^{-6}$ for a rate of 0.44 and 0.23, respectively.

Submitted 15 August, 2019; originally announced August 2019.

Comments: 6 pages, 8 figures, conference
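
The balancing step named in this entry's title can be sketched as follows. Knuth's observation is that for any even-length binary word there is an index $k$ such that inverting the first $k$ bits yields a word with equally many ones and zeros, which corresponds to a 50% dimming ratio. The sketch shows only that search; how the paper combines it with polar codes, and how $k$ is conveyed to the receiver, is not covered here. The function name `knuth_balance` is an assumption.

```python
def knuth_balance(bits):
    """Invert the shortest prefix of an even-length bit list that balances it.

    Returns (balanced_word, k), where k is the number of inverted leading bits.
    """
    n = len(bits)
    if n % 2 != 0:
        raise ValueError("balancing needs an even-length word")
    word = list(bits)
    ones = sum(word)
    for k in range(n + 1):
        # At this point the first k bits of the input have been inverted.
        if 2 * ones == n:
            return word, k
        if k < n:
            # Inverting bit k changes the weight by exactly +1 or -1.
            ones += 1 - 2 * word[k]
            word[k] ^= 1
    raise AssertionError("unreachable: a balancing index always exists")
```

For the input `[1, 1, 1, 1, 0, 1, 1, 0]` the function returns `([0, 0, 1, 1, 0, 1, 1, 0], 2)`. Because each inversion moves the weight by one, and inverting all $n$ bits maps weight $w$ to $n - w$, the weight must pass through $n/2$ at some prefix length, so no lookup table is needed.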

arXiv:1907.11563 [pdf, other]

Neural Dynamic Successive Cancellation Flip Decoding of Polar Codes

Authors: Nghia Doan, Seyyed Ali Hashemi, Furkan Ercan, Thibaud Tonnellier, Warren Gross

Abstract: Dynamic successive cancellation flip (DSCF) decoding of polar codes is a powerful algorithm that can achieve the error correction performance of successive cancellation list (SCL) decoding, with a complexity that is close to that of successive cancellation (SC) decoding at practical signal-to-noise ratio (SNR) regimes. However, DSCF decoding requires costly transcendental computations which adversely affect its implementation complexity. In this paper, we first show that a direct application of common approximation schemes on the conventional DSCF decoding results in significant error-correction performance loss. We then introduce a training parameter and propose an approximation scheme which completely removes the need to perform transcendental computations in DSCF decoding, with almost no error-correction performance degradation.

Submitted 26 July, 2019; originally announced July 2019.

arXiv:1903.09203 [pdf, ps, other]

doi 10.1109/TSP.2019.2944738

Rate-Flexible Fast Polar Decoders

Authors: Seyyed Ali Hashemi, Carlo Condo, Marco Mondelli, Warren J. Gross

Abstract: Polar codes have gained extensive attention during the past few years and recently they have been selected for the next generation of wireless communications standards (5G). Successive-cancellation-based (SC-based) decoders, such as SC list (SCL) and SC flip (SCF), provide a reasonable error performance for polar codes at the cost of low decoding speed. Fast SC-based decoders, such as Fast-SSC, Fast-SSCL, and Fast-SSCF, identify the special constituent codes in a polar code graph off-line, produce a list of operations, store the list in memory, and feed the list to the decoder to decode the constituent codes in order efficiently, thus increasing the decoding speed. However, the list of operations is dependent on the code rate and as the rate changes, a new list is produced, making fast SC-based decoders not rate-flexible. In this paper, we propose a completely rate-flexible fast SC-based decoder by creating the list of operations directly in hardware, with low implementation complexity. We further propose a hardware architecture implementing the proposed method and show that the area occupation of the rate-flexible fast SC-based decoder in this paper is only $38\%$ of the total area of the memory-based base-line decoder when 5G code rates are supported.

Submitted 21 March, 2019; originally announced March 2019.

arXiv:1902.02402 [pdf, other]

doi 10.1109/ICC.2019.8761129

Asymmetric Construction of Low-Latency and Length-Flexible Polar Codes

Authors: Adam Cavatassi, Thibaud Tonnellier, Warren J. Gross

Abstract: Polar codes are a class of capacity-achieving error correcting codes that have been selected for use in enhanced mobile broadband in the 3GPP 5th generation (5G) wireless standard. Most polar code research examines the original Arikan polar coding scheme, which is limited in block length to powers of two. This constraint presents a considerable obstacle since practical applications call for all code lengths to be readily available. Puncturing and shortening techniques allow for flexible polar codes, while multi-kernel polar codes produce native code lengths that are powers of two and/or three. In this work, we propose a new low complexity coding scheme called asymmetric polar coding that allows for any arbitrary block length. We present details on the generator matrix, frozen set design, and decoding schedule. Our scheme offers flexible polar code lengths with decoding complexity lower than equivalent state-of-the-art length-compatible approaches under successive cancellation decoding. Further, asymmetric decoding complexity is directly dependent on the codeword length rather than the nearest valid polar code length. We compare our scheme with other length matching techniques, and simulations are presented. Results show that asymmetric polar codes present similar error correction performance to the competing schemes, while dividing the number of SC decoding operations by up to a factor of 2 using the same codeword length.

Submitted 6 February, 2019; originally announced February 2019.

Comments: To appear in IEEE International Conference on Communications 2019 (Submitted October 12, 2018), 6 pages

Journal ref: 2019 IEEE International Conference on Communications (ICC)

arXiv:1902.01922 [pdf, other]

Fast Decoding of Multi-Kernel Polar Codes

Authors: Adam Cavatassi, Thibaud Tonnellier, Warren J. Gross

Abstract: Polar codes are a class of linear error correction codes which provably attain channel capacity with infinite codeword lengths. Finite length polar codes have been adopted into the 5th Generation 3GPP standard for New Radio, though their native length is limited to powers of 2. Utilizing multiple polarizing matrices increases the length flexibility of polar codes at the expense of a more complicated decoding process. Successive cancellation (SC) is the standard polar decoder and has time complexity $\mathcal{O}(N \log N)$ due to its sequential nature. However, some patterns in the frozen set mirror simple linear codes with low latency decoders, which allows for a significant reduction in SC latency by pruning the decoding schedule. Such fast decoding techniques have only previously been used for traditional Arikan polar codes, causing multi-kernel polar codes to be an impractical length-compatibility technique with no fast decoders available. We propose fast simplified successive cancellation decoding node patterns, which are compatible with polar codes constructed with both the Arikan and ternary kernels, and generalization techniques. We outline efficient implementations, made possible by imposing constraints on ternary node parameters. We show that fast decoding of multi-kernel polar codes has at least 72% reduced latency compared with an SC decoder in all cases considered where codeword lengths are (96, 432, 768, 2304).

Submitted 5 February, 2019; originally announced February 2019.

Comments: To appear in IEEE WCNC 2019 (Submitted September 25, 2018), 6 pages
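
The length flexibility mentioned in the abstract above can be illustrated with a small check: a codeword length is reachable with binary (Arikan) and ternary kernels only if it factors as $N = 2^a 3^b$. The following is a minimal sketch under that assumption; the function name `kernel_exponents` is hypothetical, and the code says nothing about kernel ordering or the decoder itself.

```python
def kernel_exponents(n):
    """Return (a, b) with n == 2**a * 3**b, or None if no such pair exists."""
    if n < 1:
        return None
    a = b = 0
    while n % 2 == 0:  # strip binary-kernel stages
        n //= 2
        a += 1
    while n % 3 == 0:  # strip ternary-kernel stages
        n //= 3
        b += 1
    return (a, b) if n == 1 else None


# The four codeword lengths quoted in the abstract.
for length in (96, 432, 768, 2304):
    print(length, kernel_exponents(length))
```

Each of the four lengths needs at least one ternary stage, which is why none of them is available to a purely binary-kernel code.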

arXiv:1811.10396 [pdf, other]

Learning to Skip Ineffectual Recurrent Computations in LSTMs

Authors: Arash Ardakani, Zhengyun Ji, Warren J. Gross

Abstract: Long Short-Term Memory (LSTM) is a special class of recurrent neural network, which has shown remarkable successes in processing sequential data. The typical architecture of an LSTM involves a set of states and gates: the states retain information over arbitrary time intervals and the gates regulate the flow of information. Due to the recursive nature of LSTMs, they are computationally intensive to deploy on edge devices with limited hardware resources. To reduce the computational complexity of LSTMs, we first introduce a method that learns to retain only the important information in the states by pruning redundant information. We then show that our method can prune over 90% of information in the states without incurring any accuracy degradation over a set of temporal tasks. This observation suggests that a large fraction of the recurrent computations are ineffectual and can be avoided to speed up the process during the inference as they involve noncontributory multiplications/accumulations with zero-valued states. Finally, we introduce a custom hardware accelerator that can perform the recurrent computations using both sparse and dense states. Experimental measurements show that performing the computations using the sparse states speeds up the process and improves energy efficiency by up to 5.2x when compared to implementation results of the accelerator performing the computations using dense states.

Submitted 29 November, 2018; v1 submitted 9 November, 2018; originally announced November 2018.

Comments: Accepted as a conference paper for presentation at DATE 2019
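
To make the idea of skipping "noncontributory multiplications/accumulations with zero-valued states" concrete, here is a minimal NumPy sketch. It is not the paper's accelerator or pruning method: the function name `recurrent_matvec_skip_zeros`, the matrix sizes, and the example state are all assumptions chosen for illustration.

```python
import numpy as np


def recurrent_matvec_skip_zeros(W_h, h):
    """Compute W_h @ h while visiting only the nonzero entries of h.

    Returns the product and the number of state entries actually visited.
    """
    out = np.zeros(W_h.shape[0], dtype=np.result_type(W_h, h))
    nonzero = np.flatnonzero(h)
    for j in nonzero:
        # A zero state entry would contribute nothing, so its column is skipped.
        out += W_h[:, j] * h[j]
    return out, len(nonzero)


rng = np.random.default_rng(0)
W_h = rng.standard_normal((8, 10))
h = np.zeros(10)
h[3] = 0.7  # hypothetical pruned state: one of ten entries survives
result, visited = recurrent_matvec_skip_zeros(W_h, h)
```

With 90% of the state pruned, only one of the ten columns of `W_h` is touched, yet the result equals the dense product.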

arXiv:1811.00124 [pdf, other]

Neural Belief Propagation Decoding of CRC-Polar Concatenated Codes

Authors: Nghia Doan, Seyyed Ali Hashemi, Elie Ngomseu Mambou, Thibaud Tonnellier, Warren J. Gross

Abstract: Polar codes are the first class of error correcting codes that provably achieve the channel capacity at infinite code length. They were selected for use in the fifth generation of cellular mobile communications (5G). In practical scenarios such as 5G, a cyclic redundancy check (CRC) is concatenated with polar codes to improve their finite length performance. This is mostly beneficial for sequential successive-cancellation list decoders. However, for parallel iterative belief propagation (BP) decoders, CRC is only used as an early stopping criterion with incremental error-correction performance improvement. In this paper, we first propose a CRC-polar BP (CPBP) decoder by exchanging the extrinsic information between the factor graph of the polar code and that of the CRC. We then propose a neural CPBP (NCPBP) algorithm which improves the CPBP decoder by introducing trainable normalizing weights on the concatenated factor graph. Our results on a 5G polar code of length 128 show that at the frame error rate of 10^(-5) and with a maximum of 30 iterations, the error-correction performance of CPBP and NCPBP are approximately 0.25 dB and 0.5 dB better than that of the conventional CRC-aided BP decoder, respectively, while introducing almost no latency overhead.

Submitted 31 October, 2018; originally announced November 2018.

arXiv:1810.10902 [pdf, ps, other]

Learning from the Syndrome

Authors: Loren Lugosch, Warren J. Gross

Abstract: In this paper, we introduce the syndrome loss, an alternative loss function for neural error-correcting decoders based on a relaxation of the syndrome. The syndrome loss penalizes the decoder for producing outputs that do not correspond to valid codewords. We show that training with the syndrome loss yields decoders with consistently lower frame error rate for a number of short block codes, at little additional cost during training and no additional cost during inference. The proposed method does not depend on knowledge of the transmitted codeword, making it a promising tool for online adaptation to changing channel conditions.

Submitted 22 October, 2018; originally announced October 2018.

Comments: Accepted to Asilomar 2018 - special session on "Machine Learning for Wireless Systems"
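
A loss that penalizes outputs violating the parity checks, without reference to the transmitted codeword, can be sketched as follows. This is one possible relaxation, assumed here for illustration and not necessarily the one used in the paper: each check's "soft syndrome" is the product of the signs of its participating LLRs times their minimum magnitude (positive LLR meaning bit 0), and a hinge penalizes checks that are violated or only weakly satisfied. The names `soft_syndrome` and `syndrome_loss` and the (7,4) Hamming parity-check matrix are choices made for this example.

```python
import numpy as np


def soft_syndrome(H, llr):
    """Min-sum style relaxation of the syndrome, one value per parity check.

    A positive value means the hard decisions satisfy that check.
    Every row of H is assumed to contain at least one 1.
    """
    H = np.asarray(H, dtype=bool)
    llr = np.asarray(llr, dtype=float)
    out = np.empty(H.shape[0])
    for i, row in enumerate(H):
        vals = llr[row]  # LLRs of the bits taking part in check i
        out[i] = np.prod(np.sign(vals)) * np.min(np.abs(vals))
    return out


def syndrome_loss(H, llr):
    """Hinge penalty on the soft syndrome; uses no transmitted-codeword labels."""
    return float(np.mean(np.maximum(1.0 - soft_syndrome(H, llr), 0.0)))


# Parity-check matrix of the (7,4) Hamming code.
H = [[1, 0, 1, 0, 1, 0, 1],
     [0, 1, 1, 0, 0, 1, 1],
     [0, 0, 0, 1, 1, 1, 1]]

valid_llr = np.full(7, 2.0)   # confident all-zero codeword
invalid_llr = valid_llr.copy()
invalid_llr[0] = -2.0         # one flipped bit breaks the first check
```

The confident valid output incurs zero loss, while the single flipped bit yields a positive loss because the first check is violated.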

arXiv:1809.11086 [pdf, other]

Learning Recurrent Binary/Ternary Weights

Authors: Arash Ardakani, Zhengyun Ji, Sean C. Smithson, Brett H. Meyer, Warren J. Gross

Abstract: Recurrent neural networks (RNNs) have shown excellent performance in processing sequence data. However, they are both complex and memory intensive due to their recursive nature. These limitations make RNNs difficult to embed on mobile devices requiring real-time processes with limited hardware resources. To address the above issues, we introduce a method that can learn binary and ternary weights during the training phase to facilitate hardware implementations of RNNs. As a result, using this approach replaces all multiply-accumulate operations by simple accumulations, bringing significant benefits to custom hardware in terms of silicon area and power consumption. On the software side, we evaluate the performance (in terms of accuracy) of our method using long short-term memories (LSTMs) on various sequential models including sequence classification and language modeling. We demonstrate that our method achieves competitive results on the aforementioned tasks while using binary/ternary weights during the runtime. On the hardware side, we present custom hardware for accelerating the recurrent computations of LSTMs with binary/ternary weights. Ultimately, we show that LSTMs with binary/ternary weights can achieve up to 12x memory saving and 10x inference speedup compared to the full-precision implementation on an ASIC platform.

Submitted 24 January, 2019; v1 submitted 28 September, 2018; originally announced September 2018.

Comments: Published as a conference paper at ICLR 2019
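
The claim that ternary weights turn multiply-accumulate operations into plain accumulations can be seen in a few lines. The sketch below only covers inference with weights already restricted to {-1, 0, +1}; it does not show how the paper learns such weights, and the name `ternary_matvec` and the example values are assumptions.

```python
import numpy as np


def ternary_matvec(W_t, x):
    """Matrix-vector product for weights in {-1, 0, +1} using additions only.

    Each output is (sum of inputs with weight +1) - (sum of inputs with weight -1).
    """
    W_t = np.asarray(W_t)
    x = np.asarray(x, dtype=float)
    plus = np.where(W_t == 1, x, 0.0).sum(axis=1)
    minus = np.where(W_t == -1, x, 0.0).sum(axis=1)
    return plus - minus


W_t = np.array([[1, 0, -1],
                [0, -1, -1]])
x = np.array([2.0, 3.0, 5.0])
y = ternary_matvec(W_t, x)
```

No input is ever multiplied by a weight: each one is added, subtracted, or ignored, which is what makes the scheme attractive for custom hardware.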

arXiv:1809.03606 [pdf, ps, other]

Towards Practical Software Stack Decoding of Polar Codes

Authors: Harsh Aurora, Warren J. Gross

Abstract: The successive cancellation list decoding algorithm for polar codes yields near-optimal decoding performance at the cost of high implementation complexity. The successive cancellation stack algorithm has been shown to provide similar decoding performance at a much lower computational complexity, but software implementations report a sub-par T/P performance. In this technical report, the benefits of the fast simplified successive cancellation list decoder are extended to the stack algorithm, resulting in a throughput increase by two orders of magnitude over the traditional stack decoder.

Submitted 10 September, 2018; originally announced September 2018.

arXiv:1808.03616 [pdf, other]

Improved Bit-Flipping Algorithm for Successive Cancellation Decoding of Polar Codes

Authors: Furkan Ercan, Carlo Condo, Warren J. Gross

Abstract: The interest in polar codes has been increasing significantly since their adoption for use in the 5$^{\rm th}$ generation wireless systems standard. Successive cancellation (SC) decoding algorithm has low implementation complexity, but yields mediocre error-correction performance at the code lengths of interest. SC-Flip algorithm improves the error-correction performance of SC by identifying possibly erroneous decisions made by SC and re-iterates after flipping one bit. It was recently shown that only a portion of bit-channels are most likely to be in error. In this work, we investigate the average log-likelihood ratio (LLR) values and their distribution related to the erroneous bit-channels, and develop the Thresholded SC-Flip (TSCF) decoding algorithm. We also replace the LLR selection and sorting of SC-Flip with a comparator to reduce the implementation complexity. Simulation results demonstrate that for practical code lengths and a wide range of rates, TSCF shows negligible loss compared to the error-correction performance obtained when all single-errors are corrected. At matching maximum iterations, TSCF has an error-correction performance gain of up to $0.45$ dB compared with SC-Flip decoding. At matching error-correction performance, the computational complexity of TSCF is reduced by up to $40\%$ on average, and requires up to $5\times$ lower maximum number of iterations.

Submitted 26 September, 2018; v1 submitted 10 August, 2018; originally announced August 2018.

Comments: This version of the manuscript corrects an error in the previous ArXiv version. The corrections include all the simulations of SC-Flip-based and SC-Oracle decoders, along with associated comments in-text

arXiv:1807.03912 [pdf, other]

Decoding Reed-Muller and Polar Codes by Successive Factor Graph Permutations

Authors: Seyyed Ali Hashemi, Nghia Doan, Marco Mondelli, Warren J. Gross

Abstract: Reed-Muller (RM) and polar codes are a class of capacity-achieving channel coding schemes with the same factor graph representation. Low-complexity decoding algorithms fall short in providing a good error-correction performance for RM and polar codes. Using the symmetric group of RM and polar codes, the specific decoding algorithm can be carried out on multiple permutations of the factor graph to boost the error-correction performance. However, this approach results in high decoding complexity. In this paper, we first derive the total number of factor graph permutations on which the decoding can be performed. We further propose a successive permutation (SP) scheme which finds the permutations on the fly, thus the decoding always progresses on a single factor graph permutation. We show that SP can be used to improve the error-correction performance of RM and polar codes under successive-cancellation (SC) and SC list (SCL) decoding, while keeping the memory requirements of the decoders unaltered. Our results for RM and polar codes of length $128$ and rate $0.5$ show that when SP is used and at a target frame error rate of $10^{-4}$, up to $0.5$ dB and $0.1$ dB improvement can be achieved for RM and polar codes respectively.

Submitted 10 July, 2018; originally announced July 2018.

arXiv:1806.11195 [pdf, other]

On the Decoding of Polar Codes on Permuted Factor Graphs

Authors: Nghia Doan, Seyyed Ali Hashemi, Marco Mondelli, Warren J. Gross

Abstract: Polar codes are a channel coding scheme for the next generation of wireless communications standard (5G). The belief propagation (BP) decoder allows for parallel decoding of polar codes, making it suitable for high throughput applications. However, the error-correction performance of polar codes under BP decoding is far from the requirements of 5G. It has been shown that the error-correction performance of BP can be improved if the decoding is performed on multiple permuted factor graphs of polar codes. However, a different BP decoding schedule is required for each factor graph permutation, which results in the design of a different decoder for each permutation. Moreover, the different factor graph permutations are selected at random, which prevents the decoder from achieving a desirable error-correction performance with a small number of permutations. In this paper, we first show that the permutations on the factor graph can be mapped into suitable permutations on the codeword positions. As a result, we can make use of a single decoder for all the permutations. In addition, we introduce a method to construct a set of predetermined permutations which can provide the correct codeword if the decoding fails on the original permutation. We show that for the 5G polar code of length $1024$, the error-correction performance of the proposed decoder is more than $0.25$ dB better than that of the BP decoder with the same number of random permutations at the frame error rate of $10^{-4}$.

Submitted 28 June, 2018; originally announced June 2018.

arXiv:1802.00580 [pdf, ps, other]

A Multi-Kernel Multi-Code Polar Decoder Architecture

Authors: Gabriele Coppolino, Carlo Condo, Guido Masera, Warren J. Gross

Abstract: Polar codes have received increasing attention in the past decade, and have been selected for the next generation of wireless communication standard. Most research on polar codes has focused on codes constructed from a $2\times2$ polarization matrix, called binary kernel: codes constructed from binary kernels have code lengths that are bound to powers of $2$. A few recent works have proposed construction methods based on multiple kernels of different dimensions, not only binary ones, allowing code lengths different from powers of $2$. In this work, we design and implement the first multi-kernel successive cancellation polar code decoder in literature. It can decode any code constructed with binary and ternary kernels: the architecture, sized for a maximum code length $N_{max}$, is fully flexible in terms of code length, code rate and kernel sequence. The decoder can achieve frequency of more than $1$ GHz in $65$ nm CMOS technology, and a throughput of $615$ Mb/s. The area occupation ranges between $0.11$ mm$^2$ for $N_{max}=256$ and $2.01$ mm$^2$ for $N_{max}=4096$. Implementation results show an unprecedented degree of flexibility: with $N_{max}=4096$, up to $55$ code lengths can be decoded with the same hardware, along with any kernel sequence and code rate.

Submitted 2 February, 2018; originally announced February 2018.
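
The flexibility figure in the abstract above can be sanity-checked by enumerating every length of the form $2^a 3^b$ up to $N_{max}$. This is a sketch under the assumption that the supported lengths are exactly those products, excluding the trivial length 1; the abstract does not spell out the counting rule, and the function name `multi_kernel_lengths` is hypothetical.

```python
def multi_kernel_lengths(n_max):
    """List all lengths 2**a * 3**b with 1 < length <= n_max, in increasing order."""
    lengths = []
    p3 = 1
    while p3 <= n_max:
        n = p3
        while n <= n_max:
            if n > 1:  # skip the trivial length 1
                lengths.append(n)
            n *= 2
        p3 *= 3
    return sorted(lengths)


print(len(multi_kernel_lengths(4096)))  # prints 55
```

Under these assumptions the enumeration gives 55 lengths for $N_{max}=4096$, which agrees with the number quoted in the abstract; only 12 of them are powers of 2.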

arXiv:1801.01820 [pdf, other]

Design and Implementation of a Polar Codes Blind Detection Scheme

Authors: Carlo Condo, Seyyed Ali Hashemi, Arash Ardakani, Furkan Ercan, Warren J. Gross

Abstract: In blind detection, a set of candidates has to be decoded within a strict time constraint, to identify which transmissions are directed at the user equipment. Blind detection is required by the 3GPP LTE/LTE-Advanced standard, and it will be required in the 5th generation wireless communication standard (5G) as well. Polar codes have been selected for use in 5G: thus, the issue of blind detection of polar codes must be addressed. We propose a polar code blind detection scheme where the user ID is transmitted instead of some of the frozen bits. A first, coarse decoding phase helps select a subset of candidates that is decoded by a more powerful algorithm: an early stopping criterion is also introduced for the second decoding phase. Simulation results show good missed detection and false alarm rates, along with substantial latency gains thanks to early stopping. We then propose an architecture to implement the devised blind detection scheme, based on a tunable decoder that can be used for both phases. The architecture is synthesized and implementation results are reported for various system parameters. The reported area occupation and latency, obtained in 65 nm CMOS technology, are able to meet 5G requirements, and are guaranteed to meet them with even less resource usage in the latest technology nodes.

Submitted 4 January, 2018; originally announced January 2018.

Comments: arXiv admin note: text overlap with arXiv:1705.01864

arXiv:1712.03994 [pdf, other]

Multi-Mode Inference Engine for Convolutional Neural Networks

Authors: Arash Ardakani, Carlo Condo, Warren J. Gross

Abstract: During the past few years, interest in convolutional neural networks (CNNs) has risen constantly, thanks to their excellent performance on a wide range of recognition and classification tasks. However, they suffer from the high level of complexity imposed by the high-dimensional convolutions in convolutional layers. Within scenarios with limited hardware resources and tight power and latency constraints, the high computational complexity of CNNs makes them difficult to exploit. Hardware solutions have striven to reduce the power consumption using low-power techniques, and to limit the processing time by increasing the number of processing elements (PEs). While most ASIC designs claim a peak performance of a few hundred giga operations per second, their average performance is substantially lower when applied to state-of-the-art CNNs such as AlexNet, VGGNet and ResNet, leading to low resource utilization. Their performance efficiency is limited to less than 55% on average, which leads to unnecessarily high processing latency and silicon area. In this paper, we propose a dataflow which makes it possible to perform both the fully-connected and convolutional computations for any filter/layer size using the same PEs. We then introduce a multi-mode inference engine (MMIE) based on the proposed dataflow. Finally, we show that the proposed MMIE achieves a performance efficiency of more than 84% when performing the computations of the three renowned CNNs (i.e., AlexNet, VGGNet and ResNet), outperforming the best architecture in the state-of-the-art in terms of energy consumption, processing latency and silicon area.

Submitted 11 December, 2017; originally announced December 2017.

arXiv:1711.11096 [pdf, other]

doi 10.1109/WCNCW.2018.8368991

Improved Successive Cancellation Flip Decoding of Polar Codes Based on Error Distribution

Authors: Carlo Condo, Furkan Ercan, Warren J. Gross

Abstract: Polar codes are a class of linear block codes that provably achieves channel capacity, and have been selected as a coding scheme for $5^{\rm th}$ generation wireless communication standards. Successive-cancellation (SC) decoding of polar codes has mediocre error-correction performance on short to moderate codeword lengths: the SC-Flip decoding algorithm is one of the solutions that have been proposed to overcome this issue. On the other hand, SC-Flip has a higher implementation complexity compared to SC due to the required log-likelihood ratio (LLR) selection and sorting process. Moreover, it requires a high number of iterations to reach good error-correction performance. In this work, we propose two techniques to improve the SC-Flip decoding algorithm for low-rate codes, based on the observation of channel-induced error distributions. The first one is a fixed index selection (FIS) scheme to avoid the substantial implementation cost of LLR selection and sorting with no cost on error-correction performance. The second is an enhanced index selection (EIS) criterion to improve the error-correction performance of SC-Flip decoding. A reduction of $24.6\%$ in the implementation cost of logic elements is estimated with the FIS approach, while simulation results show that EIS leads to an error-correction performance improvement of up to $0.42$ dB at a target FER of $10^{-4}$.

Submitted 26 September, 2018; v1 submitted 29 November, 2017; originally announced November 2017.

Comments: This version of the manuscript corrects an error in the previous ArXiv version, as well as the published version in IEEE Xplore under the same title, which has the DOI:10.1109/WCNCW.2018.8368991. The corrections include all the simulations of SC-Flip-based and SC-Oracle decoders, along with associated comments in-text

arXiv:1711.11093 [pdf, other]

doi 10.1109/ICC.2018.8422464

Partitioned Successive-Cancellation Flip Decoding of Polar Codes

Authors: Furkan Ercan, Carlo Condo, Seyyed Ali Hashemi, Warren J. Gross

Abstract: Polar codes are a class of channel capacity achieving codes that has been selected for the next generation of wireless communication standards. Successive-cancellation (SC) is the first proposed decoding algorithm, suffering from mediocre error-correction performance at moderate code length. In order to improve the error-correction performance of SC, two approaches are available: (i) SC-List decoding which keeps a list of candidates by running a number of SC decoders in parallel, thus increasing the implementation complexity, and (ii) SC-Flip decoding that relies on a single SC module, and keeps the computational complexity close to SC. In this work, we propose the partitioned SC-Flip (PSCF) decoding algorithm, which outperforms SC-Flip in terms of error-correction performance and average computational complexity, leading to higher throughput and reduced energy consumption per codeword. We also introduce a partitioning scheme that best suits our PSCF decoder. Simulation results show that at equivalent frame error rate, PSCF has up to $5 \times$ less computational complexity than the SC-Flip decoder. At equivalent average number of iterations, the error-correction performance of PSCF outperforms SC-Flip by up to $0.15$ dB at frame error rate of $10^{-3}$.

Submitted 8 October, 2018; v1 submitted 29 November, 2017; originally announced November 2017.

Comments: This version of the manuscript corrects an error in the previous ArXiv version, as well as the published version in IEEE Xplore under the same title, which has the DOI:10.1109/ICC.2018.8422464. The corrections include all the simulations of SC-Flip-based and SC-Oracle decoders, along with associated comments in-text

arXiv:1708.09603 [pdf, other]

doi 10.1109/JETCAS.2017.2745704

PolarBear: A 28-nm FD-SOI ASIC for Decoding of Polar Codes

Authors: Pascal Giard, Alexios Balatsoukas-Stimming, Thomas Christoph Müller, Andrea Bonetti, Claude Thibeault, Warren J. Gross, Philippe Flatresse, Andreas Burg

Abstract: Polar codes are a recently proposed class of block codes that provably achieve the capacity of various communication channels. They received a lot of attention as they can do so with low-complexity encoding and decoding algorithms, and they have an explicit construction. Their recent inclusion in a 5G communication standard will only spur more research. However, only a couple of ASICs featuring decoders for polar codes were fabricated, and none of them implements a list-based decoding algorithm. In this paper, we present ASIC measurement results for a fabricated 28 nm CMOS chip that implements two different decoders: the first decoder is tailored toward error-correction performance and flexibility. It supports any code rate as well as three different decoding algorithms: successive cancellation (SC), SC flip and SC list (SCL). The flexible decoder can also decode both non-systematic and systematic polar codes. The second decoder targets speed and energy efficiency. We present measurement results for the first silicon-proven SCL decoder, where its coded throughput is shown to be of 306.8 Mbps with a latency of 3.34 µs and an energy per bit of 418.3 pJ/bit at a clock frequency of 721 MHz for a supply of 1.3 V. The energy per bit drops down to 178.1 pJ/bit with a more modest clock frequency of 308 MHz, lower throughput of 130.9 Mbps and a reduced supply voltage of 0.9 V. For the other two operating modes, the energy per bit is shown to be of approximately 95 pJ/bit. The less flexible high-throughput unrolled decoder can achieve a coded throughput of 9.2 Gbps and a latency of 628 ns for a measured energy per bit of 1.15 pJ/bit at 451 MHz.

Submitted 1 September, 2017; v1 submitted 31 August, 2017; originally announced August 2017.

Comments: 12 pages, 12 figures, 5 tables, to appear in IEEE Journal on Emerging and Selected Topics in Circuits and Systems
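
The energy-per-bit and throughput figures in the abstract above imply an average power for each operating point, since power is energy per bit times bits per second. The abstract does not report power directly, so the values computed below are derived estimates under the assumption that energy per bit is normalized to the coded throughput; the helper name `implied_power_mw` is hypothetical.

```python
def implied_power_mw(energy_pj_per_bit, throughput_mbps):
    """Average power in mW implied by an energy per bit and a throughput.

    pJ/bit * Mbit/s = 1e-12 J/bit * 1e6 bit/s = 1e-6 W = 1e-3 mW.
    """
    return energy_pj_per_bit * throughput_mbps * 1e-3


# Operating points quoted in the abstract (throughput converted to Mbps).
scl_fast = implied_power_mw(418.3, 306.8)   # SCL at 721 MHz, 1.3 V
scl_slow = implied_power_mw(178.1, 130.9)   # SCL at 308 MHz, 0.9 V
unrolled = implied_power_mw(1.15, 9200.0)   # unrolled decoder at 451 MHz

print(round(scl_fast, 1), round(scl_slow, 1), round(unrolled, 1))
```

Under that assumption, the SCL decoder draws roughly 128 mW at the faster operating point and about 23 mW at the slower one, while the unrolled decoder draws around 11 mW.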

arXiv:1708.04706 [pdf, ps, other]

On Error-Correction Performance and Implementation of Polar Code List Decoders for 5G

Authors: Furkan Ercan, Carlo Condo, Seyyed Ali Hashemi, Warren J. Gross

Abstract: Polar codes are a class of capacity achieving error correcting codes that has been recently selected for the next generation of wireless communication standards (5G). Polar code decoding algorithms have evolved in various directions, striking different balances between error-correction performance, speed and complexity. Successive-cancellation list (SCL) and its incarnations constitute a powerful, well-studied set of algorithms, in constant improvement. At the same time, different implementation approaches provide a wide range of area occupations and latency results. 5G puts a focus on improved error-correction performance, high throughput and low power consumption: a comprehensive study considering all these metrics is currently lacking in literature. In this work, we evaluate SCL-based decoding algorithms in terms of error-correction performance and compare them to low-density parity-check (LDPC) codes. Moreover, we consider various decoder implementations, for both polar and LDPC codes, and compare their area occupation and power and energy consumption when targeting short code lengths and rates. Our work shows that among SCL-based decoders, the partitioned SCL (PSCL) provides the lowest area occupation and power consumption, whereas fast simplified SCL (Fast-SSCL) yields the lowest energy consumption. Compared to LDPC decoder architectures, different SCL implementations occupy up to 17.1x less area, dissipate up to 7.35x less power, and up to 26x less energy.

Submitted 12 October, 2017; v1 submitted 15 August, 2017; originally announced August 2017.

Comments: Accepted in 55th Annual Allerton Conference on Communication, Control, and Computing

arXiv:1706.07043 [pdf, other]

doi 10.1109/JSTSP.2017.2788405

Deep Learning Methods for Improved Decoding of Linear Codes

Authors: Eliya Nachmani, Elad Marciano, Loren Lugosch, Warren J. Gross, David Burshtein, Yair Beery

Abstract: The problem of low complexity, close to optimal, channel decoding of linear codes with short to moderate block length is considered. It is shown that deep learning methods can be used to improve a standard belief propagation decoder, despite the large example space. Similar improvements are obtained for the min-sum algorithm. It is also shown that tying the parameters of the decoders across iterations, so as to form a recurrent neural network architecture, can be implemented with comparable results. The advantage is that significantly less parameters are required. We also introduce a recurrent neural decoder architecture based on the method of successive relaxation. Improvements over standard belief propagation are also observed on sparser Tanner graph representations of the codes. Furthermore, we demonstrate that the neural belief propagation decoder can be used to improve the performance, or alternatively reduce the computational complexity, of a close to optimal decoder of short BCH codes.

Submitted 1 January, 2018; v1 submitted 21 June, 2017; originally announced June 2017.

Comments: Accepted To IEEE Journal Of Selected Topics In Signal Processing

Showing 1–50 of 86 results for author: Gross, W