-
High Throughput Polar Code Decoders with Information Bottleneck Quantization
Authors:
Claus Kestel,
Lucas Johannsen,
Norbert Wehn
Abstract:
In digital baseband processing, the forward error correction (FEC) unit is among the most demanding components in terms of computational complexity and power consumption. Hence, efficient implementation of FEC decoders is crucial for next generation mobile broadband standards and an ongoing research topic. Quantization has a significant impact on the decoder area, power consumption and throughput. Thus, lower bit widths are preferred for efficient implementations but degrade the error-correction capability. To address this issue, a non-uniform quantization based on the Information Bottleneck (IB) method was proposed that enables a low bit width while maintaining the essential information. Many investigations on the use of the IB method for low-density parity-check (LDPC) code decoders exist and have shown its advantages from an implementation perspective. However, for polar code decoder implementations, there exists only one publication that is not based on the state-of-the-art Fast-SSC decoding algorithm, and only synthesis implementation results without energy estimation are shown. In contrast, our paper presents several optimized Fast Simplified Successive-Cancellation (Fast-SSC) polar code decoder implementations using IB-based quantization with placement and routing results in an advanced 12 nm FinFET technology. Gains of up to 16% in area and 13% in energy efficiency are achieved with IB-based quantization at a Frame Error Rate (FER) of 10^-7 and a polar code of N = 1024, R = 0.5 compared to state-of-the-art decoders.
Submitted 3 June, 2024;
originally announced June 2024.
-
CNN-Based Equalization for Communications: Achieving Gigabit Throughput with a Flexible FPGA Hardware Architecture
Authors:
Jonas Ney,
Christoph Füllner,
Vincent Lauinger,
Laurent Schmalen,
Sebastian Randel,
Norbert Wehn
Abstract:
To satisfy the growing throughput demand of data-intensive applications, the performance of optical communication systems increased dramatically in recent years. With higher throughput, more advanced equalizers are crucial to compensate for impairments caused by inter-symbol interference (ISI). The latest research shows that artificial neural network (ANN)-based equalizers are promising candidates to replace traditional algorithms for high-throughput communications. On the other hand, not only throughput but also flexibility is a main objective of beyond-5G and 6G communication systems. Field programmable gate arrays (FPGAs) are a platform that is able to satisfy the strict throughput and flexibility requirements of modern communication systems. Thus, in this work, we present a high-performance FPGA implementation of an ANN-based equalizer, which meets the throughput requirements of modern optical communication systems. Further, our architecture is highly flexible since it includes a variable degree of parallelism (DOP) and therefore can also be applied to low-cost or low-power applications, which is demonstrated for a magnetic recording channel. The implementation is based on a cross-layer design approach featuring optimizations from the algorithm down to the hardware architecture, including a detailed quantization analysis. Moreover, we present a framework to reduce the latency of the ANN-based equalizer under given throughput constraints. As a result, the bit error ratio (BER) of our equalizer for the optical fiber channel is around four times lower than that of a conventional one, while the corresponding FPGA implementation achieves a throughput of more than 40 GBd, outperforming a high-performance graphics processing unit (GPU) by three orders of magnitude for a similar batch size.
Submitted 22 April, 2024;
originally announced May 2024.
-
Error Detection and Correction Codes for Safe In-Memory Computations
Authors:
Luca Parrini,
Taha Soliman,
Benjamin Hettwer,
Jan Micha Borrmann,
Simranjeet Singh,
Ankit Bende,
Vikas Rana,
Farhad Merchant,
Norbert Wehn
Abstract:
In-Memory Computing (IMC) introduces a new paradigm of computation that offers high efficiency in terms of latency and power consumption for AI accelerators. However, the non-idealities and defects of emerging technologies used in advanced IMC can severely degrade the accuracy of inferred Neural Networks (NN) and lead to malfunctions in safety-critical applications. In this paper, we investigate an architectural-level mitigation technique based on the coordinated action of multiple checksum codes, to detect and correct errors at run-time. This implementation demonstrates higher efficiency in recovering accuracy across different AI algorithms and technologies compared to more traditional methods such as Triple Modular Redundancy (TMR). The results show that several configurations of our implementation recover more than 91% of the original accuracy with less than half of the area required by TMR and less than 40% of latency overhead.
Submitted 15 April, 2024;
originally announced April 2024.
-
Real-Time FPGA Demonstrator of ANN-Based Equalization for Optical Communications
Authors:
Jonas Ney,
Patrick Matalla,
Vincent Lauinger,
Laurent Schmalen,
Sebastian Randel,
Norbert Wehn
Abstract:
In this work, we present a high-throughput field programmable gate array (FPGA) demonstrator of an artificial neural network (ANN)-based equalizer. The equalization is performed and illustrated in real-time for a 30 GBd, two-level pulse amplitude modulation (PAM2) optical communication system.
Submitted 23 February, 2024;
originally announced February 2024.
-
Lessons Learned from Designing an Open-Source Automated Feedback System for STEM Education
Authors:
Steffen Steinert,
Lars Krupp,
Karina E. Avila,
Anke S. Janssen,
Verena Ruf,
David Dzsotjan,
Christian De Schryver,
Jakob Karolus,
Stefan Ruzika,
Karen Joisten,
Paul Lukowicz,
Jochen Kuhn,
Norbert Wehn,
Stefan Küchemann
Abstract:
As distance learning becomes increasingly important and artificial intelligence tools continue to advance, automated systems for individual learning have attracted significant attention. However, the scarcity of open-source online tools that are capable of providing personalized feedback has restricted the widespread implementation of research-based feedback systems. In this work, we present RATsApp, an open-source automated feedback system (AFS) that incorporates research-based features such as formative feedback. The system focuses on core STEM competencies such as mathematical competence, representational competence, and data literacy. It also allows lecturers to monitor students' progress. We conducted a survey based on the technology acceptance model (TAM2) among a set of students (N=64). Our findings confirm the applicability of the TAM2 framework, revealing that factors such as the relevance of the studies, output quality, and ease of use significantly influence the perceived usefulness. We also found a linear relation between the perceived usefulness and the intention to use, which in turn is a significant predictor of the frequency of use. Moreover, the formative feedback feature of RATsApp received positive feedback, indicating its potential as an educational tool. Furthermore, as an open-source platform, RATsApp encourages public contributions to its ongoing development, fostering a collaborative approach to improve educational tools.
Submitted 19 January, 2024;
originally announced January 2024.
-
Fully-blind Neural Network Based Equalization for Severe Nonlinear Distortions in 112 Gbit/s Passive Optical Networks
Authors:
Vincent Lauinger,
Patrick Matalla,
Jonas Ney,
Norbert Wehn,
Sebastian Randel,
Laurent Schmalen
Abstract:
We demonstrate and evaluate a fully-blind digital signal processing (DSP) chain for 100G passive optical networks (PONs), and analyze different equalizer topologies based on neural networks with low hardware complexity.
Submitted 17 January, 2024;
originally announced January 2024.
-
Row-Merged Polar Codes: Analysis, Design and Decoder Implementation
Authors:
Andreas Zunker,
Marvin Geiselhart,
Lucas Johannsen,
Claus Kestel,
Stephan ten Brink,
Timo Vogt,
Norbert Wehn
Abstract:
Row-merged polar codes are a family of pre-transformed polar codes (PTPCs) with little precoding overhead. Providing an improved distance spectrum over plain polar codes, they are capable of performing close to the finite-length capacity bounds. However, there is still a lack of efficient design procedures for row-merged polar codes. Using novel weight enumeration algorithms with low computational complexity, we propose a design methodology for row-merged polar codes that directly considers their minimum distance properties. The codes significantly outperform state-of-the-art cyclic redundancy check (CRC)-aided polar codes under successive cancellation list (SCL) decoding in error-correction performance. Furthermore, we present fast simplified successive cancellation list (Fast-SSCL) decoding of PTPCs, based on which we derive a high-throughput, unrolled architecture template for fully pipelined decoders. Implementation results of SCL decoders for row-merged polar codes in a 12 nm technology additionally demonstrate the superiority of these codes with respect to the implementation costs, compared to state-of-the-art reference decoder implementations.
Submitted 22 December, 2023;
originally announced December 2023.
-
A Mapping of Triangular Block Interleavers to DRAM for Optical Satellite Communication
Authors:
Lukas Steiner,
Timo Lehnigk-Emden,
Markus Fehrenz,
Norbert Wehn
Abstract:
Communication in optical downlinks of low earth orbit (LEO) satellites requires interleaving to enable reliable data transmission. These interleavers are orders of magnitude larger than conventional interleavers utilized for example in wireless communication. Hence, the capacity of on-chip memories (SRAMs) is insufficient to store all symbols and external memories (DRAMs) must be used. Due to the overall requirement for very high data rates beyond 100 Gbit/s, DRAM bandwidth then quickly becomes a critical bottleneck of the communication system. In this paper, we investigate triangular block interleavers for the aforementioned application and show that the standard mapping of symbols used for SRAMs results in low bandwidth utilization for DRAMs, in some cases below 50 %. As a solution, we present a novel mapping approach that combines different optimizations and achieves over 90 % bandwidth utilization in all tested configurations. Further, the mapping can be applied to any JEDEC-compliant DRAM device.
Submitted 4 December, 2023;
originally announced December 2023.
-
Efficient Hardware Implementation of Constant Time Sampling for HQC
Authors:
Maximilian Schöffel,
Johannes Feldmann,
Norbert Wehn
Abstract:
HQC is one of the code-based finalists in the last round of the NIST post quantum cryptography standardization process. In this process, security and implementation efficiency are key metrics for the selection of the candidates. A critical compute kernel with respect to efficient hardware implementations and security in HQC is the sampling method used to derive random numbers. Due to its security criticality, recently an updated sampling algorithm was presented to increase its robustness against side-channel attacks.
In this paper, we pursue a cross-layer approach to optimize this new sampling algorithm to enable an efficient hardware implementation without compromising the original algorithmic security and side-channel attack robustness.
We compare our cross layer based implementation to a direct hardware implementation of the original algorithm and to optimized implementations of the previous sampler version. All implementations are evaluated using the Xilinx Artix 7 FPGA. Our results show that our approach reduces the latency by a factor of 24 compared to the original algorithm and by a factor of 28 compared to the previously used sampler with significantly less resources.
Submitted 28 September, 2023;
originally announced September 2023.
-
Variational Quantum Linear Solver enhanced Quantum Support Vector Machine
Authors:
Jianming Yi,
Kalyani Suresh,
Ali Moghiseh,
Norbert Wehn
Abstract:
Quantum Support Vector Machines (QSVM) play a vital role in using quantum resources for supervised machine learning tasks, such as classification. However, current methods are strongly limited in terms of scalability on Noisy Intermediate Scale Quantum (NISQ) devices. In this work, we propose a novel approach called the Variational Quantum Linear Solver (VQLS) enhanced QSVM. This is built upon our idea of utilizing the variational quantum linear solver to solve the system of linear equations of a least squares-SVM on a NISQ device. The implementation of our approach is evaluated by an extensive series of numerical experiments with the Iris dataset, which consists of three distinct iris plant species. Based on this, we explore the practicality and effectiveness of our algorithm by constructing a classifier capable of classification in a feature space ranging from one to seven dimensions. Furthermore, by strategically exploiting both classical and quantum computing for various subroutines of our algorithm, we effectively mitigate practical challenges associated with the implementation. These include a significant improvement in the trainability of the variational ansatz and notable reductions in run-time for cost calculations. Based on the numerical experiments, our approach exhibits the capability of identifying a separating hyperplane in an 8-dimensional feature space. Moreover, it consistently demonstrated strong performance across various instances with the same dataset.
Submitted 14 September, 2023;
originally announced September 2023.
-
Successive Cancellation Automorphism List Decoding of Polar Codes
Authors:
Lucas Johannsen,
Claus Kestel,
Marvin Geiselhart,
Timo Vogt,
Stephan ten Brink,
Norbert Wehn
Abstract:
The discovery of suitable automorphisms of polar codes gained a lot of attention by applying them in Automorphism Ensemble Decoding (AED) to improve the error-correction performance, especially for short block lengths. This paper introduces Successive Cancellation Automorphism List (SCAL) decoding of polar codes as a novel application of automorphisms in advanced Successive Cancellation List (SCL) decoding. Initialized with L permutations sampled from the automorphism group, a superposition of different noise realizations and path splitting takes place inside the decoder. In this way, the SCAL decoder automatically adapts to the channel conditions and outperforms the error-correction performance of conventional SCL decoding and AED. For a polar code of length 128, SCAL performs near Maximum Likelihood (ML) decoding with L=8, in contrast to M=16 needed decoder cores in AED. Application-Specific Integrated Circuit (ASIC) implementations in a 12 nm technology show that high-throughput, pipelined SCAL decoders outperform AED in terms of energy efficiency and power density, and SCL decoders additionally in area efficiency.
Submitted 28 June, 2023;
originally announced June 2023.
-
Unsupervised ANN-Based Equalizer and Its Trainable FPGA Implementation
Authors:
Jonas Ney,
Vincent Lauinger,
Laurent Schmalen,
Norbert Wehn
Abstract:
In recent years, communication engineers put strong emphasis on artificial neural network (ANN)-based algorithms with the aim of increasing the flexibility and autonomy of the system and its components. In this context, unsupervised training is of special interest as it enables adaptation without the overhead of transmitting pilot symbols. In this work, we present a novel ANN-based, unsupervised equalizer and its trainable field programmable gate array (FPGA) implementation. We demonstrate that our custom loss function allows the ANN to adapt for varying channel conditions, approaching the performance of a supervised baseline. Furthermore, as a first step towards a practical communication system, we design an efficient FPGA implementation of our proposed algorithm, which achieves a throughput in the order of Gbit/s, outperforming a high-performance GPU by a large margin.
Submitted 28 July, 2023; v1 submitted 14 April, 2023;
originally announced April 2023.
-
A Hybrid Approach combining ANN-based and Conventional Demapping in Communication for Efficient FPGA-Implementation
Authors:
Jonas Ney,
Bilal Hammoud,
Norbert Wehn
Abstract:
In communication systems, Autoencoder (AE) refers to the concept of replacing parts of the transmitter and receiver by artificial neural networks (ANNs) to train the system end-to-end over a channel model. This approach aims to improve communication performance, especially for varying channel conditions, at the cost of high computational complexity for training and inference. Field-programmable gate arrays (FPGAs) have been shown to be a suitable platform for energy-efficient ANN implementation. However, the high number of operations and the large model size of ANNs limit the performance on resource-constrained devices, which is critical for low-latency and high-throughput communication systems. To tackle this challenge, we propose a novel approach for efficient ANN-based remapping on FPGAs, which combines the adaptability of the AE with the efficiency of conventional demapping algorithms. After adaptation to channel conditions, the channel characteristics, implicitly learned by the ANN, are extracted to enable the use of optimized conventional demapping algorithms for inference. We validate the hardware efficiency of our approach by providing FPGA implementation results and by comparing the communication performance to that of conventional systems. Our work opens a door for the practical application of ANN-based communication algorithms on FPGAs.
Submitted 11 April, 2023;
originally announced April 2023.
-
Automorphism Ensemble Polar Code Decoders for 6G URLLC
Authors:
Claus Kestel,
Marvin Geiselhart,
Lucas Johannsen,
Stephan ten Brink,
Norbert Wehn
Abstract:
The URLLC scenario in the upcoming 6G standard requires low latency and ultra-reliable transmission, i.e., error correction towards ML performance. Achieving near-ML performance is very challenging, especially for short block lengths. Polar codes are a promising candidate and already part of the 5G standard. The Successive Cancellation List (SCL) decoding algorithm provides very good error correction performance but at the cost of high computational decoding complexity, resulting in large latency and low area and energy efficiency. Recently, Automorphism Ensemble Decoding (AED) gained a lot of attention to improve the error correction capability. In contrast to SCL, AED performs several low-complexity (e.g., SC) decodings in parallel. However, it is an open question whether AED can compete with sophisticated SCL decoders, especially from an implementation perspective in state-of-the-art silicon technologies. In this paper we present an elaborated AED architecture that uses an advanced path metric based candidate selection to reduce the implementation complexity and compare it to state-of-the-art SCL decoders in a 12 nm FinFET technology. Our AED implementation outperforms state-of-the-art SCL decoders by up to 4.4x in latency, 8.9x in area efficiency and 4.6x in energy efficiency, while providing the same or even better error correction performance.
Submitted 2 March, 2023;
originally announced March 2023.
-
Code-based Cryptography in IoT: A HW/SW Co-Design of HQC
Authors:
Maximilian Schöffel,
Johannes Feldmann,
Norbert Wehn
Abstract:
Recent advances in quantum computing pose a serious threat to the security of widely used public-key cryptosystems. Thus, new post-quantum cryptographic algorithms have been proposed as part of the associated US NIST process to enable secure, encrypted communication in the age of quantum computing. Many hardware accelerators for structured lattice-based algorithms have already been published to meet the strict power, area and latency requirements of low-power IoT edge devices. However, the security of these algorithms is still uncertain. Currently, many new attacks against the lattice structure are being investigated to judge their security. In contrast, code-based algorithms, which rely on deeply explored security metrics and are appealing candidates in the NIST process, have not yet been investigated to the same depth in the context of IoT due to the computational complexity and memory footprint of state-of-the-art software implementations.
In this paper, we present to the best of our knowledge the first HW/SW co-design based implementation of the code-based Hamming Quasi Cyclic Key-Encapsulation Mechanism. We profile and evaluate this algorithm in order to explore the trade-off between software optimizations, tightly coupled hardware acceleration by instruction set extension and modular, loosely coupled accelerators. We provide detailed results on the energy consumption and performance of our design and compare it to existing implementations of lattice- and code-based algorithms. The design was implemented in two technologies: FPGA and ASIC. Our results show that code-based algorithms are valid alternatives in low-power IoT from an implementation perspective.
Submitted 12 January, 2023;
originally announced January 2023.
-
Unveiling the Real Performance of LPDDR5 Memories
Authors:
Lukas Steiner,
Matthias Jung,
Michael Huonker,
Norbert Wehn
Abstract:
LPDDR5 is the latest low-power DRAM standard and expected to be used in various application fields. The vendors have published promising peak bandwidths up to 50 % higher than those of the predecessor LPDDR4. In this paper we evaluate the best-case and worst-case real bandwidth utilization of different LPDDR5 configurations and compare the results to corresponding LPDDR4 configurations. We also show that an upgrade from LPDDR4 to LPDDR5 does not always bring a bandwidth advantage and that some LPDDR5 configurations should be avoided for specific workloads.
Submitted 28 September, 2022;
originally announced September 2022.
-
A Framework for Formal Verification of DRAM Controllers
Authors:
Lukas Steiner,
Chirag Sudarshan,
Matthias Jung,
Dominik Stoffel,
Norbert Wehn
Abstract:
The large number of recent JEDEC DRAM standard releases and their increasing feature set make it difficult for designers to rapidly upgrade the memory controller IPs to each new standard. The hardware verification is especially challenging due to the higher protocol complexity of standards like DDR5, LPDDR5 or HBM3 in comparison with their predecessors. With traditional simulation-based verification it is laborious to guarantee the coverage of all possible states, especially for control flow rich memory controllers. This has a direct impact on the time-to-market. A promising alternative is formal verification because it makes it possible to ensure protocol compliance based on mathematical proofs. However, with regard to memory controllers, no fully-automated verification process has been presented in the state-of-the-art yet, which means there is still a potential risk of human error. In this paper we present a framework that automatically generates SystemVerilog Assertions for a DRAM protocol. In addition, we show how the framework can be used efficiently for different tasks of memory controller development.
Submitted 28 September, 2022;
originally announced September 2022.
-
Blind and Channel-agnostic Equalization Using Adversarial Networks
Authors:
Vincent Lauinger,
Manuel Hoffmann,
Jonas Ney,
Norbert Wehn,
Laurent Schmalen
Abstract:
Due to the rapid development of autonomous driving, the Internet of Things and streaming services, modern communication systems have to cope with varying channel conditions and a steadily rising number of users and devices. This, and the still rising bandwidth demands, can only be met by intelligent network automation, which requires highly flexible and blind transceiver algorithms. To tackle those challenges, we propose a novel adaptive equalization scheme, which exploits the prosperous advances in deep learning by training an equalizer with an adversarial network. The learning is only based on the statistics of the transmit signal, so it is blind regarding the actual transmit symbols and agnostic to the channel model. The proposed approach is independent of the equalizer topology and enables the application of powerful neural network based equalizers. In this work, we prove this concept in simulations of different -- both linear and nonlinear -- transmission channels and demonstrate the capability of the proposed blind learning scheme to approach the performance of non-blind equalizers. Furthermore, we provide a theoretical perspective and highlight the challenges of the approach.
Submitted 15 September, 2022;
originally announced September 2022.
-
HALF: Holistic Auto Machine Learning for FPGAs
Authors:
Jonas Ney,
Dominik Loroch,
Vladimir Rybalkin,
Nico Weber,
Jens Krüger,
Norbert Wehn
Abstract:
Deep Neural Networks (DNNs) are capable of solving complex problems in domains related to embedded systems, such as image and natural language processing. To efficiently implement DNNs on a specific FPGA platform for a given cost criterion, e.g. energy efficiency, an enormous number of design parameters have to be considered, from the topology down to the final hardware implementation. Interdependencies between the different design layers have to be taken into account and explored efficiently, making it hardly possible to find optimized solutions manually. An automatic, holistic design approach can improve the quality of DNN implementations on FPGAs significantly. To this end, we present a cross-layer design space exploration methodology. It comprises optimizations starting from a hardware-aware topology search for DNNs down to the final optimized implementation for a given FPGA platform. The methodology is implemented in our Holistic Auto machine Learning for FPGAs (HALF) framework, which combines an evolutionary search algorithm, various optimization steps and a library of parametrizable hardware DNN modules. HALF automates both the exploration process and the implementation of optimized solutions on a target FPGA platform for various applications. We demonstrate the performance of HALF on a medical use case for arrhythmia detection for three different design goals, i.e. low-energy, low-power and high-throughput, respectively. Our FPGA implementation outperforms a TensorRT optimized model on an Nvidia Jetson platform in both throughput and energy consumption.
Submitted 20 October, 2021; v1 submitted 28 June, 2021;
originally announced June 2021.
-
The gem5 Simulator: Version 20.0+
Authors:
Jason Lowe-Power,
Abdul Mutaal Ahmad,
Ayaz Akram,
Mohammad Alian,
Rico Amslinger,
Matteo Andreozzi,
Adrià Armejach,
Nils Asmussen,
Brad Beckmann,
Srikant Bharadwaj,
Gabe Black,
Gedare Bloom,
Bobby R. Bruce,
Daniel Rodrigues Carvalho,
Jeronimo Castrillon,
Lizhong Chen,
Nicolas Derumigny,
Stephan Diestelhorst,
Wendy Elsasser,
Carlos Escuin,
Marjan Fariborz,
Amin Farmahini-Farahani,
Pouya Fotouhi,
Ryan Gambord,
Jayneel Gandhi
, et al. (53 additional authors not shown)
Abstract:
The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research. This simulation infrastructure allows researchers to model modern computer hardware at the cycle level, and it has enough fidelity to boot unmodified Linux-based operating systems and run full applications for multiple architectures including x86, Arm, and RISC-V. The gem5 simulator has been under active development over the last nine years since the original gem5 release. In this time, there have been over 7500 commits to the codebase from over 250 unique contributors, which have improved the simulator by adding new features, fixing bugs, and increasing the code quality. In this paper, we give an overview of gem5's usage and features, describe the current state of the gem5 simulator, and enumerate the major changes since the initial release of gem5. We also discuss how the gem5 simulator has transitioned to a formal governance model to enable continued improvement and community support for the next 20 years of computer architecture research.
Submitted 29 September, 2020; v1 submitted 6 July, 2020;
originally announced July 2020.
-
eBrainII: A 3 kW Realtime Custom 3D DRAM integrated ASIC implementation of a Biologically Plausible Model of a Human Scale Cortex
Authors:
Dimitrios Stathis,
Chirag Sudarshan,
Yu Yang,
Matthias Jung,
Syed Asad Mohamad Hasan Jafri,
Christian Weis,
Ahmed Hemani,
Anders Lansner,
Norbert Wehn
Abstract:
Artificial Neural Networks (ANNs) like CNN/DNN and LSTM are not biologically plausible and, in spite of their initial success, cannot attain the cognitive capabilities enabled by the dynamic hierarchical associative memory systems of biological brains. Biologically plausible spiking brain models, e.g. of cortex, basal ganglia and amygdala, have a greater potential to achieve biological-brain-like cognitive capabilities. Bayesian Confidence Propagation Neural Network (BCPNN) is a biologically plausible spiking model of cortex. A human scale model of BCPNN in real time requires 162 TFlops/s, 50 TBs of synaptic weight storage to be accessed with a bandwidth of 200 TBs. The spiking bandwidth is relatively modest at 250 GBs/s. A hand-optimized implementation of rodent scale BCPNN on Tesla K80 GPUs requires 3 kW; from that, we extrapolate that a human scale network will require 3 MW. These power numbers rule out such implementations for field deployment as advanced cognition engines in embedded systems. The key innovation that this paper reports is that it is feasible and affordable to implement real time BCPNN as a custom tiled ASIC in 28 nm technology with custom 3D DRAM - eBrain II - that consumes 3 kW for human scale and 12 W for rodent scale cortex model. Such implementations eminently fulfill the demands for field deployment.
Submitted 3 November, 2019;
originally announced November 2019.
-
A Reduced-Complexity Projection Algorithm for ADMM-based LP Decoding
Authors:
Florian Gensheimer,
Tobias Dietz,
Kira Kraft,
Stefan Ruzika,
Norbert Wehn
Abstract:
The Alternating Direction Method of Multipliers has recently been adapted for Linear Programming Decoding of Low-Density Parity-Check codes. The computation of the projection onto the parity polytope is the core of this algorithm and usually involves a sorting operation, which is the main effort of the projection.
In this paper, we present an algorithm with low complexity to compute this projection. The algorithm relies on new findings in the recursive structure of the parity polytope and iteratively fixes selected components. It requires up to 37% fewer arithmetic operations compared to state-of-the-art projections. Additionally, it does not involve a sorting operation, which is needed in all exact state-of-the-art projection algorithms. These two benefits make it appealing for efficient hardware and software implementations.
Submitted 10 January, 2019;
originally announced January 2019.
-
Sparsity in Deep Neural Networks - An Empirical Investigation with TensorQuant
Authors:
Dominik Marek Loroch,
Franz-Josef Pfreundt,
Norbert Wehn,
Janis Keuper
Abstract:
Deep learning is finding its way into the embedded world with applications such as autonomous driving, smart sensors and augmented reality. However, the computation of deep neural networks is demanding in energy, compute power and memory. Various approaches have been investigated to reduce the necessary resources, one of which is to leverage the sparsity occurring in deep neural networks due to the high levels of redundancy in the network parameters. It has been shown that sparsity can be promoted specifically and the achieved sparsity can be very high. But in many cases the methods are evaluated on rather small topologies. It is not clear if the results transfer onto deeper topologies. In this paper, the TensorQuant toolbox has been extended to offer a platform to investigate sparsity, especially in deeper models. Several practically relevant topologies for varying classification problem sizes are investigated to show the differences in sparsity for activations, weights and gradients.
Submitted 27 August, 2018;
originally announced August 2018.
-
FINN-L: Library Extensions and Design Trade-off Analysis for Variable Precision LSTM Networks on FPGAs
Authors:
Vladimir Rybalkin,
Alessandro Pappalardo,
Muhammad Mohsin Ghaffar,
Giulio Gambardella,
Norbert Wehn,
Michaela Blott
Abstract:
It is well known that many types of artificial neural networks, including recurrent networks, can achieve a high classification accuracy even with low-precision weights and activations. The reduction in precision generally yields much more efficient hardware implementations with regard to hardware cost, memory requirements, energy, and achievable throughput. In this paper, we present the first systematic exploration of this design space as a function of precision for Bidirectional Long Short-Term Memory (BiLSTM) neural networks. Specifically, we include an in-depth investigation of precision vs. accuracy using a fully hardware-aware training flow, where during training the quantization of all aspects of the network, including weights, input, output and in-memory cell activations, is taken into consideration. In addition, hardware resource cost, power consumption and throughput scalability are explored as a function of precision for FPGA-based implementations of BiLSTM, and multiple approaches of parallelizing the hardware. We provide the first open source HLS library extension of FINN for parameterizable hardware architectures of LSTM layers on FPGAs, which offers full precision flexibility and allows for parameterizable performance scaling offering different levels of parallelism within the architecture. Based on this library, we present an FPGA-based accelerator for a BiLSTM neural network designed for optical character recognition, along with numerous other experimental proof points for a Zynq UltraScale+ XCZU7EV MPSoC within the given design space.
Submitted 11 July, 2018;
originally announced July 2018.
-
Integrating DRAM Power-Down Modes in gem5 and Quantifying their Impact
Authors:
Radhika Jagtap,
Matthias Jung,
Wendy Elsasser,
Christian Weis,
Andreas Hansson,
Norbert Wehn
Abstract:
Across applications, DRAM is a significant contributor to the overall system power, with the DRAM access energy per bit up to three orders of magnitude higher compared to on-chip memory accesses. To improve the power efficiency, DRAM technology incorporates multiple power-down modes, each with different trade-offs between achievable power savings and performance impact due to entry and exit delay requirements. Accurate modeling of these low power modes and entry and exit control is crucial to analyze the trade-offs across controller configurations and workloads with varied memory access characteristics. To address this, we integrate the power-down modes into the DRAM controller model in the open-source simulator gem5. This is the first publicly available full-system simulator with DRAM power-down modes, providing the research community a tool for DRAM power analysis for a breadth of use cases. We validate the power-down functionality with sweep tests, which trigger defined memory access characteristics. We further evaluate the model with real HPC workloads, illustrating the value of integrating low power functionality into a full system simulator.
Submitted 20 March, 2018;
originally announced March 2018.
-
TensorQuant - A Simulation Toolbox for Deep Neural Network Quantization
Authors:
Dominik Marek Loroch,
Norbert Wehn,
Franz-Josef Pfreundt,
Janis Keuper
Abstract:
Recent research implies that training and inference of deep neural networks (DNN) can be computed with low precision numerical representations of the training/test data, weights and gradients without a general loss in accuracy. The benefit of such compact representations is twofold: they allow a significant reduction of the communication bottleneck in distributed DNN training and faster neural network implementations on hardware accelerators like FPGAs. Several quantization methods have been proposed to map the original 32-bit floating point problem to low-bit representations. While most related publications validate the proposed approach on a single DNN topology, it appears evident that the optimal choice of the quantization method and number of coding bits is topology dependent. So far, there is no general theory available that would allow users to derive the optimal quantization during the design of a DNN topology. In this paper, we present a quantization toolbox for the TensorFlow framework. TensorQuant allows a transparent quantization simulation of existing DNN topologies during training and inference. TensorQuant supports generic quantization methods and allows experimental evaluation of the impact of the quantization on single layers as well as on the full topology. In a first series of experiments with TensorQuant, we show an analysis of fixed-point quantizations of popular CNN topologies.
Submitted 13 October, 2017;
originally announced October 2017.
-
On Complexity, Energy- and Implementation-Efficiency of Channel Decoders
Authors:
Frank Kienle,
Norbert Wehn,
Heinrich Meyr
Abstract:
Future wireless communication systems require efficient and flexible baseband receivers. Meaningful efficiency metrics are key for design space exploration to quantify the algorithmic and the implementation complexity of a receiver. Most of the currently established efficiency metrics are based on counting operations, thus neglecting important issues like data and storage complexity. In this paper we introduce suitable energy and area efficiency metrics which resolve the aforementioned disadvantages. These are decoded information bit per energy and throughput per area unit. Efficiency metrics are assessed by various implementations of turbo decoders, LDPC decoders and convolutional decoders. New exploration methodologies are presented, which permit an appropriate benchmarking of implementation efficiency, communications performance, and flexibility trade-offs. These exploration methodologies are based on efficiency trajectories rather than a single snapshot metric as done in state-of-the-art approaches.
Submitted 19 March, 2010;
originally announced March 2010.
-
A Separation Algorithm for Improved LP-Decoding of Linear Block Codes
Authors:
Akin Tanatmis,
Stefan Ruzika,
Horst W. Hamacher,
Mayur Punekar,
Frank Kienle,
Norbert Wehn
Abstract:
Maximum Likelihood (ML) decoding is the optimal decoding algorithm for arbitrary linear block codes and can be written as an Integer Programming (IP) problem. Feldman et al. relaxed this IP problem and presented a Linear Programming (LP) based decoding algorithm for linear block codes. In this paper, we propose a new IP formulation of the ML decoding problem and solve the IP with generic methods. The formulation uses indicator variables to detect violated parity checks. We derive Gomory cuts from our formulation and use them in a separation algorithm to find ML codewords. We further propose an efficient method of finding cuts induced by redundant parity checks (RPC). Under certain circumstances we can guarantee that these RPC cuts are valid and cut off the fractional optimal solutions of LP decoding. We demonstrate on two LDPC codes and one BCH code that our separation algorithm performs significantly better than LP decoding.
Submitted 13 December, 2008;
originally announced December 2008.