Search | arXiv e-print repository

Fast and Low-Cost Approximate Multiplier for FPGAs using Dynamic Reconfiguration

Authors: Shervin Vakili, Mobin Vaziri, Amirhossein Zarei, J. M. Pierre Langlois

Abstract: Multipliers are widely-used arithmetic operators in digital signal processing and machine learning circuits. Due to their relatively high complexity, they can have high latency and be a significant source of power consumption. One strategy to alleviate these limitations is to use approximate computing. This paper thus introduces an original FPGA-based approximate multiplier specifically optimized… ▽ More Multipliers are widely-used arithmetic operators in digital signal processing and machine learning circuits. Due to their relatively high complexity, they can have high latency and be a significant source of power consumption. One strategy to alleviate these limitations is to use approximate computing. This paper thus introduces an original FPGA-based approximate multiplier specifically optimized for machine learning computations. It utilizes dynamically reconfigurable lookup table (LUT) primitives in AMD-Xilinx technology to realize the core part of the computations. The paper provides an in-depth analysis of the hardware architecture, implementation outcomes, and accuracy evaluations of the multiplier proposed in INT8 precision. Implementation results on an AMD-Xilinx Kintex Ultrascale+ FPGA demonstrate remarkable savings of 64% and 67% in LUT utilization for signed multiplication and multiply-and-accumulation configurations, respectively, when compared to the standard Xilinx multiplier core. Accuracy measurements on four popular deep learning (DL) benchmarks indicate a minimal average accuracy decrease of less than 0.29% during post-training deployment, with the maximum reduction staying less than 0.33%. The source code of this work is available on GitHub. △ Less

Submitted 16 October, 2023; originally announced October 2023.

Comments: 5 figures, 3 tables

arXiv:2306.12266 [pdf, other]

Combining multi-spectral data with statistical and deep-learning models for improved exoplanet detection in direct imaging at high contrast

Authors: Olivier Flasseur, Théo Bodrito, Julien Mairal, Jean Ponce, Maud Langlois, Anne-Marie Lagrange

Abstract: Exoplanet detection by direct imaging is a difficult task: the faint signals from the objects of interest are buried under a spatially structured nuisance component induced by the host star. The exoplanet signals can only be identified when combining several observations with dedicated detection algorithms. In contrast to most of existing methods, we propose to learn a model of the spatial, tempor… ▽ More Exoplanet detection by direct imaging is a difficult task: the faint signals from the objects of interest are buried under a spatially structured nuisance component induced by the host star. The exoplanet signals can only be identified when combining several observations with dedicated detection algorithms. In contrast to most of existing methods, we propose to learn a model of the spatial, temporal and spectral characteristics of the nuisance, directly from the observations. In a pre-processing step, a statistical model of their correlations is built locally, and the data are centered and whitened to improve both their stationarity and signal-to-noise ratio (SNR). A convolutional neural network (CNN) is then trained in a supervised fashion to detect the residual signature of synthetic sources in the pre-processed images. Our method leads to a better trade-off between precision and recall than standard approaches in the field. It also outperforms a state-of-the-art algorithm based solely on a statistical framework. Besides, the exploitation of the spectral diversity improves the performance compared to a similar model built solely from spatio-temporal data. △ Less

Submitted 21 June, 2023; originally announced June 2023.

Comments: accepted to EUSIPCO 2023

arXiv:2103.07750 [pdf, other]

doi 10.1145/3431920.3439303

Design Principles for Packet Deparsers on FPGAs

Authors: Thomas Luinaud, Jeferson Santiago da Silva, J. M. Pierre Langlois, Yvon Savaria

Abstract: The P4 language has drastically changed the networking field as it allows to quickly describe and implement new networking applications. Although a large variety of applications can be described with the P4 language, current programmable switch architectures impose significant constraints on P4 programs. To address this shortcoming, FPGAs have been explored as potential targets for P4 applications… ▽ More The P4 language has drastically changed the networking field as it allows to quickly describe and implement new networking applications. Although a large variety of applications can be described with the P4 language, current programmable switch architectures impose significant constraints on P4 programs. To address this shortcoming, FPGAs have been explored as potential targets for P4 applications. P4 applications are described using three abstractions: a packet parser, match-action tables, and a packet deparser, which reassembles the output packet with the result of the match-action tables. While implementations of packet parsers and match-action tables on FPGAs have been widely covered in the literature, no general design principles have been presented for the packet deparser. Indeed, implementing a high-speed and efficient deparser on FPGAs remains an open issue because it requires a large amount of interconnections and the architecture must be tailored to a P4 program. As a result, in several works where a P4 application is implemented on FPGAs, the deparser consumes a significant proportion of chip resources. Hence, in this paper, we address this issue by presenting design principles for efficient and high-speed deparsers on FPGAs. As an artifact, we introduce a tool that generates an efficient vendor-agnostic deparser architecture from a P4 program. Our design has been validated and simulated with a cocotb-based framework. The resulting architecture is implemented on Xilinx Ultrascale+ FPGAs and supports a throughput of more than 200 Gbps while reducing resource usage by almost 10$\times$ compared to other solutions. △ Less

Submitted 13 March, 2021; originally announced March 2021.

Comments: Presented at ISFPGA'21, 2021 Source code available at : https://github.com/luinaudt/deparser/tree/FPGA_paper

ACM Class: B.5.1

arXiv:2010.00627 [pdf]

CARLA: A Convolution Accelerator with a Reconfigurable and Low-Energy Architecture

Authors: Mehdi Ahmadi, Shervin Vakili, J. M. Pierre Langlois

Abstract: Convolutional Neural Networks (CNNs) have proven to be extremely accurate for image recognition, even outperforming human recognition capability. When deployed on battery-powered mobile devices, efficient computer architectures are required to enable fast and energy-efficient computation of costly convolution operations. Despite recent advances in hardware accelerator design for CNNs, two major pr… ▽ More Convolutional Neural Networks (CNNs) have proven to be extremely accurate for image recognition, even outperforming human recognition capability. When deployed on battery-powered mobile devices, efficient computer architectures are required to enable fast and energy-efficient computation of costly convolution operations. Despite recent advances in hardware accelerator design for CNNs, two major problems have not yet been addressed effectively, particularly when the convolution layers have highly diverse structures: (1) minimizing energy-hungry off-chip DRAM data movements; (2) maximizing the utilization factor of processing resources to perform convolutions. This work thus proposes an energy-efficient architecture equipped with several optimized dataflows to support the structural diversity of modern CNNs. The proposed approach is evaluated by implementing convolutional layers of VGGNet-16 and ResNet-50. Results show that the architecture achieves a Processing Element (PE) utilization factor of 98% for the majority of 3x3 and 1x1 convolutional layers, while limiting latency to 396.9 ms and 92.7 ms when performing convolutional layers of VGGNet-16 and ResNet-50, respectively. In addition, the proposed architecture benefits from the structured sparsity in ResNet-50 to reduce the latency to 42.5 ms when half of the channels are pruned. △ Less

Submitted 1 October, 2020; originally announced October 2020.

Comments: 12 pages

arXiv:2009.03576 [pdf, other]

Primal-dual splitting scheme with backtracking for handling with epigraphic constraint and sparse analysis regularization

Authors: Laurence Denneulin, Nelly Pustelnik, Maud Langlois, Ignace Loris, Éric Thiébaut

Abstract: The convergence of many proximal algorithms involving a gradient descent relies on its Lipschitz constant. To avoid computing it, backtracking rules can be used. While such a rule has already been designed for the forward-backward algorithm (FBwB), this scheme is not flexible enough when a non-differentiable penalization with a linear operator is added to a constraint. In this work we propose a ba… ▽ More The convergence of many proximal algorithms involving a gradient descent relies on its Lipschitz constant. To avoid computing it, backtracking rules can be used. While such a rule has already been designed for the forward-backward algorithm (FBwB), this scheme is not flexible enough when a non-differentiable penalization with a linear operator is added to a constraint. In this work we propose a backtracking rule for the primal-dual scheme (PDwB), and evaluate its performance for the epigraphical constrained high dynamical reconstruction in high contrast polarimetric imaging, under TV penalization. △ Less

Submitted 8 September, 2020; originally announced September 2020.

Comments: in Proceedings of iTWIST'20, Paper-ID: 12, Nantes, France, December, 2-4, 2020

arXiv:2004.07733 [pdf, other]

Bridging the Gap: FPGAs as Programmable Switches

Authors: Thomas Luinaud, Thibaut Stimpfling, Jeferson Santiago da Silva, Yvon Savaria, J. M. Pierre Langlois

Abstract: The emergence of P4, a domain specific language, coupled to PISA, a domain specific architecture, is revolutionizing the networking field. P4 allows to describe how packets are processed by a programmable data plane, spanning ASICs and CPUs, implementing PISA. Because the processing flexibility can be limited on ASICs, while the CPUs performance for networking tasks lag behind, recent works have p… ▽ More The emergence of P4, a domain specific language, coupled to PISA, a domain specific architecture, is revolutionizing the networking field. P4 allows to describe how packets are processed by a programmable data plane, spanning ASICs and CPUs, implementing PISA. Because the processing flexibility can be limited on ASICs, while the CPUs performance for networking tasks lag behind, recent works have proposed to implement PISA on FPGAs. However, little effort has been dedicated to analyze whether FPGAs are good candidates to implement PISA. In this work, we take a step back and evaluate the micro-architecture efficiency of various PISA blocks. We demonstrate, supported by a theoretical and experimental analysis, that the performance of a few PISA blocks is severely limited by the current FPGA architectures. Specifically, we show that match tables and programmable packet schedulers represent the main performance bottlenecks for FPGA-based programmable switches. Thus, we explore two avenues to alleviate these shortcomings. First, we identify network applications well tailored to current FPGAs. Second, to support a wider range of networking applications, we propose modifications to the FPGA architectures which can also be of interest out of the networking field. △ Less

Submitted 16 April, 2020; originally announced April 2020.

Comments: To be published in : IEEE International Conference on High Performance Switching and Routing 2020

ACM Class: B.5.1

arXiv:2002.09794 [pdf, other]

PoET-BiN: Power Efficient Tiny Binary Neurons

Authors: Sivakumar Chidambaram, J. M. Pierre Langlois, Jean Pierre David

Abstract: The success of neural networks in image classification has inspired various hardware implementations on embedded platforms such as Field Programmable Gate Arrays, embedded processors and Graphical Processing Units. These embedded platforms are constrained in terms of power, which is mainly consumed by the Multiply Accumulate operations and the memory accesses for weight fetching. Quantization and… ▽ More The success of neural networks in image classification has inspired various hardware implementations on embedded platforms such as Field Programmable Gate Arrays, embedded processors and Graphical Processing Units. These embedded platforms are constrained in terms of power, which is mainly consumed by the Multiply Accumulate operations and the memory accesses for weight fetching. Quantization and pruning have been proposed to address this issue. Though effective, these techniques do not take into account the underlying architecture of the embedded hardware. In this work, we propose PoET-BiN, a Look-Up Table based power efficient implementation on resource constrained embedded devices. A modified Decision Tree approach forms the backbone of the proposed implementation in the binary domain. A LUT access consumes far less power than the equivalent Multiply Accumulate operation it replaces, and the modified Decision Tree algorithm eliminates the need for memory accesses. We applied the PoET-BiN architecture to implement the classification layers of networks trained on MNIST, SVHN and CIFAR-10 datasets, with near state-of-the art results. The energy reduction for the classifier portion reaches up to six orders of magnitude compared to a floating point implementations and up to three orders of magnitude when compared to recent binary quantized neural networks. △ Less

Submitted 22 February, 2020; originally announced February 2020.

Comments: Accepted in MLSys 2020 conference

arXiv:2002.07711 [pdf]

An Energy-Efficient Accelerator Architecture with Serial Accumulation Dataflow for Deep CNNs

Authors: Mehdi Ahmadi, Shervin Vakili, J. M. Pierre Langlois

Abstract: Convolutional Neural Networks (CNNs) have shown outstanding accuracy for many vision tasks during recent years. When deploying CNNs on portable devices and embedded systems, however, the large number of parameters and computations result in long processing time and low battery life. An important factor in designing CNN hardware accelerators is to efficiently map the convolution computation onto ha… ▽ More Convolutional Neural Networks (CNNs) have shown outstanding accuracy for many vision tasks during recent years. When deploying CNNs on portable devices and embedded systems, however, the large number of parameters and computations result in long processing time and low battery life. An important factor in designing CNN hardware accelerators is to efficiently map the convolution computation onto hardware resources. In addition, to save battery life and reduce energy consumption, it is essential to reduce the number of DRAM accesses since DRAM consumes orders of magnitude more energy compared to other operations in hardware. In this paper, we propose an energy-efficient architecture which maximally utilizes its computational units for convolution operations while requiring a low number of DRAM accesses. The implementation results show that the proposed architecture performs one image recognition task using the VGGNet model with a latency of 393 ms and only 251.5 MB of DRAM accesses. △ Less

Submitted 14 February, 2020; originally announced February 2020.

Comments: 4 pages

arXiv:1903.06693 [pdf, ps, other]

Module-per-Object: a Human-Driven Methodology for C++-based High-Level Synthesis Design

Authors: Jeferson Santiago da Silva, François-Raymond Boyer, J. M. Pierre Langlois

Abstract: High-Level Synthesis (HLS) brings FPGAs to audiences previously unfamiliar to hardware design. However, achieving the highest Quality-of-Results (QoR) with HLS is still unattainable for most programmers. This requires detailed knowledge of FPGA architecture and hardware design in order to produce FPGA-friendly codes. Moreover, these codes are normally in conflict with best coding practices, which… ▽ More High-Level Synthesis (HLS) brings FPGAs to audiences previously unfamiliar to hardware design. However, achieving the highest Quality-of-Results (QoR) with HLS is still unattainable for most programmers. This requires detailed knowledge of FPGA architecture and hardware design in order to produce FPGA-friendly codes. Moreover, these codes are normally in conflict with best coding practices, which favor code reuse, modularity, and conciseness. To overcome these limitations, we propose Module-per-Object (MpO), a human-driven HLS design methodology intended for both hardware designers and software developers with limited FPGA expertise. MpO exploits modern C++ to raise the abstraction level while improving QoR, code readability and modularity. To guide HLS designers, we present the five characteristics of MpO classes. Each characteristic exploits the power of HLS-supported modern C++ features to build C++-based hardware modules. These characteristics lead to high-quality software descriptions and efficient hardware generation. We also present a use case of MpO, where we use C++ as the intermediate language for FPGA-targeted code generation from P4, a packet processing domain specific language. The MpO methodology is evaluated using three design experiments: a packet parser, a flow-based traffic manager, and a digital up-converter. Based on experiments, we show that MpO can be comparable to hand-written VHDL code while kee** a high abstraction level, human-readable coding style and modularity. Compared to traditional C-based HLS design, MpO leads to more efficient circuit generation, both in terms of performance and resource utilization. Also, the MpO approach notably improves software quality, augmenting parametrization while eliminating the incidence of code duplication. △ Less

Submitted 9 April, 2019; v1 submitted 4 March, 2019; originally announced March 2019.

Comments: 9 pages. Paper accepted for publication at The 27th IEEE International Symposium on Field-Programmable Custom Computing Machines, San Diego CA, April 28 - May 1, 2019

arXiv:1711.09155 [pdf, other]

SHIP: A Scalable High-performance IPv6 Lookup Algorithm that Exploits Prefix Characteristics

Authors: Thibaut Stimpfling, Normand Bélanger, J. M. Pierre Langlois, Yvon Savaria

Abstract: Due to the emergence of new network applications, current IP lookup engines must support high-bandwidth, low lookup latency and the ongoing growth of IPv6 networks. However, existing solutions are not designed to address jointly those three requirements. This paper introduces SHIP, an IPv6 lookup algorithm that exploits prefix characteristics to build a two-level data structure designed to meet fu… ▽ More Due to the emergence of new network applications, current IP lookup engines must support high-bandwidth, low lookup latency and the ongoing growth of IPv6 networks. However, existing solutions are not designed to address jointly those three requirements. This paper introduces SHIP, an IPv6 lookup algorithm that exploits prefix characteristics to build a two-level data structure designed to meet future application requirements. Using both prefix length distribution and prefix density, SHIP first clusters prefixes into groups sharing similar characteristics, then it builds a hybrid trie-tree for each prefix group. The compact and scalable data structure built can be stored in on-chip low-latency memories, and allows the traversal process to be parallelized and pipelined at each level in order to support high packet bandwidth. Evaluated on real and synthetic prefix tables holding up to 580 k IPv6 prefixes, SHIP has a logarithmic scaling factor in terms of the number of memory accesses, and a linear memory consumption scaling. Using the largest synthetic prefix table, simulations show that compared to other well-known approaches, SHIP uses at least 44% less memory per prefix, while reducing the memory latency by 61%. △ Less

Submitted 24 November, 2017; originally announced November 2017.

Comments: Submitted to EEE/ACM Transactions on Networking

arXiv:1711.06613 [pdf, other]

doi 10.1145/3174243.3174270

P4-compatible High-level Synthesis of Low Latency 100 Gb/s Streaming Packet Parsers in FPGAs

Authors: Jeferson Santiago da Silva, François-Raymond Boyer, J. M. Pierre Langlois

Abstract: Packet parsing is a key step in SDN-aware devices. Packet parsers in SDN networks need to be both reconfigurable and fast, to support the evolving network protocols and the increasing multi-gigabit data rates. The combination of packet processing languages with FPGAs seems to be the perfect match for these requirements. In this work, we develop an open-source FPGA-based configurable architecture f… ▽ More Packet parsing is a key step in SDN-aware devices. Packet parsers in SDN networks need to be both reconfigurable and fast, to support the evolving network protocols and the increasing multi-gigabit data rates. The combination of packet processing languages with FPGAs seems to be the perfect match for these requirements. In this work, we develop an open-source FPGA-based configurable architecture for arbitrary packet parsing to be used in SDN networks. We generate low latency and high-speed streaming packet parsers directly from a packet processing program. Our architecture is pipelined and entirely modeled using templated C++ classes. The pipeline layout is derived from a parser graph that corresponds a P4 code after a series of graph transformation rounds. The RTL code is generated from the C++ description using Xilinx Vivado HLS and synthesized with Xilinx Vivado. Our architecture achieves 100 Gb/s data rate in a Xilinx Virtex-7 FPGA while reducing the latency by 45% and the LUT usage by 40% compared to the state-of-the-art. △ Less

Submitted 17 November, 2017; originally announced November 2017.

Comments: Accepted for publication at the 26th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays February 25 - 27, 2018 Monterey Marriott Hotel, Monterey, California, 7 pages, 7 figures, 1 table

arXiv:1612.09524 [pdf]

Memory Efficient Multi-Scale Line Detector Architecture for Retinal Blood Vessel Segmentation

Authors: Hamza Bendaoudi, Farida Cheriet, J. M. Pierre Langlois

Abstract: This paper presents a memory efficient architecture that implements the Multi-Scale Line Detector (MSLD) algorithm for real-time retinal blood vessel detection in fundus images on a Zynq FPGA. This implementation benefits from the FPGA parallelism to drastically reduce the memory requirements of the MSLD from two images to a few values. The architecture is optimized in terms of resource utilizatio… ▽ More This paper presents a memory efficient architecture that implements the Multi-Scale Line Detector (MSLD) algorithm for real-time retinal blood vessel detection in fundus images on a Zynq FPGA. This implementation benefits from the FPGA parallelism to drastically reduce the memory requirements of the MSLD from two images to a few values. The architecture is optimized in terms of resource utilization by reusing the computations and optimizing the bit-width. The throughput is increased by designing fully pipelined functional units. The architecture is capable of achieving a comparable accuracy to its software implementation but 70x faster for low resolution images. For high resolution images, it achieves an acceleration by a factor of 323x. △ Less

Submitted 6 December, 2016; originally announced December 2016.

Comments: This paper was accepted and presented at Conference on Design and Architectures for Signal and Image Processing - DASIP 2016

arXiv:1611.05943 [pdf, ps, other]

Extern Objects in P4: an ROHC Header Compression Scheme Case Study

Authors: Jeferson Santiago da Silva, François-Raymond Boyer, Laurent-Olivier Chiquette, J. M. Pierre Langlois

Abstract: P4 is an emergent packet-processing language with which the user can describe how the packets are to be processed in a switching element. This paper presents a way to implement complex operations that are not natively supported in P4. In this work, we explored two different methods to add extensions to P4: i) using new native primitives and ii) using extern instances. As a case study, an ROHC enti… ▽ More P4 is an emergent packet-processing language with which the user can describe how the packets are to be processed in a switching element. This paper presents a way to implement complex operations that are not natively supported in P4. In this work, we explored two different methods to add extensions to P4: i) using new native primitives and ii) using extern instances. As a case study, an ROHC entity was implemented and invoked in a P4 program. The tests showed similar relative performance in both methods in terms of normalized packet latency. However, extern instances appear to be more suitable for target-specific switching applications, where the manufacturer/vendor can specify its own specific operations without changes in the P4 syntax and semantics. Extern instances only require changes in the target-specific backend compiler while kee** the P4 frontend compiler unchanged. The use of externs also results in a more elegant code solution since they are implemented outside the switch-core, thus reducing side effects risks that can be caused by a modification in a switch pipeline implementation. △ Less

Submitted 21 March, 2018; v1 submitted 17 November, 2016; originally announced November 2016.

Comments: 6 pages, 4 figures, 3 listings

arXiv:1609.07750 [pdf, other]

Accurate and Efficient Hyperbolic Tangent Activation Function on FPGA using the DCT Interpolation Filter

Authors: Ahmed M. Abdelsalam, J. M. Pierre Langlois, F. Cheriet

Abstract: Implementing an accurate and fast activation function with low cost is a crucial aspect to the implementation of Deep Neural Networks (DNNs) on FPGAs. We propose a high-accuracy approximation approach for the hyperbolic tangent activation function of artificial neurons in DNNs. It is based on the Discrete Cosine Transform Interpolation Filter (DCTIF). The proposed architecture combines simple arit… ▽ More Implementing an accurate and fast activation function with low cost is a crucial aspect to the implementation of Deep Neural Networks (DNNs) on FPGAs. We propose a high-accuracy approximation approach for the hyperbolic tangent activation function of artificial neurons in DNNs. It is based on the Discrete Cosine Transform Interpolation Filter (DCTIF). The proposed architecture combines simple arithmetic operations on stored samples of the hyperbolic tangent function and on input data. The proposed DCTIF implementation achieves two orders of magnitude greater precision than previous work while using the same or fewer computational resources. Various combinations of DCTIF parameters can be chosen to tradeoff the accuracy and complexity of the hyperbolic tangent function. In one case, the proposed architecture approximates the hyperbolic tangent activation function with 10E-5 maximum error while requiring only 1.52 Kbits memory and 57 LUTs of a Virtex-7 FPGA. We also discuss how the activation function accuracy affects the performance of DNNs in terms of their training and testing accuracies. We show that a high accuracy approximation can be necessary in order to maintain the same DNN training and testing performances realized by the exact function. △ Less

Submitted 25 September, 2016; originally announced September 2016.

Comments: 8 pages, 6 figures, 5 tables, submitted for the 25th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (ISFPGA), 22-24 February 2017, California, USA

Showing 1–14 of 14 results for author: Langlois, M