-
Efficient first-order algorithms for large-scale, non-smooth maximum entropy models with application to wildfire science
Authors:
Gabriel P. Langlois,
Jatan Buch,
Jérôme Darbon
Abstract:
Maximum entropy (Maxent) models are a class of statistical models that use the maximum entropy principle to estimate probability distributions from data. Due to the size of modern data sets, Maxent models need efficient optimization algorithms to scale well for big data applications. State-of-the-art algorithms for Maxent models, however, were not originally designed to handle big data sets; these algorithms either rely on technical devices that may yield unreliable numerical results, scale poorly, or require smoothness assumptions that many practical Maxent models lack. In this paper, we present novel optimization algorithms that overcome the shortcomings of state-of-the-art algorithms for training large-scale, non-smooth Maxent models. Our proposed first-order algorithms leverage the Kullback-Leibler divergence to train large-scale and non-smooth Maxent models efficiently. For Maxent models with discrete probability distribution of $n$ elements built from samples, each containing $m$ features, the stepsize parameters estimation and iterations in our algorithms scale on the order of $O(mn)$ operations and can be trivially parallelized. Moreover, the strong $\ell_{1}$ convexity of the Kullback--Leibler divergence allows for larger stepsize parameters, thereby speeding up the convergence rate of our algorithms. To illustrate the efficiency of our novel algorithms, we consider the problem of estimating probabilities of fire occurrences as a function of ecological features in the Western US MTBS-Interagency wildfire data set. Our numerical results show that our algorithms outperform the state of the arts by one order of magnitude and yield results that agree with physical models of wildfire occurrence and previous statistical analyses of wildfire drivers.
Submitted 11 March, 2024;
originally announced March 2024.
-
Fast and Low-Cost Approximate Multiplier for FPGAs using Dynamic Reconfiguration
Authors:
Shervin Vakili,
Mobin Vaziri,
Amirhossein Zarei,
J. M. Pierre Langlois
Abstract:
Multipliers are widely-used arithmetic operators in digital signal processing and machine learning circuits. Due to their relatively high complexity, they can have high latency and be a significant source of power consumption. One strategy to alleviate these limitations is to use approximate computing. This paper thus introduces an original FPGA-based approximate multiplier specifically optimized for machine learning computations. It utilizes dynamically reconfigurable lookup table (LUT) primitives in AMD-Xilinx technology to realize the core part of the computations. The paper provides an in-depth analysis of the hardware architecture, implementation outcomes, and accuracy evaluations of the multiplier proposed in INT8 precision. Implementation results on an AMD-Xilinx Kintex Ultrascale+ FPGA demonstrate remarkable savings of 64% and 67% in LUT utilization for signed multiplication and multiply-and-accumulation configurations, respectively, when compared to the standard Xilinx multiplier core. Accuracy measurements on four popular deep learning (DL) benchmarks indicate a minimal average accuracy decrease of less than 0.29% during post-training deployment, with the maximum reduction staying less than 0.33%. The source code of this work is available on GitHub.
Submitted 16 October, 2023;
originally announced October 2023.
-
Efficient and robust high-dimensional sparse logistic regression via nonlinear primal-dual hybrid gradient algorithms
Authors:
Jérôme Darbon,
Gabriel P. Langlois
Abstract:
Logistic regression is a widely used statistical model to describe the relationship between a binary response variable and predictor variables in data sets. It is often used in machine learning to identify important predictor variables. This task, variable selection, typically amounts to fitting a logistic regression model regularized by a convex combination of $\ell_1$ and $\ell_{2}^{2}$ penalties. Since modern big data sets can contain hundreds of thousands to billions of predictor variables, variable selection methods depend on efficient and robust optimization algorithms to perform well. State-of-the-art algorithms for variable selection, however, were not traditionally designed to handle big data sets; they either scale poorly in size or are prone to produce unreliable numerical results. It therefore remains challenging to perform variable selection on big data sets without access to adequate and costly computational resources. In this paper, we propose a nonlinear primal-dual algorithm that addresses these shortcomings. Specifically, we propose an iterative algorithm that provably computes a solution to a logistic regression problem regularized by an elastic net penalty in $O(T(m,n)\log(1/ε))$ operations, where $ε\in (0,1)$ denotes the tolerance and $T(m,n)$ denotes the number of arithmetic operations required to perform matrix-vector multiplication on a data set with $m$ samples each comprising $n$ features. This result improves on the known complexity bound of $O(\min(m^2n,mn^2)\log(1/ε))$ for first-order optimization methods such as the classic primal-dual hybrid gradient or forward-backward splitting methods.
Submitted 28 December, 2021; v1 submitted 30 November, 2021;
originally announced November 2021.
-
NeeDrop: Self-supervised Shape Representation from Sparse Point Clouds using Needle Dropping
Authors:
Alexandre Boulch,
Pierre-Alain Langlois,
Gilles Puy,
Renaud Marlet
Abstract:
There has been recently a growing interest for implicit shape representations. Contrary to explicit representations, they have no resolution limitations and they easily deal with a wide variety of surface topologies. To learn these implicit representations, current approaches rely on a certain level of shape supervision (e.g., inside/outside information or distance-to-shape knowledge), or at least require a dense point cloud (to approximate well enough the distance-to-shape). In contrast, we introduce NeeDrop, a self-supervised method for learning shape representations from possibly extremely sparse point clouds. Like in Buffon's needle problem, we "drop" (sample) needles on the point cloud and consider that, statistically, close to the surface, the needle end points lie on opposite sides of the surface. No shape knowledge is required and the point cloud can be highly sparse, e.g., as lidar point clouds acquired by vehicles. Previous self-supervised shape representation approaches fail to produce good-quality results on this kind of data. We obtain quantitative results on par with existing supervised approaches on shape reconstruction datasets and show promising qualitative results on hard autonomous driving datasets such as KITTI.
Submitted 2 December, 2021; v1 submitted 30 November, 2021;
originally announced November 2021.
-
Accelerated nonlinear primal-dual hybrid gradient methods with applications to supervised machine learning
Authors:
Jérôme Darbon,
Gabriel P. Langlois
Abstract:
The linear primal-dual hybrid gradient (PDHG) method is a first-order method that splits convex optimization problems with saddle-point structure into smaller subproblems. Unlike those obtained in most splitting methods, these subproblems can generally be solved efficiently because they involve simple operations such as matrix-vector multiplications or proximal mappings that are fast to evaluate numerically. This advantage comes at the price that the linear PDHG method requires precise stepsize parameters for the problem at hand to achieve an optimal convergence rate. Unfortunately, these stepsize parameters are often prohibitively expensive to compute for large-scale optimization problems, such as those in machine learning. This issue makes the otherwise simple linear PDHG method unsuitable for such problems, and it is also shared by most first-order optimization methods as well. To address this issue, we introduce accelerated nonlinear PDHG methods that achieve an optimal convergence rate with stepsize parameters that are simple and efficient to compute. We prove rigorous convergence results, including results for strongly convex or smooth problems posed on infinite-dimensional reflexive Banach spaces. We illustrate the efficiency of our methods on $\ell_{1}$-constrained logistic regression and entropy-regularized matrix games. Our numerical experiments show that the nonlinear PDHG methods are considerably faster than competing methods.
Submitted 3 April, 2022; v1 submitted 24 September, 2021;
originally announced September 2021.
-
Connecting Hamilton--Jacobi partial differential equations with maximum a posteriori and posterior mean estimators for some non-convex priors
Authors:
Jérôme Darbon,
Gabriel P. Langlois,
Tingwei Meng
Abstract:
Many imaging problems can be formulated as inverse problems expressed as finite-dimensional optimization problems. These optimization problems generally consist of minimizing the sum of a data fidelity and regularization terms. In [23,26], connections between these optimization problems and (multi-time) Hamilton--Jacobi partial differential equations have been proposed under the convexity assumptions of both the data fidelity and regularization terms. In particular, under these convexity assumptions, some representation formulas for a minimizer can be obtained. From a Bayesian perspective, such a minimizer can be seen as a maximum a posteriori estimator. In this chapter, we consider a certain class of non-convex regularizations and show that similar representation formulas for the minimizer can also be obtained. This is achieved by leveraging min-plus algebra techniques that have been originally developed for solving certain Hamilton--Jacobi partial differential equations arising in optimal control. Note that connections between viscous Hamilton--Jacobi partial differential equations and Bayesian posterior mean estimators with Gaussian data fidelity terms and log-concave priors have been highlighted in [25]. We also present similar results for certain Bayesian posterior mean estimators with Gaussian data fidelity and certain non-log-concave priors using an analogue of min-plus algebra techniques.
Submitted 22 April, 2021;
originally announced April 2021.
-
Design Principles for Packet Deparsers on FPGAs
Authors:
Thomas Luinaud,
Jeferson Santiago da Silva,
J. M. Pierre Langlois,
Yvon Savaria
Abstract:
The P4 language has drastically changed the networking field as it allows to quickly describe and implement new networking applications. Although a large variety of applications can be described with the P4 language, current programmable switch architectures impose significant constraints on P4 programs. To address this shortcoming, FPGAs have been explored as potential targets for P4 applications. P4 applications are described using three abstractions: a packet parser, match-action tables, and a packet deparser, which reassembles the output packet with the result of the match-action tables. While implementations of packet parsers and match-action tables on FPGAs have been widely covered in the literature, no general design principles have been presented for the packet deparser. Indeed, implementing a high-speed and efficient deparser on FPGAs remains an open issue because it requires a large amount of interconnections and the architecture must be tailored to a P4 program. As a result, in several works where a P4 application is implemented on FPGAs, the deparser consumes a significant proportion of chip resources. Hence, in this paper, we address this issue by presenting design principles for efficient and high-speed deparsers on FPGAs. As an artifact, we introduce a tool that generates an efficient vendor-agnostic deparser architecture from a P4 program. Our design has been validated and simulated with a cocotb-based framework. The resulting architecture is implemented on Xilinx Ultrascale+ FPGAs and supports a throughput of more than 200 Gbps while reducing resource usage by almost 10$\times$ compared to other solutions.
Submitted 13 March, 2021;
originally announced March 2021.
-
CARLA: A Convolution Accelerator with a Reconfigurable and Low-Energy Architecture
Authors:
Mehdi Ahmadi,
Shervin Vakili,
J. M. Pierre Langlois
Abstract:
Convolutional Neural Networks (CNNs) have proven to be extremely accurate for image recognition, even outperforming human recognition capability. When deployed on battery-powered mobile devices, efficient computer architectures are required to enable fast and energy-efficient computation of costly convolution operations. Despite recent advances in hardware accelerator design for CNNs, two major problems have not yet been addressed effectively, particularly when the convolution layers have highly diverse structures: (1) minimizing energy-hungry off-chip DRAM data movements; (2) maximizing the utilization factor of processing resources to perform convolutions. This work thus proposes an energy-efficient architecture equipped with several optimized dataflows to support the structural diversity of modern CNNs. The proposed approach is evaluated by implementing convolutional layers of VGGNet-16 and ResNet-50. Results show that the architecture achieves a Processing Element (PE) utilization factor of 98% for the majority of 3x3 and 1x1 convolutional layers, while limiting latency to 396.9 ms and 92.7 ms when performing convolutional layers of VGGNet-16 and ResNet-50, respectively. In addition, the proposed architecture benefits from the structured sparsity in ResNet-50 to reduce the latency to 42.5 ms when half of the channels are pruned.
Submitted 1 October, 2020;
originally announced October 2020.
-
Bridging the Gap: FPGAs as Programmable Switches
Authors:
Thomas Luinaud,
Thibaut Stimpfling,
Jeferson Santiago da Silva,
Yvon Savaria,
J. M. Pierre Langlois
Abstract:
The emergence of P4, a domain specific language, coupled to PISA, a domain specific architecture, is revolutionizing the networking field. P4 allows to describe how packets are processed by a programmable data plane, spanning ASICs and CPUs, implementing PISA. Because the processing flexibility can be limited on ASICs, while the CPUs performance for networking tasks lag behind, recent works have proposed to implement PISA on FPGAs. However, little effort has been dedicated to analyze whether FPGAs are good candidates to implement PISA. In this work, we take a step back and evaluate the micro-architecture efficiency of various PISA blocks. We demonstrate, supported by a theoretical and experimental analysis, that the performance of a few PISA blocks is severely limited by the current FPGA architectures. Specifically, we show that match tables and programmable packet schedulers represent the main performance bottlenecks for FPGA-based programmable switches. Thus, we explore two avenues to alleviate these shortcomings. First, we identify network applications well tailored to current FPGAs. Second, to support a wider range of networking applications, we propose modifications to the FPGA architectures which can also be of interest out of the networking field.
Submitted 16 April, 2020;
originally announced April 2020.
-
PoET-BiN: Power Efficient Tiny Binary Neurons
Authors:
Sivakumar Chidambaram,
J. M. Pierre Langlois,
Jean Pierre David
Abstract:
The success of neural networks in image classification has inspired various hardware implementations on embedded platforms such as Field Programmable Gate Arrays, embedded processors and Graphical Processing Units. These embedded platforms are constrained in terms of power, which is mainly consumed by the Multiply Accumulate operations and the memory accesses for weight fetching. Quantization and pruning have been proposed to address this issue. Though effective, these techniques do not take into account the underlying architecture of the embedded hardware. In this work, we propose PoET-BiN, a Look-Up Table based power efficient implementation on resource constrained embedded devices. A modified Decision Tree approach forms the backbone of the proposed implementation in the binary domain. A LUT access consumes far less power than the equivalent Multiply Accumulate operation it replaces, and the modified Decision Tree algorithm eliminates the need for memory accesses. We applied the PoET-BiN architecture to implement the classification layers of networks trained on MNIST, SVHN and CIFAR-10 datasets, with near state-of-the art results. The energy reduction for the classifier portion reaches up to six orders of magnitude compared to a floating point implementations and up to three orders of magnitude when compared to recent binary quantized neural networks.
Submitted 22 February, 2020;
originally announced February 2020.
-
An Energy-Efficient Accelerator Architecture with Serial Accumulation Dataflow for Deep CNNs
Authors:
Mehdi Ahmadi,
Shervin Vakili,
J. M. Pierre Langlois
Abstract:
Convolutional Neural Networks (CNNs) have shown outstanding accuracy for many vision tasks during recent years. When deploying CNNs on portable devices and embedded systems, however, the large number of parameters and computations result in long processing time and low battery life. An important factor in designing CNN hardware accelerators is to efficiently map the convolution computation onto hardware resources. In addition, to save battery life and reduce energy consumption, it is essential to reduce the number of DRAM accesses since DRAM consumes orders of magnitude more energy compared to other operations in hardware. In this paper, we propose an energy-efficient architecture which maximally utilizes its computational units for convolution operations while requiring a low number of DRAM accesses. The implementation results show that the proposed architecture performs one image recognition task using the VGGNet model with a latency of 393 ms and only 251.5 MB of DRAM accesses.
Submitted 14 February, 2020;
originally announced February 2020.
-
Surface Reconstruction from 3D Line Segments
Authors:
Pierre-Alain Langlois,
Alexandre Boulch,
Renaud Marlet
Abstract:
In man-made environments such as indoor scenes, when point-based 3D reconstruction fails due to the lack of texture, lines can still be detected and used to support surfaces. We present a novel method for watertight piecewise-planar surface reconstruction from 3D line segments with visibility information. First, planes are extracted by a novel RANSAC approach for line segments that allows multiple shape support. Then, each 3D cell of a plane arrangement is labeled full or empty based on line attachment to planes, visibility and regularization. Experiments show the robustness to sparse input data, noise and outliers.
Submitted 1 November, 2019;
originally announced November 2019.
-
Pose from Shape: Deep Pose Estimation for Arbitrary 3D Objects
Authors:
Yang Xiao,
Xuchong Qiu,
Pierre-Alain Langlois,
Mathieu Aubry,
Renaud Marlet
Abstract:
Most deep pose estimation methods need to be trained for specific object instances or categories. In this work we propose a completely generic deep pose estimation approach, which does not require the network to have been trained on relevant categories, nor objects in a category to have a canonical pose. We believe this is a crucial step to design robotic systems that can interact with new objects in the wild not belonging to a predefined category. Our main insight is to dynamically condition pose estimation with a representation of the 3D shape of the target object. More precisely, we train a Convolutional Neural Network that takes as input both a test image and a 3D model, and outputs the relative 3D pose of the object in the input image with respect to the 3D model. We demonstrate that our method boosts performances for supervised category pose estimation on standard benchmarks, namely Pascal3D+, ObjectNet3D and Pix3D, on which we provide results superior to the state of the art. More importantly, we show that our network trained on everyday man-made objects from ShapeNet generalizes without any additional training to completely new types of 3D objects by providing results on the LINEMOD dataset as well as on natural entities such as animals from ImageNet.
Submitted 5 August, 2019; v1 submitted 12 June, 2019;
originally announced June 2019.
-
Module-per-Object: a Human-Driven Methodology for C++-based High-Level Synthesis Design
Authors:
Jeferson Santiago da Silva,
François-Raymond Boyer,
J. M. Pierre Langlois
Abstract:
High-Level Synthesis (HLS) brings FPGAs to audiences previously unfamiliar to hardware design. However, achieving the highest Quality-of-Results (QoR) with HLS is still unattainable for most programmers. This requires detailed knowledge of FPGA architecture and hardware design in order to produce FPGA-friendly codes. Moreover, these codes are normally in conflict with best coding practices, which favor code reuse, modularity, and conciseness.
To overcome these limitations, we propose Module-per-Object (MpO), a human-driven HLS design methodology intended for both hardware designers and software developers with limited FPGA expertise. MpO exploits modern C++ to raise the abstraction level while improving QoR, code readability and modularity. To guide HLS designers, we present the five characteristics of MpO classes. Each characteristic exploits the power of HLS-supported modern C++ features to build C++-based hardware modules. These characteristics lead to high-quality software descriptions and efficient hardware generation. We also present a use case of MpO, where we use C++ as the intermediate language for FPGA-targeted code generation from P4, a packet processing domain specific language. The MpO methodology is evaluated using three design experiments: a packet parser, a flow-based traffic manager, and a digital up-converter. Based on experiments, we show that MpO can be comparable to hand-written VHDL code while keeping a high abstraction level, human-readable coding style and modularity. Compared to traditional C-based HLS design, MpO leads to more efficient circuit generation, both in terms of performance and resource utilization. Also, the MpO approach notably improves software quality, augmenting parametrization while eliminating the incidence of code duplication.
Submitted 9 April, 2019; v1 submitted 4 March, 2019;
originally announced March 2019.
-
SHIP: A Scalable High-performance IPv6 Lookup Algorithm that Exploits Prefix Characteristics
Authors:
Thibaut Stimpfling,
Normand Bélanger,
J. M. Pierre Langlois,
Yvon Savaria
Abstract:
Due to the emergence of new network applications, current IP lookup engines must support high-bandwidth, low lookup latency and the ongoing growth of IPv6 networks. However, existing solutions are not designed to address jointly those three requirements. This paper introduces SHIP, an IPv6 lookup algorithm that exploits prefix characteristics to build a two-level data structure designed to meet future application requirements. Using both prefix length distribution and prefix density, SHIP first clusters prefixes into groups sharing similar characteristics, then it builds a hybrid trie-tree for each prefix group. The compact and scalable data structure built can be stored in on-chip low-latency memories, and allows the traversal process to be parallelized and pipelined at each level in order to support high packet bandwidth. Evaluated on real and synthetic prefix tables holding up to 580 k IPv6 prefixes, SHIP has a logarithmic scaling factor in terms of the number of memory accesses, and a linear memory consumption scaling. Using the largest synthetic prefix table, simulations show that compared to other well-known approaches, SHIP uses at least 44% less memory per prefix, while reducing the memory latency by 61%.
Submitted 24 November, 2017;
originally announced November 2017.
-
P4-compatible High-level Synthesis of Low Latency 100 Gb/s Streaming Packet Parsers in FPGAs
Authors:
Jeferson Santiago da Silva,
François-Raymond Boyer,
J. M. Pierre Langlois
Abstract:
Packet parsing is a key step in SDN-aware devices. Packet parsers in SDN networks need to be both reconfigurable and fast, to support the evolving network protocols and the increasing multi-gigabit data rates. The combination of packet processing languages with FPGAs seems to be the perfect match for these requirements. In this work, we develop an open-source FPGA-based configurable architecture for arbitrary packet parsing to be used in SDN networks. We generate low latency and high-speed streaming packet parsers directly from a packet processing program. Our architecture is pipelined and entirely modeled using templated C++ classes. The pipeline layout is derived from a parser graph that corresponds a P4 code after a series of graph transformation rounds. The RTL code is generated from the C++ description using Xilinx Vivado HLS and synthesized with Xilinx Vivado. Our architecture achieves 100 Gb/s data rate in a Xilinx Virtex-7 FPGA while reducing the latency by 45% and the LUT usage by 40% compared to the state-of-the-art.
Submitted 17 November, 2017;
originally announced November 2017.
-
Memory Efficient Multi-Scale Line Detector Architecture for Retinal Blood Vessel Segmentation
Authors:
Hamza Bendaoudi,
Farida Cheriet,
J. M. Pierre Langlois
Abstract:
This paper presents a memory-efficient architecture that implements the Multi-Scale Line Detector (MSLD) algorithm for real-time retinal blood vessel detection in fundus images on a Zynq FPGA. The implementation exploits FPGA parallelism to drastically reduce the memory requirements of the MSLD from two images to a few values. The architecture is optimized in terms of resource utilization by reusing computations and optimizing the bit-width. The throughput is increased by designing fully pipelined functional units. The architecture achieves accuracy comparable to that of its software implementation while running 70x faster for low-resolution images. For high-resolution images, it achieves an acceleration by a factor of 323x.
Submitted 6 December, 2016;
originally announced December 2016.
-
Extern Objects in P4: an ROHC Header Compression Scheme Case Study
Authors:
Jeferson Santiago da Silva,
François-Raymond Boyer,
Laurent-Olivier Chiquette,
J. M. Pierre Langlois
Abstract:
P4 is an emergent packet-processing language with which the user can describe how the packets are to be processed in a switching element. This paper presents a way to implement complex operations that are not natively supported in P4. In this work, we explored two different methods to add extensions to P4: i) using new native primitives and ii) using extern instances. As a case study, an ROHC entity was implemented and invoked in a P4 program. The tests showed similar relative performance in both methods in terms of normalized packet latency.
However, extern instances appear to be more suitable for target-specific switching applications, where the manufacturer/vendor can specify its own operations without changes to the P4 syntax and semantics. Extern instances only require changes in the target-specific backend compiler while keeping the P4 frontend compiler unchanged. The use of externs also results in a more elegant code solution since they are implemented outside the switch core, thus reducing the risk of side effects that can be caused by a modification in a switch pipeline implementation.
Submitted 21 March, 2018; v1 submitted 17 November, 2016;
originally announced November 2016.
-
Accurate and Efficient Hyperbolic Tangent Activation Function on FPGA using the DCT Interpolation Filter
Authors:
Ahmed M. Abdelsalam,
J. M. Pierre Langlois,
F. Cheriet
Abstract:
Implementing an accurate and fast activation function at low cost is crucial to the implementation of Deep Neural Networks (DNNs) on FPGAs. We propose a high-accuracy approximation approach for the hyperbolic tangent activation function of artificial neurons in DNNs. It is based on the Discrete Cosine Transform Interpolation Filter (DCTIF). The proposed architecture combines simple arithmetic operations on stored samples of the hyperbolic tangent function and on input data. The proposed DCTIF implementation achieves two orders of magnitude greater precision than previous work while using the same or fewer computational resources. Various combinations of DCTIF parameters can be chosen to trade off the accuracy and complexity of the hyperbolic tangent function. In one case, the proposed architecture approximates the hyperbolic tangent activation function with 10E-5 maximum error while requiring only 1.52 Kbits of memory and 57 LUTs of a Virtex-7 FPGA. We also discuss how the activation function accuracy affects the performance of DNNs in terms of their training and testing accuracies. We show that a high-accuracy approximation can be necessary in order to maintain the same DNN training and testing performance achieved by the exact function.
Submitted 25 September, 2016;
originally announced September 2016.
-
A CMOS Tailed Tent Map for the Generation of Uniformly Distributed Chaotic Sequences
Authors:
Sergio Callegari,
Gianluca Setti,
Peter J. Langlois
Abstract:
This paper describes the design of a modified tent map characterized by a uniform probability density function. The use of this map is proposed as an alternative to the tent map and the Bernoulli shift. It is shown that practical circuits implementing the latter two maps may possess parasitic stable equilibria, a fact that would prevent the desired chaotic behavior of the system. Moreover, commonly used strategies to avoid the onset of parasitic equilibria also affect the uniformity of the probability density function. By contrast, the proposed tailed tent map ensures a certain degree of robustness to parameter deviations without compromising the statistical properties of the system.
Submitted 26 September, 2014; v1 submitted 3 February, 2013;
originally announced February 2013.
-
A multiagent urban traffic simulation
Authors:
Pierrick Tranouez,
Eric Daudé,
Patrice Langlois
Abstract:
We built a multiagent simulation of urban traffic to model both ordinary traffic and emergency or crisis mode traffic. This simulation first builds a modeled road network based on detailed geographical information. On this network, the simulation creates two populations of agents: the Transporters and the Mobiles. Transporters embody the roads themselves; they are utilitarian and meant to handle the low-level realism of the simulation. Mobile agents embody the vehicles that circulate on the network. They have one or several destinations that they try to reach, initially using their beliefs about the structure of the network (length of the edges, speed limits, number of lanes, etc.). Nonetheless, when confronted with a dynamic, emergence-prone environment (other vehicles, unexpectedly closed ways or lanes, traffic jams, etc.), the rather reactive agent will activate more cognitive modules to adapt its beliefs, desires and intentions. It may change its destination(s), change the tactics used to reach the destination (favoring less used roads, following other agents, using general headings), etc. We describe the current validation of our model and the next planned improvements, both in validation and in functionality.
Submitted 26 January, 2012;
originally announced January 2012.
-
A multiagent urban traffic simulation. Part II: dealing with the extraordinary
Authors:
Eric Daudé,
Pierrick Tranouez,
Patrice Langlois
Abstract:
In Probabilistic Risk Management (PRM), risk is characterized by two quantities: the magnitude (or severity) of the adverse consequences that can potentially result from a given activity or action, and the likelihood of occurrence of those adverse consequences. But a risk seldom exists in isolation: chains of consequences must be examined, as the outcome of one risk can increase the likelihood of other risks. Systemic theory must complement classic PRM. Indeed, these chains are composed of many different elements, all of which may have critical importance at many different levels. Furthermore, when urban catastrophes are envisioned, space and time constraints are key determinants of the workings and dynamics of these chains of catastrophes: models must include a correct spatial topology of the studied risk. Finally, the literature insists on the importance small events can have for risk on a greater scale: urban risk management models belong to self-organized criticality theory. We chose multiagent systems to incorporate this property in our model: the behavior of a single agent can transform the dynamics of large groups of agents.
Submitted 6 October, 2009;
originally announced October 2009.
-
A multiagent urban traffic simulation Part I: dealing with the ordinary
Authors:
Pierrick Tranouez,
Patrice Langlois,
Eric Daudé
Abstract:
In this article, we describe a multiagent urban traffic simulation, as we believe individual-based modeling is necessary to capture the complex influence that the actions of an individual vehicle can have on the overall flow of vehicles. We first describe how we build a graph description of the network from purely geometric data, ESRI shapefiles. We then explain how we add traffic-related data to this graph. We then present the model of the vehicle agents: origin and destination, driving behavior, multiple lanes, crossroads, and interactions with the other vehicles in day-to-day, "ordinary" traffic. We conclude by presenting the resulting simulation of this model on the Rouen agglomeration.
Submitted 5 September, 2009;
originally announced September 2009.
-
Faithful Polynomial Evaluation with Compensated Horner Algorithm
Authors:
Philippe Langlois,
Nicolas Louvet
Abstract:
This paper presents two sufficient conditions to ensure a faithful evaluation of a polynomial in IEEE-754 floating point arithmetic. Faithfulness means that the computed value is one of the two floating point neighbours of the exact result; it can be satisfied using a more accurate algorithm than the classic Horner scheme. The first condition is an a priori bound on the polynomial condition number, derived from the error analysis of the compensated Horner algorithm. The second condition is both dynamic and validated, and checks at run time the faithfulness of a given evaluation. Numerical experiments illustrate the behavior of these two conditions and show that the associated running-time overhead remains attractive.
Submitted 20 October, 2006;
originally announced October 2006.
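As a rough illustration of the compensated Horner algorithm named in the abstract above, the following Python sketch pairs each multiplication and addition of the classic Horner scheme with an error-free transformation and accumulates the rounding errors in a correction term. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`two_sum`, `two_prod`, `comp_horner`, `exact_eval`) are hypothetical, the product error is obtained by Dekker/Veltkamp splitting rather than a fused multiply-add, overflow and underflow are ignored, and the paper's two faithfulness conditions are not implemented.

```python
from fractions import Fraction


def two_sum(a, b):
    """Error-free sum: returns (s, e) with s = fl(a + b) and s + e == a + b."""
    s = a + b
    z = s - a
    e = (a - (s - z)) + (b - z)
    return s, e


def split(a):
    """Veltkamp splitting of a double into two non-overlapping halves."""
    c = 134217729.0 * a  # 2**27 + 1
    hi = c - (c - a)
    return hi, a - hi


def two_prod(a, b):
    """Error-free product: returns (p, e) with p = fl(a * b) and p + e == a * b."""
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = al * bl - (((p - ah * bh) - al * bh) - ah * bl)
    return p, e


def horner(coeffs, x):
    """Classic Horner scheme; coeffs are ordered from highest degree to lowest."""
    r = coeffs[0]
    for c in coeffs[1:]:
        r = r * x + c
    return r


def comp_horner(coeffs, x):
    """Horner scheme plus a running correction built from the rounding errors."""
    r = coeffs[0]
    corr = 0.0
    for c in coeffs[1:]:
        p, err_mul = two_prod(r, x)
        r, err_add = two_sum(p, c)
        corr = corr * x + (err_mul + err_add)
    return r + corr


def exact_eval(coeffs, x):
    """Exact rational reference value, used only to measure the errors."""
    r = Fraction(0)
    for c in coeffs:
        r = r * Fraction(x) + Fraction(c)
    return r


if __name__ == "__main__":
    # (x - 1)**5 in expanded form is ill-conditioned near x = 1.
    coeffs = [1.0, -5.0, 10.0, -10.0, 5.0, -1.0]
    x = 1.0001
    exact = exact_eval(coeffs, x)
    for name, f in (("horner", horner), ("comp_horner", comp_horner)):
        rel = abs(Fraction(f(coeffs, x)) - exact) / abs(exact)
        print(name, float(rel))
```

Running the script prints the relative error of both schemes against the exact rational value; on an ill-conditioned input such as this one, the compensated result is expected to be far closer to the exact value than the plain Horner result.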