Search | arXiv e-print repository

arXiv:2403.06816 [pdf, other]

Efficient first-order algorithms for large-scale, non-smooth maximum entropy models with application to wildfire science

Authors: Gabriel P. Langlois, Jatan Buch, Jérôme Darbon

Abstract: Maximum entropy (Maxent) models are a class of statistical models that use the maximum entropy principle to estimate probability distributions from data. Due to the size of modern data sets, Maxent models need efficient optimization algorithms to scale well for big data applications. State-of-the-art algorithms for Maxent models, however, were not originally designed to handle big data sets; these… ▽ More Maximum entropy (Maxent) models are a class of statistical models that use the maximum entropy principle to estimate probability distributions from data. Due to the size of modern data sets, Maxent models need efficient optimization algorithms to scale well for big data applications. State-of-the-art algorithms for Maxent models, however, were not originally designed to handle big data sets; these algorithms either rely on technical devices that may yield unreliable numerical results, scale poorly, or require smoothness assumptions that many practical Maxent models lack. In this paper, we present novel optimization algorithms that overcome the shortcomings of state-of-the-art algorithms for training large-scale, non-smooth Maxent models. Our proposed first-order algorithms leverage the Kullback-Leibler divergence to train large-scale and non-smooth Maxent models efficiently. For Maxent models with discrete probability distribution of $n$ elements built from samples, each containing $m$ features, the stepsize parameters estimation and iterations in our algorithms scale on the order of $O(mn)$ operations and can be trivially parallelized. Moreover, the strong $\ell_{1}$ convexity of the Kullback--Leibler divergence allows for larger stepsize parameters, thereby speeding up the convergence rate of our algorithms. To illustrate the efficiency of our novel algorithms, we consider the problem of estimating probabilities of fire occurrences as a function of ecological features in the Western US MTBS-Interagency wildfire data set. Our numerical results show that our algorithms outperform the state of the arts by one order of magnitude and yield results that agree with physical models of wildfire occurrence and previous statistical analyses of wildfire drivers. △ Less

Submitted 11 March, 2024; originally announced March 2024.

Comments: The main text of our manuscript is 20 pages long, the appendices are 4 pages long, and the references are 4 pages long,for a total of 28 pages

MSC Class: 90C30; 90C06; 90C90; 62P12 ACM Class: G.1.6; G.3; J.2

arXiv:2310.10053 [pdf, ps, other]

Fast and Low-Cost Approximate Multiplier for FPGAs using Dynamic Reconfiguration

Authors: Shervin Vakili, Mobin Vaziri, Amirhossein Zarei, J. M. Pierre Langlois

Abstract: Multipliers are widely-used arithmetic operators in digital signal processing and machine learning circuits. Due to their relatively high complexity, they can have high latency and be a significant source of power consumption. One strategy to alleviate these limitations is to use approximate computing. This paper thus introduces an original FPGA-based approximate multiplier specifically optimized… ▽ More Multipliers are widely-used arithmetic operators in digital signal processing and machine learning circuits. Due to their relatively high complexity, they can have high latency and be a significant source of power consumption. One strategy to alleviate these limitations is to use approximate computing. This paper thus introduces an original FPGA-based approximate multiplier specifically optimized for machine learning computations. It utilizes dynamically reconfigurable lookup table (LUT) primitives in AMD-Xilinx technology to realize the core part of the computations. The paper provides an in-depth analysis of the hardware architecture, implementation outcomes, and accuracy evaluations of the multiplier proposed in INT8 precision. Implementation results on an AMD-Xilinx Kintex Ultrascale+ FPGA demonstrate remarkable savings of 64% and 67% in LUT utilization for signed multiplication and multiply-and-accumulation configurations, respectively, when compared to the standard Xilinx multiplier core. Accuracy measurements on four popular deep learning (DL) benchmarks indicate a minimal average accuracy decrease of less than 0.29% during post-training deployment, with the maximum reduction staying less than 0.33%. The source code of this work is available on GitHub. △ Less

Submitted 16 October, 2023; originally announced October 2023.

Comments: 5 figures, 3 tables

arXiv:2111.15426 [pdf, ps, other]

Efficient and robust high-dimensional sparse logistic regression via nonlinear primal-dual hybrid gradient algorithms

Authors: Jérôme Darbon, Gabriel P. Langlois

Abstract: Logistic regression is a widely used statistical model to describe the relationship between a binary response variable and predictor variables in data sets. It is often used in machine learning to identify important predictor variables. This task, variable selection, typically amounts to fitting a logistic regression model regularized by a convex combination of $\ell_1$ and $\ell_{2}^{2}$ penaltie… ▽ More Logistic regression is a widely used statistical model to describe the relationship between a binary response variable and predictor variables in data sets. It is often used in machine learning to identify important predictor variables. This task, variable selection, typically amounts to fitting a logistic regression model regularized by a convex combination of $\ell_1$ and $\ell_{2}^{2}$ penalties. Since modern big data sets can contain hundreds of thousands to billions of predictor variables, variable selection methods depend on efficient and robust optimization algorithms to perform well. State-of-the-art algorithms for variable selection, however, were not traditionally designed to handle big data sets; they either scale poorly in size or are prone to produce unreliable numerical results. It therefore remains challenging to perform variable selection on big data sets without access to adequate and costly computational resources. In this paper, we propose a nonlinear primal-dual algorithm that addresses these shortcomings. Specifically, we propose an iterative algorithm that provably computes a solution to a logistic regression problem regularized by an elastic net penalty in $O(T(m,n)\log(1/ε))$ operations, where $ε\in (0,1)$ denotes the tolerance and $T(m,n)$ denotes the number of arithmetic operations required to perform matrix-vector multiplication on a data set with $m$ samples each comprising $n$ features. This result improves on the known complexity bound of $O(\min(m^2n,mn^2)\log(1/ε))$ for first-order optimization methods such as the classic primal-dual hybrid gradient or forward-backward splitting methods. △ Less

Submitted 28 December, 2021; v1 submitted 30 November, 2021; originally announced November 2021.

Comments: 15 pages

MSC Class: 65K10; 49M29; 62J12 ACM Class: G.1.6; I.2.6

arXiv:2111.15207 [pdf, other]

NeeDrop: Self-supervised Shape Representation from Sparse Point Clouds using Needle Drop**

Authors: Alexandre Boulch, Pierre-Alain Langlois, Gilles Puy, Renaud Marlet

Abstract: There has been recently a growing interest for implicit shape representations. Contrary to explicit representations, they have no resolution limitations and they easily deal with a wide variety of surface topologies. To learn these implicit representations, current approaches rely on a certain level of shape supervision (e.g., inside/outside information or distance-to-shape knowledge), or at least… ▽ More There has been recently a growing interest for implicit shape representations. Contrary to explicit representations, they have no resolution limitations and they easily deal with a wide variety of surface topologies. To learn these implicit representations, current approaches rely on a certain level of shape supervision (e.g., inside/outside information or distance-to-shape knowledge), or at least require a dense point cloud (to approximate well enough the distance-to-shape). In contrast, we introduce NeeDrop, a self-supervised method for learning shape representations from possibly extremely sparse point clouds. Like in Buffon's needle problem, we "drop" (sample) needles on the point cloud and consider that, statistically, close to the surface, the needle end points lie on opposite sides of the surface. No shape knowledge is required and the point cloud can be highly sparse, e.g., as lidar point clouds acquired by vehicles. Previous self-supervised shape representation approaches fail to produce good-quality results on this kind of data. We obtain quantitative results on par with existing supervised approaches on shape reconstruction datasets and show promising qualitative results on hard autonomous driving datasets such as KITTI. △ Less

Submitted 2 December, 2021; v1 submitted 30 November, 2021; originally announced November 2021.

Comments: 22 pages

Journal ref: International Conference on 3D Vision (3DV), 2021

arXiv:2109.12222 [pdf, ps, other]

Accelerated nonlinear primal-dual hybrid gradient methods with applications to supervised machine learning

Authors: Jérôme Darbon, Gabriel P. Langlois

Abstract: The linear primal-dual hybrid gradient (PDHG) method is a first-order method that splits convex optimization problems with saddle-point structure into smaller subproblems. Unlike those obtained in most splitting methods, these subproblems can generally be solved efficiently because they involve simple operations such as matrix-vector multiplications or proximal map**s that are fast to evaluate n… ▽ More The linear primal-dual hybrid gradient (PDHG) method is a first-order method that splits convex optimization problems with saddle-point structure into smaller subproblems. Unlike those obtained in most splitting methods, these subproblems can generally be solved efficiently because they involve simple operations such as matrix-vector multiplications or proximal map**s that are fast to evaluate numerically. This advantage comes at the price that the linear PDHG method requires precise stepsize parameters for the problem at hand to achieve an optimal convergence rate. Unfortunately, these stepsize parameters are often prohibitively expensive to compute for large-scale optimization problems, such as those in machine learning. This issue makes the otherwise simple linear PDHG method unsuitable for such problems, and it is also shared by most first-order optimization methods as well. To address this issue, we introduce accelerated nonlinear PDHG methods that achieve an optimal convergence rate with stepsize parameters that are simple and efficient to compute. We prove rigorous convergence results, including results for strongly convex or smooth problems posed on infinite-dimensional reflexive Banach spaces. We illustrate the efficiency of our methods on $\ell_{1}$-constrained logistic regression and entropy-regularized matrix games. Our numerical experiments show that the nonlinear PDHG methods are considerably faster than competing methods. △ Less

Submitted 3 April, 2022; v1 submitted 24 September, 2021; originally announced September 2021.

Comments: 52 pages, no figures

MSC Class: 65K10 (Primary); 49M29 (Secondary); 62J99 (Secondary) ACM Class: G.1.6; I.2.6

arXiv:2104.11285 [pdf, other]

Connecting Hamilton--Jacobi partial differential equations with maximum a posteriori and posterior mean estimators for some non-convex priors

Authors: Jérôme Darbon, Gabriel P. Langlois, Tingwei Meng

Abstract: Many imaging problems can be formulated as inverse problems expressed as finite-dimensional optimization problems. These optimization problems generally consist of minimizing the sum of a data fidelity and regularization terms. In [23,26], connections between these optimization problems and (multi-time) Hamilton--Jacobi partial differential equations have been proposed under the convexity assumpti… ▽ More Many imaging problems can be formulated as inverse problems expressed as finite-dimensional optimization problems. These optimization problems generally consist of minimizing the sum of a data fidelity and regularization terms. In [23,26], connections between these optimization problems and (multi-time) Hamilton--Jacobi partial differential equations have been proposed under the convexity assumptions of both the data fidelity and regularization terms. In particular, under these convexity assumptions, some representation formulas for a minimizer can be obtained. From a Bayesian perspective, such a minimizer can be seen as a maximum a posteriori estimator. In this chapter, we consider a certain class of non-convex regularizations and show that similar representation formulas for the minimizer can also be obtained. This is achieved by leveraging min-plus algebra techniques that have been originally developed for solving certain Hamilton--Jacobi partial differential equations arising in optimal control. Note that connections between viscous Hamilton--Jacobi partial differential equations and Bayesian posterior mean estimators with Gaussian data fidelity terms and log-concave priors have been highlighted in [25]. We also present similar results for certain Bayesian posterior mean estimators with Gaussian data fidelity and certain non-log-concave priors using an analogue of min-plus algebra techniques. △ Less

Submitted 22 April, 2021; originally announced April 2021.

arXiv:2103.07750 [pdf, other]

doi 10.1145/3431920.3439303

Design Principles for Packet Deparsers on FPGAs

Authors: Thomas Luinaud, Jeferson Santiago da Silva, J. M. Pierre Langlois, Yvon Savaria

Abstract: The P4 language has drastically changed the networking field as it allows to quickly describe and implement new networking applications. Although a large variety of applications can be described with the P4 language, current programmable switch architectures impose significant constraints on P4 programs. To address this shortcoming, FPGAs have been explored as potential targets for P4 applications… ▽ More The P4 language has drastically changed the networking field as it allows to quickly describe and implement new networking applications. Although a large variety of applications can be described with the P4 language, current programmable switch architectures impose significant constraints on P4 programs. To address this shortcoming, FPGAs have been explored as potential targets for P4 applications. P4 applications are described using three abstractions: a packet parser, match-action tables, and a packet deparser, which reassembles the output packet with the result of the match-action tables. While implementations of packet parsers and match-action tables on FPGAs have been widely covered in the literature, no general design principles have been presented for the packet deparser. Indeed, implementing a high-speed and efficient deparser on FPGAs remains an open issue because it requires a large amount of interconnections and the architecture must be tailored to a P4 program. As a result, in several works where a P4 application is implemented on FPGAs, the deparser consumes a significant proportion of chip resources. Hence, in this paper, we address this issue by presenting design principles for efficient and high-speed deparsers on FPGAs. As an artifact, we introduce a tool that generates an efficient vendor-agnostic deparser architecture from a P4 program. Our design has been validated and simulated with a cocotb-based framework. The resulting architecture is implemented on Xilinx Ultrascale+ FPGAs and supports a throughput of more than 200 Gbps while reducing resource usage by almost 10$\times$ compared to other solutions. △ Less

Submitted 13 March, 2021; originally announced March 2021.

Comments: Presented at ISFPGA'21, 2021 Source code available at : https://github.com/luinaudt/deparser/tree/FPGA_paper

ACM Class: B.5.1

arXiv:2010.00627 [pdf]

CARLA: A Convolution Accelerator with a Reconfigurable and Low-Energy Architecture

Authors: Mehdi Ahmadi, Shervin Vakili, J. M. Pierre Langlois

Abstract: Convolutional Neural Networks (CNNs) have proven to be extremely accurate for image recognition, even outperforming human recognition capability. When deployed on battery-powered mobile devices, efficient computer architectures are required to enable fast and energy-efficient computation of costly convolution operations. Despite recent advances in hardware accelerator design for CNNs, two major pr… ▽ More Convolutional Neural Networks (CNNs) have proven to be extremely accurate for image recognition, even outperforming human recognition capability. When deployed on battery-powered mobile devices, efficient computer architectures are required to enable fast and energy-efficient computation of costly convolution operations. Despite recent advances in hardware accelerator design for CNNs, two major problems have not yet been addressed effectively, particularly when the convolution layers have highly diverse structures: (1) minimizing energy-hungry off-chip DRAM data movements; (2) maximizing the utilization factor of processing resources to perform convolutions. This work thus proposes an energy-efficient architecture equipped with several optimized dataflows to support the structural diversity of modern CNNs. The proposed approach is evaluated by implementing convolutional layers of VGGNet-16 and ResNet-50. Results show that the architecture achieves a Processing Element (PE) utilization factor of 98% for the majority of 3x3 and 1x1 convolutional layers, while limiting latency to 396.9 ms and 92.7 ms when performing convolutional layers of VGGNet-16 and ResNet-50, respectively. In addition, the proposed architecture benefits from the structured sparsity in ResNet-50 to reduce the latency to 42.5 ms when half of the channels are pruned. △ Less

Submitted 1 October, 2020; originally announced October 2020.

Comments: 12 pages

arXiv:2004.07733 [pdf, other]

Bridging the Gap: FPGAs as Programmable Switches

Authors: Thomas Luinaud, Thibaut Stimpfling, Jeferson Santiago da Silva, Yvon Savaria, J. M. Pierre Langlois

Abstract: The emergence of P4, a domain specific language, coupled to PISA, a domain specific architecture, is revolutionizing the networking field. P4 allows to describe how packets are processed by a programmable data plane, spanning ASICs and CPUs, implementing PISA. Because the processing flexibility can be limited on ASICs, while the CPUs performance for networking tasks lag behind, recent works have p… ▽ More The emergence of P4, a domain specific language, coupled to PISA, a domain specific architecture, is revolutionizing the networking field. P4 allows to describe how packets are processed by a programmable data plane, spanning ASICs and CPUs, implementing PISA. Because the processing flexibility can be limited on ASICs, while the CPUs performance for networking tasks lag behind, recent works have proposed to implement PISA on FPGAs. However, little effort has been dedicated to analyze whether FPGAs are good candidates to implement PISA. In this work, we take a step back and evaluate the micro-architecture efficiency of various PISA blocks. We demonstrate, supported by a theoretical and experimental analysis, that the performance of a few PISA blocks is severely limited by the current FPGA architectures. Specifically, we show that match tables and programmable packet schedulers represent the main performance bottlenecks for FPGA-based programmable switches. Thus, we explore two avenues to alleviate these shortcomings. First, we identify network applications well tailored to current FPGAs. Second, to support a wider range of networking applications, we propose modifications to the FPGA architectures which can also be of interest out of the networking field. △ Less

Submitted 16 April, 2020; originally announced April 2020.

Comments: To be published in : IEEE International Conference on High Performance Switching and Routing 2020

ACM Class: B.5.1

arXiv:2002.09794 [pdf, other]

PoET-BiN: Power Efficient Tiny Binary Neurons

Authors: Sivakumar Chidambaram, J. M. Pierre Langlois, Jean Pierre David

Abstract: The success of neural networks in image classification has inspired various hardware implementations on embedded platforms such as Field Programmable Gate Arrays, embedded processors and Graphical Processing Units. These embedded platforms are constrained in terms of power, which is mainly consumed by the Multiply Accumulate operations and the memory accesses for weight fetching. Quantization and… ▽ More The success of neural networks in image classification has inspired various hardware implementations on embedded platforms such as Field Programmable Gate Arrays, embedded processors and Graphical Processing Units. These embedded platforms are constrained in terms of power, which is mainly consumed by the Multiply Accumulate operations and the memory accesses for weight fetching. Quantization and pruning have been proposed to address this issue. Though effective, these techniques do not take into account the underlying architecture of the embedded hardware. In this work, we propose PoET-BiN, a Look-Up Table based power efficient implementation on resource constrained embedded devices. A modified Decision Tree approach forms the backbone of the proposed implementation in the binary domain. A LUT access consumes far less power than the equivalent Multiply Accumulate operation it replaces, and the modified Decision Tree algorithm eliminates the need for memory accesses. We applied the PoET-BiN architecture to implement the classification layers of networks trained on MNIST, SVHN and CIFAR-10 datasets, with near state-of-the art results. The energy reduction for the classifier portion reaches up to six orders of magnitude compared to a floating point implementations and up to three orders of magnitude when compared to recent binary quantized neural networks. △ Less

Submitted 22 February, 2020; originally announced February 2020.

Comments: Accepted in MLSys 2020 conference

arXiv:2002.07711 [pdf]

An Energy-Efficient Accelerator Architecture with Serial Accumulation Dataflow for Deep CNNs

Authors: Mehdi Ahmadi, Shervin Vakili, J. M. Pierre Langlois

Abstract: Convolutional Neural Networks (CNNs) have shown outstanding accuracy for many vision tasks during recent years. When deploying CNNs on portable devices and embedded systems, however, the large number of parameters and computations result in long processing time and low battery life. An important factor in designing CNN hardware accelerators is to efficiently map the convolution computation onto ha… ▽ More Convolutional Neural Networks (CNNs) have shown outstanding accuracy for many vision tasks during recent years. When deploying CNNs on portable devices and embedded systems, however, the large number of parameters and computations result in long processing time and low battery life. An important factor in designing CNN hardware accelerators is to efficiently map the convolution computation onto hardware resources. In addition, to save battery life and reduce energy consumption, it is essential to reduce the number of DRAM accesses since DRAM consumes orders of magnitude more energy compared to other operations in hardware. In this paper, we propose an energy-efficient architecture which maximally utilizes its computational units for convolution operations while requiring a low number of DRAM accesses. The implementation results show that the proposed architecture performs one image recognition task using the VGGNet model with a latency of 393 ms and only 251.5 MB of DRAM accesses. △ Less

Submitted 14 February, 2020; originally announced February 2020.

Comments: 4 pages

arXiv:1911.00451 [pdf, other]

doi 10.1109/3DV.2019.00067

Surface Reconstruction from 3D Line Segments

Authors: Pierre-Alain Langlois, Alexandre Boulch, Renaud Marlet

Abstract: In man-made environments such as indoor scenes, when point-based 3D reconstruction fails due to the lack of texture, lines can still be detected and used to support surfaces. We present a novel method for watertight piecewise-planar surface reconstruction from 3D line segments with visibility information. First, planes are extracted by a novel RANSAC approach for line segments that allows multiple… ▽ More In man-made environments such as indoor scenes, when point-based 3D reconstruction fails due to the lack of texture, lines can still be detected and used to support surfaces. We present a novel method for watertight piecewise-planar surface reconstruction from 3D line segments with visibility information. First, planes are extracted by a novel RANSAC approach for line segments that allows multiple shape support. Then, each 3D cell of a plane arrangement is labeled full or empty based on line attachment to planes, visibility and regularization. Experiments show the robustness to sparse input data, noise and outliers. △ Less

Submitted 1 November, 2019; originally announced November 2019.

Comments: In 3DV 2019 (Oral)

arXiv:1906.05105 [pdf, other]

Pose from Shape: Deep Pose Estimation for Arbitrary 3D Objects

Authors: Yang Xiao, Xuchong Qiu, Pierre-Alain Langlois, Mathieu Aubry, Renaud Marlet

Abstract: Most deep pose estimation methods need to be trained for specific object instances or categories. In this work we propose a completely generic deep pose estimation approach, which does not require the network to have been trained on relevant categories, nor objects in a category to have a canonical pose. We believe this is a crucial step to design robotic systems that can interact with new objects… ▽ More Most deep pose estimation methods need to be trained for specific object instances or categories. In this work we propose a completely generic deep pose estimation approach, which does not require the network to have been trained on relevant categories, nor objects in a category to have a canonical pose. We believe this is a crucial step to design robotic systems that can interact with new objects in the wild not belonging to a predefined category. Our main insight is to dynamically condition pose estimation with a representation of the 3D shape of the target object. More precisely, we train a Convolutional Neural Network that takes as input both a test image and a 3D model, and outputs the relative 3D pose of the object in the input image with respect to the 3D model. We demonstrate that our method boosts performances for supervised category pose estimation on standard benchmarks, namely Pascal3D+, ObjectNet3D and Pix3D, on which we provide results superior to the state of the art. More importantly, we show that our network trained on everyday man-made objects from ShapeNet generalizes without any additional training to completely new types of 3D objects by providing results on the LINEMOD dataset as well as on natural entities such as animals from ImageNet. △ Less

Submitted 5 August, 2019; v1 submitted 12 June, 2019; originally announced June 2019.

arXiv:1903.06693 [pdf, ps, other]

Module-per-Object: a Human-Driven Methodology for C++-based High-Level Synthesis Design

Authors: Jeferson Santiago da Silva, François-Raymond Boyer, J. M. Pierre Langlois

Abstract: High-Level Synthesis (HLS) brings FPGAs to audiences previously unfamiliar to hardware design. However, achieving the highest Quality-of-Results (QoR) with HLS is still unattainable for most programmers. This requires detailed knowledge of FPGA architecture and hardware design in order to produce FPGA-friendly codes. Moreover, these codes are normally in conflict with best coding practices, which… ▽ More High-Level Synthesis (HLS) brings FPGAs to audiences previously unfamiliar to hardware design. However, achieving the highest Quality-of-Results (QoR) with HLS is still unattainable for most programmers. This requires detailed knowledge of FPGA architecture and hardware design in order to produce FPGA-friendly codes. Moreover, these codes are normally in conflict with best coding practices, which favor code reuse, modularity, and conciseness. To overcome these limitations, we propose Module-per-Object (MpO), a human-driven HLS design methodology intended for both hardware designers and software developers with limited FPGA expertise. MpO exploits modern C++ to raise the abstraction level while improving QoR, code readability and modularity. To guide HLS designers, we present the five characteristics of MpO classes. Each characteristic exploits the power of HLS-supported modern C++ features to build C++-based hardware modules. These characteristics lead to high-quality software descriptions and efficient hardware generation. We also present a use case of MpO, where we use C++ as the intermediate language for FPGA-targeted code generation from P4, a packet processing domain specific language. The MpO methodology is evaluated using three design experiments: a packet parser, a flow-based traffic manager, and a digital up-converter. Based on experiments, we show that MpO can be comparable to hand-written VHDL code while kee** a high abstraction level, human-readable coding style and modularity. Compared to traditional C-based HLS design, MpO leads to more efficient circuit generation, both in terms of performance and resource utilization. Also, the MpO approach notably improves software quality, augmenting parametrization while eliminating the incidence of code duplication. △ Less

Submitted 9 April, 2019; v1 submitted 4 March, 2019; originally announced March 2019.

Comments: 9 pages. Paper accepted for publication at The 27th IEEE International Symposium on Field-Programmable Custom Computing Machines, San Diego CA, April 28 - May 1, 2019

arXiv:1711.09155 [pdf, other]

SHIP: A Scalable High-performance IPv6 Lookup Algorithm that Exploits Prefix Characteristics

Authors: Thibaut Stimpfling, Normand Bélanger, J. M. Pierre Langlois, Yvon Savaria

Abstract: Due to the emergence of new network applications, current IP lookup engines must support high-bandwidth, low lookup latency and the ongoing growth of IPv6 networks. However, existing solutions are not designed to address jointly those three requirements. This paper introduces SHIP, an IPv6 lookup algorithm that exploits prefix characteristics to build a two-level data structure designed to meet fu… ▽ More Due to the emergence of new network applications, current IP lookup engines must support high-bandwidth, low lookup latency and the ongoing growth of IPv6 networks. However, existing solutions are not designed to address jointly those three requirements. This paper introduces SHIP, an IPv6 lookup algorithm that exploits prefix characteristics to build a two-level data structure designed to meet future application requirements. Using both prefix length distribution and prefix density, SHIP first clusters prefixes into groups sharing similar characteristics, then it builds a hybrid trie-tree for each prefix group. The compact and scalable data structure built can be stored in on-chip low-latency memories, and allows the traversal process to be parallelized and pipelined at each level in order to support high packet bandwidth. Evaluated on real and synthetic prefix tables holding up to 580 k IPv6 prefixes, SHIP has a logarithmic scaling factor in terms of the number of memory accesses, and a linear memory consumption scaling. Using the largest synthetic prefix table, simulations show that compared to other well-known approaches, SHIP uses at least 44% less memory per prefix, while reducing the memory latency by 61%. △ Less

Submitted 24 November, 2017; originally announced November 2017.

Comments: Submitted to EEE/ACM Transactions on Networking

arXiv:1711.06613 [pdf, other]

doi 10.1145/3174243.3174270

P4-compatible High-level Synthesis of Low Latency 100 Gb/s Streaming Packet Parsers in FPGAs

Authors: Jeferson Santiago da Silva, François-Raymond Boyer, J. M. Pierre Langlois

Abstract: Packet parsing is a key step in SDN-aware devices. Packet parsers in SDN networks need to be both reconfigurable and fast, to support the evolving network protocols and the increasing multi-gigabit data rates. The combination of packet processing languages with FPGAs seems to be the perfect match for these requirements. In this work, we develop an open-source FPGA-based configurable architecture f… ▽ More Packet parsing is a key step in SDN-aware devices. Packet parsers in SDN networks need to be both reconfigurable and fast, to support the evolving network protocols and the increasing multi-gigabit data rates. The combination of packet processing languages with FPGAs seems to be the perfect match for these requirements. In this work, we develop an open-source FPGA-based configurable architecture for arbitrary packet parsing to be used in SDN networks. We generate low latency and high-speed streaming packet parsers directly from a packet processing program. Our architecture is pipelined and entirely modeled using templated C++ classes. The pipeline layout is derived from a parser graph that corresponds a P4 code after a series of graph transformation rounds. The RTL code is generated from the C++ description using Xilinx Vivado HLS and synthesized with Xilinx Vivado. Our architecture achieves 100 Gb/s data rate in a Xilinx Virtex-7 FPGA while reducing the latency by 45% and the LUT usage by 40% compared to the state-of-the-art. △ Less

Submitted 17 November, 2017; originally announced November 2017.

Comments: Accepted for publication at the 26th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays February 25 - 27, 2018 Monterey Marriott Hotel, Monterey, California, 7 pages, 7 figures, 1 table

arXiv:1612.09524 [pdf]

Memory Efficient Multi-Scale Line Detector Architecture for Retinal Blood Vessel Segmentation

Authors: Hamza Bendaoudi, Farida Cheriet, J. M. Pierre Langlois

Abstract: This paper presents a memory efficient architecture that implements the Multi-Scale Line Detector (MSLD) algorithm for real-time retinal blood vessel detection in fundus images on a Zynq FPGA. This implementation benefits from the FPGA parallelism to drastically reduce the memory requirements of the MSLD from two images to a few values. The architecture is optimized in terms of resource utilizatio… ▽ More This paper presents a memory efficient architecture that implements the Multi-Scale Line Detector (MSLD) algorithm for real-time retinal blood vessel detection in fundus images on a Zynq FPGA. This implementation benefits from the FPGA parallelism to drastically reduce the memory requirements of the MSLD from two images to a few values. The architecture is optimized in terms of resource utilization by reusing the computations and optimizing the bit-width. The throughput is increased by designing fully pipelined functional units. The architecture is capable of achieving a comparable accuracy to its software implementation but 70x faster for low resolution images. For high resolution images, it achieves an acceleration by a factor of 323x. △ Less

Submitted 6 December, 2016; originally announced December 2016.

Comments: This paper was accepted and presented at Conference on Design and Architectures for Signal and Image Processing - DASIP 2016

arXiv:1611.05943 [pdf, ps, other]

Extern Objects in P4: an ROHC Header Compression Scheme Case Study

Authors: Jeferson Santiago da Silva, François-Raymond Boyer, Laurent-Olivier Chiquette, J. M. Pierre Langlois

Abstract: P4 is an emergent packet-processing language with which the user can describe how the packets are to be processed in a switching element. This paper presents a way to implement complex operations that are not natively supported in P4. In this work, we explored two different methods to add extensions to P4: i) using new native primitives and ii) using extern instances. As a case study, an ROHC enti… ▽ More P4 is an emergent packet-processing language with which the user can describe how the packets are to be processed in a switching element. This paper presents a way to implement complex operations that are not natively supported in P4. In this work, we explored two different methods to add extensions to P4: i) using new native primitives and ii) using extern instances. As a case study, an ROHC entity was implemented and invoked in a P4 program. The tests showed similar relative performance in both methods in terms of normalized packet latency. However, extern instances appear to be more suitable for target-specific switching applications, where the manufacturer/vendor can specify its own specific operations without changes in the P4 syntax and semantics. Extern instances only require changes in the target-specific backend compiler while kee** the P4 frontend compiler unchanged. The use of externs also results in a more elegant code solution since they are implemented outside the switch-core, thus reducing side effects risks that can be caused by a modification in a switch pipeline implementation. △ Less

Submitted 21 March, 2018; v1 submitted 17 November, 2016; originally announced November 2016.

Comments: 6 pages, 4 figures, 3 listings

arXiv:1609.07750 [pdf, other]

Accurate and Efficient Hyperbolic Tangent Activation Function on FPGA using the DCT Interpolation Filter

Authors: Ahmed M. Abdelsalam, J. M. Pierre Langlois, F. Cheriet

Abstract: Implementing an accurate and fast activation function with low cost is a crucial aspect to the implementation of Deep Neural Networks (DNNs) on FPGAs. We propose a high-accuracy approximation approach for the hyperbolic tangent activation function of artificial neurons in DNNs. It is based on the Discrete Cosine Transform Interpolation Filter (DCTIF). The proposed architecture combines simple arit… ▽ More Implementing an accurate and fast activation function with low cost is a crucial aspect to the implementation of Deep Neural Networks (DNNs) on FPGAs. We propose a high-accuracy approximation approach for the hyperbolic tangent activation function of artificial neurons in DNNs. It is based on the Discrete Cosine Transform Interpolation Filter (DCTIF). The proposed architecture combines simple arithmetic operations on stored samples of the hyperbolic tangent function and on input data. The proposed DCTIF implementation achieves two orders of magnitude greater precision than previous work while using the same or fewer computational resources. Various combinations of DCTIF parameters can be chosen to tradeoff the accuracy and complexity of the hyperbolic tangent function. In one case, the proposed architecture approximates the hyperbolic tangent activation function with 10E-5 maximum error while requiring only 1.52 Kbits memory and 57 LUTs of a Virtex-7 FPGA. We also discuss how the activation function accuracy affects the performance of DNNs in terms of their training and testing accuracies. We show that a high accuracy approximation can be necessary in order to maintain the same DNN training and testing performances realized by the exact function. △ Less

Submitted 25 September, 2016; originally announced September 2016.

Comments: 8 pages, 6 figures, 5 tables, submitted for the 25th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (ISFPGA), 22-24 February 2017, California, USA

arXiv:1302.0511 [pdf, other]

doi 10.1109/ISCAS.1997.621829

A CMOS Tailed Tent Map for the Generation of Uniformly Distributed Chaotic Sequences

Authors: Sergio Callegari, Gianluca Setti, Peter J. Langlois

Abstract: This paper describes the design of a modified tent map characterized by a uniform probability density function. The use of this map is proposed as an alternative to the tent map and the Bernoulli shift. It is shown that practical circuits implementing the latter two maps may possess parasitic stable equilibria, fact which would prevent the desired chaotic behavior of the system. On the other hand,… ▽ More This paper describes the design of a modified tent map characterized by a uniform probability density function. The use of this map is proposed as an alternative to the tent map and the Bernoulli shift. It is shown that practical circuits implementing the latter two maps may possess parasitic stable equilibria, fact which would prevent the desired chaotic behavior of the system. On the other hand, commonly used strategies to avoid the parasitic equilibria onset also affect the uniformity of the probability density function. Conversely, the use of the proposed tailed tent map allows to assure a certain degree of parameter deviation robustness, without compromising on the statistical properties of the system. △ Less

Submitted 26 September, 2014; v1 submitted 3 February, 2013; originally announced February 2013.

Journal ref: Proceedings of 1997 IEEE International Symposium on Circuits and Systems, 1997. ISCAS '97., vol 2. pages 781 - 784

arXiv:1201.5472 [pdf, other]

A multiagent urban traffic simulation

Authors: Pierrick Tranouez, Eric Daudé, Patrice Langlois

Abstract: We built a multiagent simulation of urban traffic to model both ordinary traffic and emergency or crisis mode traffic. This simulation first builds a modeled road network based on detailed geographical information. On this network, the simulation creates two populations of agents: the Transporters and the Mobiles. Transporters embody the roads themselves; they are utilitarian and meant to handle t… ▽ More We built a multiagent simulation of urban traffic to model both ordinary traffic and emergency or crisis mode traffic. This simulation first builds a modeled road network based on detailed geographical information. On this network, the simulation creates two populations of agents: the Transporters and the Mobiles. Transporters embody the roads themselves; they are utilitarian and meant to handle the low level realism of the simulation. Mobile agents embody the vehicles that circulate on the network. They have one or several destinations they try to reach using initially their beliefs of the structure of the network (length of the edges, speed limits, number of lanes etc.). Nonetheless, when confronted to a dynamic, emergent prone environment (other vehicles, unexpectedly closed ways or lanes, traffic jams etc.), the rather reactive agent will activate more cognitive modules to adapt its beliefs, desires and intentions. It may change its destination(s), change the tactics used to reach the destination (favoring less used roads, following other agents, using general headings), etc. We describe our current validation of our model and the next planned improvements, both in validation and in functionalities. △ Less

Submitted 26 January, 2012; originally announced January 2012.

Comments: arXiv admin note: significant text overlap with arXiv:0909.1021 and arXiv:0910.1026

Journal ref: Journal of Nonlinear Systems and Applications 1, 3 (2010) 9 pp (in print)

arXiv:0910.1026 [pdf]

A multiagent urban traffic simulation. Part II: dealing with the extraordinary

Authors: Eric Daudé, Pierrick Tranouez, Patrice Langlois

Abstract: In Probabilistic Risk Management, risk is characterized by two quantities: the magnitude (or severity) of the adverse consequences that can potentially result from the given activity or action, and by the likelihood of occurrence of the given adverse consequences. But a risk seldom exists in isolation: chain of consequences must be examined, as the outcome of one risk can increase the likelihood… ▽ More In Probabilistic Risk Management, risk is characterized by two quantities: the magnitude (or severity) of the adverse consequences that can potentially result from the given activity or action, and by the likelihood of occurrence of the given adverse consequences. But a risk seldom exists in isolation: chain of consequences must be examined, as the outcome of one risk can increase the likelihood of other risks. Systemic theory must complement classic PRM. Indeed these chains are composed of many different elements, all of which may have a critical importance at many different levels. Furthermore, when urban catastrophes are envisioned, space and time constraints are key determinants of the workings and dynamics of these chains of catastrophes: models must include a correct spatial topology of the studied risk. Finally, literature insists on the importance small events can have on the risk on a greater scale: urban risks management models belong to self-organized criticality theory. We chose multiagent systems to incorporate this property in our model: the behavior of an agent can transform the dynamics of important groups of them. △ Less

Submitted 6 October, 2009; originally announced October 2009.

Journal ref: ICCSA 2009, France (2009)

arXiv:0909.1021 [pdf]

A multiagent urban traffic simulation Part I: dealing with the ordinary

Authors: Pierrick Tranouez, Patrice Langlois, Eric Daudé

Abstract: We describe in this article a multiagent urban traffic simulation, as we believe individual-based modeling is necessary to encompass the complex influence the actions of an individual vehicle can have on the overall flow of vehicles. We first describe how we build a graph description of the network from purely geometric data, ESRI shapefiles. We then explain how we include traffic related data t… ▽ More We describe in this article a multiagent urban traffic simulation, as we believe individual-based modeling is necessary to encompass the complex influence the actions of an individual vehicle can have on the overall flow of vehicles. We first describe how we build a graph description of the network from purely geometric data, ESRI shapefiles. We then explain how we include traffic related data to this graph. We go on after that with the model of the vehicle agents: origin and destination, driving behavior, multiple lanes, crossroads, and interactions with the other vehicles in day-to-day, ?ordinary? traffic. We conclude with the presentation of the resulting simulation of this model on the Rouen agglomeration. △ Less

Submitted 5 September, 2009; originally announced September 2009.

Journal ref: ICCSA 2009, France (2009)

arXiv:cs/0610122 [pdf, ps, other]

Faithful Polynomial Evaluation with Compensated Horner Algorithm

Authors: Philippe Langlois, Nicolas Louvet

Abstract: This paper presents two sufficient conditions to ensure a faithful evaluation of polynomial in IEEE-754 floating point arithmetic. Faithfulness means that the computed value is one of the two floating point neighbours of the exact result; it can be satisfied using a more accurate algorithm than the classic Horner scheme. One condition here provided is an apriori bound of the polynomial condition… ▽ More This paper presents two sufficient conditions to ensure a faithful evaluation of polynomial in IEEE-754 floating point arithmetic. Faithfulness means that the computed value is one of the two floating point neighbours of the exact result; it can be satisfied using a more accurate algorithm than the classic Horner scheme. One condition here provided is an apriori bound of the polynomial condition number derived from the error analysis of the compensated Horner algorithm. The second condition is both dynamic and validated to check at the running time the faithfulness of a given evaluation. Numerical experiments illustrate the behavior of these two conditions and that associated running time over-cost is really interesting. △ Less

Submitted 20 October, 2006; originally announced October 2006.

ACM Class: G.4

Showing 1–24 of 24 results for author: Langlois, P