Search | arXiv e-print repository

Resilient and Secure Programmable System-on-Chip Accelerator Offload

Authors: Inês Pinto Gouveia, Ahmad T. Sheikh, Ali Shoker, Suhaib A. Fahmy, Paulo Esteves-Verissimo

Abstract: Computational offload to hardware accelerators is gaining traction due to increasing computational demands and efficiency challenges. Programmable hardware, like FPGAs, offers a promising platform in rapidly evolving application areas, with the benefits of hardware acceleration and software programmability. Unfortunately, such systems composed of multiple hardware components must consider integrit… ▽ More Computational offload to hardware accelerators is gaining traction due to increasing computational demands and efficiency challenges. Programmable hardware, like FPGAs, offers a promising platform in rapidly evolving application areas, with the benefits of hardware acceleration and software programmability. Unfortunately, such systems composed of multiple hardware components must consider integrity in the case of malicious components. In this work, we propose Samsara, the first secure and resilient platform that derives, from Byzantine Fault Tolerant (BFT), protocols to enhance the computing resilience of programmable hardware. Samsara uses a novel lightweight hardware-based BFT protocol for Systems-on-Chip, called H-Quorum, that implements the theoretical-minimum latency between applications and replicated compute nodes. To withstand malicious behaviors, Samsara supports hardware rejuvenation, which is used to replace, relocate, or diversify faulty compute nodes. Samsara's architecture ensures the security of the entire workflow while kee** the latency overhead, of both computation and rejuvenation, close to the non-replicated counterpart. △ Less

Submitted 26 June, 2024; originally announced June 2024.

Comments: To be published in The 43rd International Symposium on Reliable Distributed Systems (SRDS 2024)

arXiv:2201.03950 [pdf, other]

High Throughput Multidimensional Tridiagonal Systems Solvers on FPGAs

Authors: Kamalavasan Kamalakkannan, Istvan Z. Reguly, Suhaib A. Fahmy, Gihan R. Mudalige

Abstract: We present a design space exploration for synthesizing optimized, high-throughput implementations of multiple multi-dimensional tridiagonal system solvers on FPGAs. Re-evaluating the characteristics of algorithms for the direct solution of tridiagonal systems, we develop a new tridiagonal solver library aimed at implementing high-performance computing applications on Xilinx FPGA hardware. Key new… ▽ More We present a design space exploration for synthesizing optimized, high-throughput implementations of multiple multi-dimensional tridiagonal system solvers on FPGAs. Re-evaluating the characteristics of algorithms for the direct solution of tridiagonal systems, we develop a new tridiagonal solver library aimed at implementing high-performance computing applications on Xilinx FPGA hardware. Key new features of the library are (1) the unification of standard state-of-the-art techniques for implementing implicit numerical solvers with a number of novel high-gain optimizations such as vectorization and batching, motivated by multi-dimensional systems in real-world applications, (2) data-flow techniques that provide application specific optimizations for both 2D and 3D problems, including integration of explicit loops commonplace in real workloads, and (3) the development of an analytic model to explore the design space, and obtain rapid performance estimates. The new library provide an order of magnitude better performance for solving large batches of systems compared to Xilinx's current tridiagonal solver library. Two representative applications are implemented using the new solver on a Xilinx Alveo U280 FPGA, demonstrating over 85% predictive model accuracy. These are compared with a current state-of-the-art GPU library for solving multi-dimensional tridiagonal systems on an Nvidia V100 GPU, analyzing time to solution, bandwidth, and energy consumption. Results show the FPGAs achieving competitive or better runtime performance for a range of multi-dimensional problems compared to the V100 GPU. Additionally, the significant energy savings offered by FPGA implementations, over 30% for the most complex application, are quantified. We discuss the algorithmic trade-offs required to obtain good performance on FPGAs, giving insights into the feasibility and profitability of FPGA implementations. △ Less

Submitted 11 January, 2022; originally announced January 2022.

Comments: Under review

arXiv:2111.01108 [pdf, other]

doi 10.1145/3552326.3567485

Resource-Efficient Federated Learning

Authors: Ahmed M. Abdelmoniem, Atal Narayan Sahu, Marco Canini, Suhaib A. Fahmy

Abstract: Federated Learning (FL) enables distributed training by learners using local data, thereby enhancing privacy and reducing communication. However, it presents numerous challenges relating to the heterogeneity of the data distribution, device capabilities, and participant availability as deployments scale, which can impact both model convergence and bias. Existing FL schemes use random participant s… ▽ More Federated Learning (FL) enables distributed training by learners using local data, thereby enhancing privacy and reducing communication. However, it presents numerous challenges relating to the heterogeneity of the data distribution, device capabilities, and participant availability as deployments scale, which can impact both model convergence and bias. Existing FL schemes use random participant selection to improve fairness; however, this can result in inefficient use of resources and lower quality training. In this work, we systematically address the question of resource efficiency in FL, showing the benefits of intelligent participant selection, and incorporation of updates from straggling participants. We demonstrate how these factors enable resource efficiency while also improving trained model quality. △ Less

Submitted 4 November, 2022; v1 submitted 1 November, 2021; originally announced November 2021.

Comments: Accepted to appear in ACM EuroSys 2023

arXiv:2101.01177 [pdf, other]

High-Level FPGA Accelerator Design for Structured-Mesh-Based Explicit Numerical Solvers

Authors: Kamalavasan Kamalakkannan, Gihan R. Mudalige, Istvan Z. Reguly, Suhaib A. Fahmy

Abstract: This paper presents a workflow for synthesizing near-optimal FPGA implementations for structured-mesh based stencil applications for explicit solvers. It leverages key characteristics of the application class, its computation-communication pattern, and the architectural capabilities of the FPGA to accelerate solvers from the high-performance computing domain. Key new features of the workflow are (… ▽ More This paper presents a workflow for synthesizing near-optimal FPGA implementations for structured-mesh based stencil applications for explicit solvers. It leverages key characteristics of the application class, its computation-communication pattern, and the architectural capabilities of the FPGA to accelerate solvers from the high-performance computing domain. Key new features of the workflow are (1) the unification of standard state-of-the-art techniques with a number of high-gain optimizations such as batching and spatial blocking/tiling, motivated by increasing throughput for real-world work loads and (2) the development and use of a predictive analytic model for exploring the design space, resource estimates and performance. Three representative applications are implemented using the design workflow on a Xilinx Alveo U280 FPGA, demonstrating near-optimal performance and over 85% predictive model accuracy. These are compared with equivalent highly-optimized implementations of the same applications on modern HPC-grade GPUs (Nvidia V100) analyzing time to solution, bandwidth and energy consumption. Performance results indicate equivalent runtime performance of the FPGA implementations to the V100 GPU, with over 2x energy savings, for the largest non-trivial application synthesized on the FPGA compared to the best performing GPU-based solution. Our investigation shows the considerable challenges in gaining high performance on current generation FPGAs compared to traditional architectures. We discuss determinants for a given stencil code to be amenable to FPGA implementation, providing insights into the feasibility and profitability of a design and its resulting performance. △ Less

Submitted 7 January, 2021; v1 submitted 4 January, 2021; originally announced January 2021.

Comments: Preprint - Accepted to the 35th IEEE International Parallel and Distributed Processing Symposium (IPDPS 2021), May 2021, Portland, Oregon USA

arXiv:1710.05154 [pdf, other]

High Throughput 2D Spatial Image Filters on FPGAs

Authors: Abdullah Al-Dujaili, Suhaib A. Fahmy

Abstract: FPGAs are well established in the signal processing domain, where their fine-grained programmable nature allows the inherent parallelism in these applications to be exploited for enhanced performance. As architectures have evolved, FPGA vendors have added more heterogeneous resources to allow often-used functions to be implemented with higher performance, at lower power and using less area. DSP bl… ▽ More FPGAs are well established in the signal processing domain, where their fine-grained programmable nature allows the inherent parallelism in these applications to be exploited for enhanced performance. As architectures have evolved, FPGA vendors have added more heterogeneous resources to allow often-used functions to be implemented with higher performance, at lower power and using less area. DSP blocks, for example, have evolved from basic multipliers to support the multiply-accumulate operations that are the core of many signal processing tasks. While more features were added to DSP blocks, their structure and connectivity has been optimised primarily for one-dimensional signal processing. Basic operations in image processing are similar, but performed in a two-dimensional structure, and hence, many of the optimisations in newer DSP blocks are not exploited when map** image processing algorithms to them. We present a detailed study of two-dimensional spatial filter implementation on FPGAs, showing how to maximise performance through exploitation of DSP block capabilities, while also presenting a lean border pixel management policy. △ Less

Submitted 17 October, 2017; v1 submitted 14 October, 2017; originally announced October 2017.

arXiv:1705.02730 [pdf, other]

Resource-Aware Just-in-Time OpenCL Compiler for Coarse-Grained FPGA Overlays

Authors: Abhishek Kumar Jain, Douglas L. Maskell, Suhaib A. Fahmy

Abstract: FPGA vendors have recently started focusing on OpenCL for FPGAs because of its ability to leverage the parallelism inherent to heterogeneous computing platforms. OpenCL allows programs running on a host computer to launch accelerator kernels which can be compiled at run-time for a specific architecture, thus enabling portability. However, the prohibitive compilation times (specifically the FPGA pl… ▽ More FPGA vendors have recently started focusing on OpenCL for FPGAs because of its ability to leverage the parallelism inherent to heterogeneous computing platforms. OpenCL allows programs running on a host computer to launch accelerator kernels which can be compiled at run-time for a specific architecture, thus enabling portability. However, the prohibitive compilation times (specifically the FPGA place and route times) are a major stumbling block when using OpenCL tools from FPGA vendors. The long compilation times mean that the tools cannot effectively use just-in-time (JIT) compilation or runtime performance scaling. Coarse-grained overlays represent a possible solution by virtue of their coarse granularity and fast compilation. In this paper, we present a methodology for run-time compilation of OpenCL kernels to a DSP block based coarse-grained overlay, rather than directly to the fine-grained FPGA fabric. The proposed methodology allows JIT compilation and on-demand resource-aware kernel replication to better utilize available overlay resources, raising the abstraction level while reducing compile times significantly. We further demonstrate that this approach can even be used for run-time compilation of OpenCL kernels on the ARM processor of the embedded heterogeneous Zynq device. △ Less

Submitted 7 May, 2017; originally announced May 2017.

Comments: Presented at 3rd International Workshop on Overlay Architectures for FPGAs (OLAF 2017) arXiv:1704.08802

Report number: OLAF/2017/02

arXiv:1703.03652 [pdf, other]

doi 10.1145/2960407

Security in Automotive Networks: Lightweight Authentication and Authorization

Authors: Philipp Mundhenk, Andrew Paverd, Artur Mrowca, Sebastian Steinhorst, Martin Lukasiewycz, Suhaib A. Fahmy, Samarjit Chakraborty

Abstract: With the increasing amount of interconnections between vehicles, the attack surface of internal vehicle networks is rising steeply. Although these networks are shielded against external attacks, they often do not have any internal security to protect against malicious components or adversaries who breach the network perimeter. To secure the in-vehicle network, all communicating components must be… ▽ More With the increasing amount of interconnections between vehicles, the attack surface of internal vehicle networks is rising steeply. Although these networks are shielded against external attacks, they often do not have any internal security to protect against malicious components or adversaries who breach the network perimeter. To secure the in-vehicle network, all communicating components must be authenticated, and only authorized components should be allowed to send and receive messages. This is achieved using an authentication framework. Cryptography is widely used to authenticate communicating parties and provide secure communication channels (e.g., Internet communication). However, the real-time performance requirements of in-vehicle networks restrict the types of cryptographic algorithms and protocols that may be used. In particular, asymmetric cryptography is computationally infeasible during vehicle operation. In this work, we address the challenges of designing authentication protocols for automotive systems. We present Lightweight Authentication for Secure Automotive Networks (LASAN), a full lifecycle authentication approach. We describe the core LASAN protocols and show how they protect the internal vehicle network while complying with the real-time constraints and low computational resources of this domain. Unlike previous work, we also explain how this framework can be integrated into all aspects of the automotive lifecycle, including manufacturing, vehicle maintenance, and software updates. We evaluate LASAN in two different ways: First, we analyze the security properties of the protocols using established protocol verification techniques based on formal methods. Second, we evaluate the timing requirements of LASAN and compare these to other frameworks using a new highly modular discrete event simulator for in-vehicle networks, which we have developed for this evaluation. △ Less

Submitted 15 March, 2017; v1 submitted 10 March, 2017; originally announced March 2017.

Comments: Authors' preprint of an article to appear in ACM Transactions on Design Automation of Electronic Systems (ACM TODAES) 2017

arXiv:1606.06460 [pdf]

An Area-Efficient FPGA Overlay using DSP Block based Time-multiplexed Functional Units

Authors: Xiangwei Li, Abhishek Jain, Douglas Maskell, Suhaib A. Fahmy

Abstract: Coarse grained overlay architectures improve FPGA design productivity by providing fast compilation and software-like programmability. Throughput oriented spatially configurable overlays typically suffer from area overheads due to the requirement of one functional unit for each compute kernel operation. Hence, these overlays have often been of limited size, supporting only relatively small compute… ▽ More Coarse grained overlay architectures improve FPGA design productivity by providing fast compilation and software-like programmability. Throughput oriented spatially configurable overlays typically suffer from area overheads due to the requirement of one functional unit for each compute kernel operation. Hence, these overlays have often been of limited size, supporting only relatively small compute kernels while consuming considerable FPGA resources. This paper examines the possibility of sharing the functional units among kernel operations for reducing area overheads. We propose a linear interconnected array of time-multiplexed FUs as an overlay architecture with reduced instruction storage and interconnect resource requirements, which uses a fully-pipelined, architecture-aware FU design supporting a fast context switching time. The results presented show a reduction of up to 85% in FPGA resource requirements compared to existing throughput oriented overlay architectures, with an operating frequency which approaches the theoretical limit for the FPGA device. △ Less

Submitted 21 June, 2016; originally announced June 2016.

Comments: Presented at 2nd International Workshop on Overlay Architectures for FPGAs (OLAF 2016) arXiv:1605.08149

Report number: OLAF/2016/02

Showing 1–8 of 8 results for author: Fahmy, S A