A Survey on Design Methodologies for Accelerating Deep Learning on Heterogeneous Architectures

Authors: Fabrizio Ferrandi, Serena Curzel, Leandro Fiorin, Daniele Ielmini, Cristina Silvano, Francesco Conti, Alessio Burrello, Francesco Barchi, Luca Benini, Luciano Lavagno, Teodoro Urso, Enrico Calore, Sebastiano Fabio Schifano, Cristian Zambelli, Maurizio Palesi, Giuseppe Ascia, Enrico Russo, Nicola Petra, Davide De Caro, Gennaro Di Meo, Valeria Cardellini, Salvatore Filippone, Francesco Lo Presti, Francesco Silvestri, Paolo Palazzari, et al. (1 additional author not shown)

Abstract: In recent years, the field of Deep Learning has seen many disruptive and impactful advancements. Given the increasing complexity of deep neural networks, the need for efficient hardware accelerators has become more and more pressing to design heterogeneous HPC platforms. The design of Deep Learning accelerators requires a multidisciplinary approach, combining expertise from several areas, spanning from computer architecture to approximate computing, computational models, and machine learning algorithms. Several methodologies and tools have been proposed to design accelerators for Deep Learning, including hardware-software co-design approaches, high-level synthesis methods, specific customized compilers, and methodologies for design space exploration, modeling, and simulation. These methodologies aim to maximize the exploitable parallelism and minimize data movement to achieve high performance and energy efficiency. This survey provides a holistic review of the most influential design methodologies and EDA tools proposed in recent years to implement Deep Learning accelerators, offering the reader a wide perspective in this rapidly evolving field. In particular, this work complements the previous survey proposed by the same authors in [203], which focuses on Deep Learning hardware accelerators for heterogeneous HPC platforms.

Submitted 29 November, 2023; originally announced November 2023.

arXiv:2306.15552 [pdf, other]

A Survey on Deep Learning Hardware Accelerators for Heterogeneous HPC Platforms

Authors: Cristina Silvano, Daniele Ielmini, Fabrizio Ferrandi, Leandro Fiorin, Serena Curzel, Luca Benini, Francesco Conti, Angelo Garofalo, Cristian Zambelli, Enrico Calore, Sebastiano Fabio Schifano, Maurizio Palesi, Giuseppe Ascia, Davide Patti, Nicola Petra, Davide De Caro, Luciano Lavagno, Teodoro Urso, Valeria Cardellini, Gian Carlo Cardarilli, Robert Birke, Stefania Perri

Abstract: Recent trends in deep learning (DL) imposed hardware accelerators as the most viable solution for several classes of high-performance computing (HPC) applications such as image classification, computer vision, and speech recognition. This survey summarizes and classifies the most recent advances in designing DL accelerators suitable to reach the performance requirements of HPC applications. In particular, it highlights the most advanced approaches to support deep learning accelerations including not only GPU and TPU-based accelerators but also design-specific hardware accelerators such as FPGA-based and ASIC-based accelerators, Neural Processing Units, open hardware RISC-V-based accelerators and co-processors. The survey also describes accelerators based on emerging memory technologies and computing paradigms, such as 3D-stacked Processor-In-Memory, non-volatile memories (mainly, Resistive RAM and Phase Change Memories) to implement in-memory computing, Neuromorphic Processing Units, and accelerators based on Multi-Chip Modules. Among emerging technologies, we also include some insights into quantum-based accelerators and photonics. To conclude, the survey classifies the most influential architectures and technologies proposed in the last years, with the purpose of offering the reader a comprehensive perspective in the rapidly evolving field of deep learning.

Submitted 12 July, 2024; v1 submitted 27 June, 2023; originally announced June 2023.

Comments: Preprint version of our manuscript submitted to the journal @ ACM CSUR (58 pages including Appendix) on June 22nd, 2023. Major revision submitted on July 12th, 2024

arXiv:1804.01918 [pdf, other]

doi 10.1007/978-3-319-78024-5_45

Early Experience on Using Knights Landing Processors for Lattice Boltzmann Applications

Authors: Enrico Calore, Alessandro Gabbana, Sebastiano Fabio Schifano, Raffaele Tripiccione

Abstract: The Knights Landing (KNL) is the codename for the latest generation of Intel processors based on Intel Many Integrated Core (MIC) architecture. It relies on massive thread and data parallelism, and fast on-chip memory. This processor operates in standalone mode, booting an off-the-shelf Linux operating system. The KNL peak performance is very high - approximately 3 Tflops in double precision and 6 Tflops in single precision - but sustained performance depends critically on how well all parallel features of the processor are exploited by real-life applications. We assess the performance of this processor for Lattice Boltzmann codes, widely used in computational fluid-dynamics. In our OpenMP code we consider several memory data-layouts that meet the conflicting computing requirements of distinct parts of the application, and sustain a large fraction of peak performance. We make some performance comparisons with other processors and accelerators, and also discuss the impact of the various memory layouts on energy efficiency.

Submitted 5 April, 2018; originally announced April 2018.

arXiv:1804.01911 [pdf, other]

doi 10.3233/978-1-61499-843-3-733

Energy-efficiency evaluation of Intel KNL for HPC workloads

Authors: E. Calore, A. Gabbana, S. F. Schifano, R. Tripiccione

Abstract: Energy consumption is increasingly becoming a limiting factor to the design of faster large-scale parallel systems, and development of energy-efficient and energy-aware applications is today a relevant issue for HPC code-developer communities. In this work we focus on energy performance of the Knights Landing (KNL) Xeon Phi, the latest many-core architecture processor introduced by Intel into the HPC market. We take into account the 64-core Xeon Phi 7230, and analyze its energy performance using both the on-chip MCDRAM and the regular DDR4 system memory as main storage for the application data-domain. As a benchmark application we use a Lattice Boltzmann code heavily optimized for this architecture and implemented using different memory data layouts to store its lattice. We then assess the energy consumption using different memory data-layouts, kinds of memory (DDR4 or MCDRAM) and numbers of threads per core.

Submitted 5 April, 2018; originally announced April 2018.

arXiv:1703.04594 [pdf, other]

doi 10.1177/1094342017703771

Optimization of Lattice Boltzmann Simulations on Heterogeneous Computers

Authors: E. Calore, A. Gabbana, S. F. Schifano, R. Tripiccione

Abstract: High-performance computing systems are more and more often based on accelerators. Computing applications targeting those systems often follow a host-driven approach in which hosts offload almost all compute-intensive sections of the code onto accelerators; this approach only marginally exploits the computational resources available on the host CPUs, limiting performance and energy efficiency. The obvious step forward is to run compute-intensive kernels in a concurrent and balanced way on both hosts and accelerators. In this paper we consider exactly this problem for a class of applications based on Lattice Boltzmann Methods, widely used in computational fluid-dynamics. Our goal is to develop just one program, portable and able to run efficiently on several different combinations of hosts and accelerators. To reach this goal, we define common data layouts enabling the code to exploit efficiently the different parallel and vector options of the various accelerators, and matching the possibly different requirements of the compute-bound and memory-bound kernels of the application. We also define models and metrics that predict the best partitioning of workloads among host and accelerator, and the optimally achievable overall performance level. We test the performance of our codes and their scaling properties using as testbeds HPC clusters incorporating different accelerators: Intel Xeon-Phi many-core processors, NVIDIA GPUs and AMD GPUs.

Submitted 14 March, 2017; originally announced March 2017.

arXiv:1703.02788 [pdf, ps, other]

doi 10.1002/cpe.4143

Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications

Authors: Enrico Calore, Alessandro Gabbana, Sebastiano Fabio Schifano, Raffaele Tripiccione

Abstract: Energy efficiency is becoming increasingly important for computing systems, in particular for large scale HPC facilities. In this work we evaluate, from a user perspective, the use of Dynamic Voltage and Frequency Scaling (DVFS) techniques, assisted by the power and energy monitoring capabilities of modern processors in order to tune applications for energy efficiency. We run selected kernels and a full HPC application on two high-end processors widely used in the HPC context, namely an NVIDIA K80 GPU and an Intel Haswell CPU. We evaluate the available trade-offs between energy-to-solution and time-to-solution, attempting a function-by-function frequency tuning. We finally estimate the benefits obtainable running the full code on an HPC multi-GPU node, with respect to default clock frequency governors. We instrument our code to accurately monitor power consumption and execution time without the need of any additional hardware, and we enable it to change CPUs and GPUs clock frequencies while running. We analyze our results on the different architectures using a simple energy-performance model, and derive a number of energy saving strategies which can be easily adopted on recent high-end HPC systems for generic applications.

Submitted 8 March, 2017; originally announced March 2017.

arXiv:1703.00186 [pdf, other]

doi 10.1002/cpe.3862

Performance and Portability of Accelerated Lattice Boltzmann Applications with OpenACC

Authors: E. Calore, A. Gabbana, J. Kraus, S. F. Schifano, R. Tripiccione

Abstract: An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power efficient accelerators. Designing efficient applications for these systems has been troublesome in the past as accelerators could usually be programmed using specific programming languages threatening maintainability, portability and correctness. Several new programming environments try to tackle this problem. Among them, OpenACC offers a high-level approach based on compiler directive clauses to mark regions of existing C, C++ or Fortran codes to run on accelerators. This approach directly addresses code portability, leaving to compilers the support of each different accelerator, but one has to carefully assess the relative costs of portable approaches versus computing efficiency. In this paper we address precisely this issue, using as a test-bench a massively parallel Lattice Boltzmann algorithm. We first describe our multi-node implementation and optimization of the algorithm, using OpenACC and MPI. We then benchmark the code on a variety of processors, including traditional CPUs and GPUs, and make accurate performance comparisons with other GPU implementations of the same algorithm using CUDA and OpenCL. We also assess the performance impact associated with portable programming, and the actual portability and performance-portability of OpenACC-based applications across several state-of-the-art architectures.

Submitted 1 March, 2017; originally announced March 2017.

arXiv:1703.00185 [pdf, other]

doi 10.1016/j.parco.2016.08.005

Massively parallel lattice-Boltzmann codes on large GPU clusters

Authors: E. Calore, A. Gabbana, J. Kraus, E. Pellegrini, S. F. Schifano, R. Tripiccione

Abstract: This paper describes a massively parallel code for a state-of-the-art thermal lattice-Boltzmann method. Our code has been carefully optimized for performance on one GPU and to have a good scaling behavior extending to a large number of GPUs. Versions of this code have been already used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they are able to deliver higher performance than traditional processors. Writing efficient programs for large clusters is not an easy task as codes must adapt to increasingly parallel architectures, and the overheads of node-to-node communications must be properly handled. We describe the structure of our code, discussing several key design choices that were guided by theoretical models of performance and experimental benchmarks. We present an extensive set of performance measurements and identify the corresponding main bottlenecks; finally we compare the results of our GPU code with those measured on other currently available high performance processors. Our results are a production-grade code able to deliver a sustained performance of several tens of Tflops as well as a design and optimization methodology that can be used for the development of other high performance applications for computational physics.

Submitted 1 March, 2017; originally announced March 2017.

arXiv:1611.04833 [pdf, other]

Steady State Visually Evoked Potentials detection using a single electrode consumer-grade EEG device for BCI applications

Authors: Enrico Calore

Abstract: Brain-Computer Interfaces (BCIs) implement a direct communication pathway between the brain of a user and an external device, such as a computer or a machine in general. One of the most used brain responses to implement non-invasive BCIs is the so-called steady-state visually evoked potential (SSVEP). This periodic response is generated when a user gazes at a light flickering at a constant frequency. The SSVEP response can be detected in the user's electroencephalogram (EEG) at the corresponding frequency of the attended flickering stimulus. In SSVEP-based BCIs, multiple stimuli, flickering at different frequencies, are commonly presented to the user, where each stimulus is associated with a command for an actuator. One of the limitations to a wider adoption of BCIs is given by the need for EEG acquisition devices and software tools which are commonly not meant for end-user usage. In this work, exploiting state-of-the-art software tools, a low-cost, easy-to-wear, single-electrode EEG device is demonstrated to be usable to implement simple SSVEP-based BCIs. The obtained results, although less impressive than the ones obtainable with professional EEG equipment, are interesting in view of practical low-cost BCI applications meant for end-users.

Submitted 15 November, 2016; originally announced November 2016.

Comments: Work conducted between 2013 and 2014

Showing 1–9 of 9 results for author: Calore, E