-
A Survey on Design Methodologies for Accelerating Deep Learning on Heterogeneous Architectures
Authors:
Fabrizio Ferrandi,
Serena Curzel,
Leandro Fiorin,
Daniele Ielmini,
Cristina Silvano,
Francesco Conti,
Alessio Burrello,
Francesco Barchi,
Luca Benini,
Luciano Lavagno,
Teodoro Urso,
Enrico Calore,
Sebastiano Fabio Schifano,
Cristian Zambelli,
Maurizio Palesi,
Giuseppe Ascia,
Enrico Russo,
Nicola Petra,
Davide De Caro,
Gennaro Di Meo,
Valeria Cardellini,
Salvatore Filippone,
Francesco Lo Presti,
Francesco Silvestri,
Paolo Palazzari,
et al. (1 additional author not shown)
Abstract:
In recent years, the field of Deep Learning has seen many disruptive and impactful advancements. Given the increasing complexity of deep neural networks, efficient hardware accelerators have become increasingly necessary in the design of heterogeneous HPC platforms. The design of Deep Learning accelerators requires a multidisciplinary approach, combining expertise from several areas, spanning from computer architecture to approximate computing, computational models, and machine learning algorithms. Several methodologies and tools have been proposed to design accelerators for Deep Learning, including hardware-software co-design approaches, high-level synthesis methods, specific customized compilers, and methodologies for design space exploration, modeling, and simulation. These methodologies aim to maximize the exploitable parallelism and minimize data movement to achieve high performance and energy efficiency. This survey provides a holistic review of the most influential design methodologies and EDA tools proposed in recent years to implement Deep Learning accelerators, offering the reader a wide perspective on this rapidly evolving field. In particular, this work complements the previous survey by the same authors [203], which focuses on Deep Learning hardware accelerators for heterogeneous HPC platforms.
Submitted 29 November, 2023;
originally announced November 2023.
-
A Survey on Deep Learning Hardware Accelerators for Heterogeneous HPC Platforms
Authors:
Cristina Silvano,
Daniele Ielmini,
Fabrizio Ferrandi,
Leandro Fiorin,
Serena Curzel,
Luca Benini,
Francesco Conti,
Angelo Garofalo,
Cristian Zambelli,
Enrico Calore,
Sebastiano Fabio Schifano,
Maurizio Palesi,
Giuseppe Ascia,
Davide Patti,
Nicola Petra,
Davide De Caro,
Luciano Lavagno,
Teodoro Urso,
Valeria Cardellini,
Gian Carlo Cardarilli,
Robert Birke,
Stefania Perri
Abstract:
Recent trends in deep learning (DL) have established hardware accelerators as the most viable solution for several classes of high-performance computing (HPC) applications such as image classification, computer vision, and speech recognition. This survey summarizes and classifies the most recent advances in designing DL accelerators able to meet the performance requirements of HPC applications. In particular, it highlights the most advanced approaches to support deep learning acceleration, including not only GPU and TPU-based accelerators but also design-specific hardware accelerators such as FPGA-based and ASIC-based accelerators, Neural Processing Units, open hardware RISC-V-based accelerators and co-processors. The survey also describes accelerators based on emerging memory technologies and computing paradigms, such as 3D-stacked Processor-In-Memory, non-volatile memories (mainly Resistive RAM and Phase Change Memories) to implement in-memory computing, Neuromorphic Processing Units, and accelerators based on Multi-Chip Modules. Among emerging technologies, we also include some insights into quantum-based accelerators and photonics. To conclude, the survey classifies the most influential architectures and technologies proposed in recent years, with the purpose of offering the reader a comprehensive perspective on the rapidly evolving field of deep learning.
Submitted 12 July, 2024; v1 submitted 27 June, 2023;
originally announced June 2023.
-
Early Experience on Using Knights Landing Processors for Lattice Boltzmann Applications
Authors:
Enrico Calore,
Alessandro Gabbana,
Sebastiano Fabio Schifano,
Raffaele Tripiccione
Abstract:
Knights Landing (KNL) is the codename for the latest generation of Intel processors based on the Intel Many Integrated Core (MIC) architecture. It relies on massive thread and data parallelism, and fast on-chip memory. This processor operates in standalone mode, booting an off-the-shelf Linux operating system. The KNL peak performance is very high (approximately 3 Tflops in double precision and 6 Tflops in single precision), but sustained performance depends critically on how well real-life applications exploit all parallel features of the processor. We assess the performance of this processor for Lattice Boltzmann codes, widely used in computational fluid dynamics. In our OpenMP code we consider several memory data layouts that meet the conflicting computing requirements of distinct parts of the application and sustain a large fraction of peak performance. We compare performance with other processors and accelerators, and also discuss the impact of the various memory layouts on energy efficiency.
Submitted 5 April, 2018;
originally announced April 2018.
-
Energy-efficiency evaluation of Intel KNL for HPC workloads
Authors:
E. Calore,
A. Gabbana,
S. F. Schifano,
R. Tripiccione
Abstract:
Energy consumption is increasingly becoming a limiting factor in the design of faster large-scale parallel systems, and the development of energy-efficient and energy-aware applications is today a relevant issue for HPC code-developer communities. In this work we focus on the energy performance of the Knights Landing (KNL) Xeon Phi, the latest many-core architecture processor introduced by Intel into the HPC market. We consider the 64-core Xeon Phi 7230 and analyze its energy performance using both the on-chip MCDRAM and the regular DDR4 system memory as main storage for the application data domain. As a benchmark application we use a Lattice Boltzmann code heavily optimized for this architecture and implemented using different memory data layouts to store its lattice. We then assess the energy consumption as a function of memory data layout, kind of memory (DDR4 or MCDRAM), and number of threads per core.
Submitted 5 April, 2018;
originally announced April 2018.
-
Optimization of Lattice Boltzmann Simulations on Heterogeneous Computers
Authors:
E. Calore,
A. Gabbana,
S. F. Schifano,
R. Tripiccione
Abstract:
High-performance computing systems are increasingly based on accelerators. Computing applications targeting those systems often follow a host-driven approach in which hosts offload almost all compute-intensive sections of the code onto accelerators; this approach only marginally exploits the computational resources available on the host CPUs, limiting performance and energy efficiency. The obvious step forward is to run compute-intensive kernels in a concurrent and balanced way on both hosts and accelerators. In this paper we consider exactly this problem for a class of applications based on Lattice Boltzmann Methods, widely used in computational fluid dynamics. Our goal is to develop just one program, portable and able to run efficiently on several different combinations of hosts and accelerators. To reach this goal, we define common data layouts that enable the code to efficiently exploit the different parallel and vector options of the various accelerators, and that match the possibly different requirements of the compute-bound and memory-bound kernels of the application. We also define models and metrics that predict the best partitioning of workloads between host and accelerator, and the best achievable overall performance level. We test the performance of our codes and their scaling properties using as testbeds HPC clusters incorporating different accelerators: Intel Xeon-Phi many-core processors, NVIDIA GPUs and AMD GPUs.
Submitted 14 March, 2017;
originally announced March 2017.
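The idea of balancing a kernel between host and accelerator, mentioned in the abstract above, can be illustrated with a very simple model. This is only a sketch under stated assumptions, not the model defined in the paper: it assumes each device processes work at a constant rate, ignores communication costs, and uses hypothetical names (`balanced_split`, `run_time`) and made-up rates.

```python
def balanced_split(host_rate, accel_rate):
    """Fraction of the work to assign to the host so that host and
    accelerator finish at the same time (constant-rate assumption)."""
    return host_rate / (host_rate + accel_rate)


def run_time(work, host_rate, accel_rate, host_fraction):
    """Wall-clock time when host and accelerator run concurrently:
    the slower of the two parts determines the total."""
    host_time = host_fraction * work / host_rate
    accel_time = (1.0 - host_fraction) * work / accel_rate
    return max(host_time, accel_time)


# Hypothetical rates: the accelerator is three times faster than the host.
host_rate, accel_rate, work = 1.0, 3.0, 100.0
fraction = balanced_split(host_rate, accel_rate)
offload_only = run_time(work, host_rate, accel_rate, 0.0)
balanced = run_time(work, host_rate, accel_rate, fraction)
```

Under these assumptions the balanced split gives the host one quarter of the work, and the concurrent run is shorter than offloading everything to the accelerator.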
-
Evaluation of DVFS techniques on modern HPC processors and accelerators for energy-aware applications
Authors:
Enrico Calore,
Alessandro Gabbana,
Sebastiano Fabio Schifano,
Raffaele Tripiccione
Abstract:
Energy efficiency is becoming increasingly important for computing systems, in particular for large-scale HPC facilities. In this work we evaluate, from a user perspective, the use of Dynamic Voltage and Frequency Scaling (DVFS) techniques, assisted by the power and energy monitoring capabilities of modern processors, in order to tune applications for energy efficiency. We run selected kernels and a full HPC application on two high-end processors widely used in the HPC context, namely an NVIDIA K80 GPU and an Intel Haswell CPU. We evaluate the available trade-offs between energy-to-solution and time-to-solution, attempting a function-by-function frequency tuning. Finally, we estimate the benefits obtainable by running the full code on an HPC multi-GPU node, with respect to default clock frequency governors. We instrument our code to accurately monitor power consumption and execution time without the need for any additional hardware, and we enable it to change CPU and GPU clock frequencies while running. We analyze our results on the different architectures using a simple energy-performance model, and derive a number of energy-saving strategies which can be easily adopted on recent high-end HPC systems for generic applications.
Submitted 8 March, 2017;
originally announced March 2017.
-
Performance and Portability of Accelerated Lattice Boltzmann Applications with OpenACC
Authors:
E. Calore,
A. Gabbana,
J. Kraus,
S. F. Schifano,
R. Tripiccione
Abstract:
An increasingly large number of HPC systems rely on heterogeneous architectures combining traditional multi-core CPUs with power-efficient accelerators. Designing efficient applications for these systems has been troublesome in the past, as accelerators could usually be programmed only using specific programming languages, threatening maintainability, portability and correctness. Several new programming environments try to tackle this problem. Among them, OpenACC offers a high-level approach based on compiler directive clauses to mark regions of existing C, C++ or Fortran codes to run on accelerators. This approach directly addresses code portability, leaving to compilers the support of each different accelerator, but one has to carefully assess the relative costs of portable approaches versus computing efficiency. In this paper we address precisely this issue, using as a test-bench a massively parallel Lattice Boltzmann algorithm. We first describe our multi-node implementation and optimization of the algorithm, using OpenACC and MPI. We then benchmark the code on a variety of processors, including traditional CPUs and GPUs, and make accurate performance comparisons with other GPU implementations of the same algorithm using CUDA and OpenCL. We also assess the performance impact associated with portable programming, and the actual portability and performance-portability of OpenACC-based applications across several state-of-the-art architectures.
Submitted 1 March, 2017;
originally announced March 2017.
-
Massively parallel lattice-Boltzmann codes on large GPU clusters
Authors:
E. Calore,
A. Gabbana,
J. Kraus,
E. Pellegrini,
S. F. Schifano,
R. Tripiccione
Abstract:
This paper describes a massively parallel code for a state-of-the-art thermal lattice-Boltzmann method. Our code has been carefully optimized for performance on one GPU and to have good scaling behavior extending to a large number of GPUs. Versions of this code have already been used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they are able to deliver higher performance than traditional processors. Writing efficient programs for large clusters is not an easy task, as codes must adapt to increasingly parallel architectures, and the overheads of node-to-node communications must be properly handled. We describe the structure of our code, discussing several key design choices that were guided by theoretical models of performance and experimental benchmarks. We present an extensive set of performance measurements and identify the corresponding main bottlenecks; finally we compare the results of our GPU code with those measured on other currently available high-performance processors. Our results are a production-grade code able to deliver a sustained performance of several tens of Tflops as well as a design and optimization methodology that can be used for the development of other high-performance applications for computational physics.
Submitted 1 March, 2017;
originally announced March 2017.
-
Steady State Visually Evoked Potentials detection using a single electrode consumer-grade EEG device for BCI applications
Authors:
Enrico Calore
Abstract:
Brain-Computer Interfaces (BCIs) implement a direct communication pathway between the brain of a user and an external device, such as a computer or a machine in general. One of the brain responses most used to implement non-invasive BCIs is the so-called steady-state visually evoked potential (SSVEP). This periodic response is generated when a user gazes at a light flickering at a constant frequency. The SSVEP response can be detected in the user's electroencephalogram (EEG) at the frequency of the attended flickering stimulus. In SSVEP-based BCIs, multiple stimuli, flickering at different frequencies, are commonly presented to the user, and each stimulus is associated with a command for an actuator. One of the limitations to a wider adoption of BCIs is the need for EEG acquisition devices and software tools which are commonly not meant for end users. In this work, exploiting state-of-the-art software tools, a low-cost, easy-to-wear, single-electrode EEG device is shown to be usable for implementing simple SSVEP-based BCIs. The obtained results, although less impressive than those obtainable with professional EEG equipment, are interesting in view of practical low-cost BCI applications meant for end users.
Submitted 15 November, 2016;
originally announced November 2016.
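The detection principle summarized in the abstract above (the SSVEP appears in the EEG at the frequency of the attended stimulus) can be sketched as picking the candidate stimulus frequency with the largest spectral power. This is a minimal illustration, not the method or software used in the paper: the signal is synthetic, the sampling rate and frequencies are made-up values, and the function names `band_power` and `detect_ssvep` are hypothetical.

```python
import math


def band_power(samples, fs, freq):
    """Power of `samples` at `freq` Hz, from a single-frequency Fourier sum."""
    re = sum(x * math.cos(2 * math.pi * freq * n / fs)
             for n, x in enumerate(samples))
    im = sum(x * math.sin(2 * math.pi * freq * n / fs)
             for n, x in enumerate(samples))
    return (re * re + im * im) / len(samples) ** 2


def detect_ssvep(samples, fs, stimulus_freqs):
    """Return the candidate stimulus frequency with the largest power."""
    return max(stimulus_freqs, key=lambda f: band_power(samples, fs, f))


# Synthetic single-channel signal (assumed values): a 12 Hz component
# standing in for the SSVEP, plus a weaker unrelated 5 Hz component.
fs = 256                                   # sampling rate in Hz
t = [n / fs for n in range(2 * fs)]        # two seconds of samples
signal = [math.sin(2 * math.pi * 12 * x) + 0.3 * math.sin(2 * math.pi * 5 * x)
          for x in t]

detected = detect_ssvep(signal, fs, [8.0, 10.0, 12.0, 15.0])
```

In a BCI, the detected frequency would then be mapped to the command associated with that stimulus. Real EEG is far noisier than this synthetic signal, so a practical detector would typically add filtering and a decision threshold.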