-
Design of an energy aware petaflops class high performance cluster based on power architecture
Authors:
W. A. Ahmad,
A. Bartolini,
F. Beneventi,
L. Benini,
A. Borghesi,
M. Cicala,
P. Forestieri,
C. Gianfreda,
D. Gregori,
A. Libri,
F. Spiga,
S. Tinti
Abstract:
In this paper we present D.A.V.I.D.E. (Development for an Added Value Infrastructure Designed in Europe), an innovative and energy-efficient High Performance Computing cluster designed by E4 Computer Engineering for PRACE (Partnership for Advanced Computing in Europe). D.A.V.I.D.E. is built using best-in-class components (IBM's POWER8-NVLink CPUs, NVIDIA TESLA P100 GPUs, Mellanox InfiniBand EDR 100 Gb/s networking) plus custom hardware and innovative system middleware. D.A.V.I.D.E. features (i) a dedicated power monitor interface, built around the BeagleBone Black board, that allows high-frequency sampling directly from the power backplane and scalable integration with the internal node telemetry and system-level power management software; (ii) a custom-built chassis, based on the OpenRack form factor, with liquid cooling that allows the system to be used in modern, energy-efficient datacenters; (iii) software components designed to enable fine-grained power monitoring, power management (i.e. power capping and energy-aware job scheduling) and application power profiling, based on dedicated machine learning components. Software APIs are offered to developers and users to tune the computing node performance and power consumption around the application requirements. The first pilot system, which we will deploy at the beginning of 2017, will demonstrate key HPC applications from different fields ported and optimized for this innovative platform.
Submitted 11 July, 2023;
originally announced July 2023.
-
pAElla: Edge-AI based Real-Time Malware Detection in Data Centers
Authors:
Antonio Libri,
Andrea Bartolini,
Luca Benini
Abstract:
The increasing use of Internet-of-Things (IoT) devices for monitoring a wide spectrum of applications, along with the challenges of "big data" streaming support they often require for data analysis, is driving increased attention to the emerging edge computing paradigm. In particular, smart approaches to manage and analyze data directly on the network edge are increasingly investigated, and Artificial Intelligence (AI) powered edge computing is envisaged to be a promising direction. In this paper, we focus on Data Centers (DCs) and Supercomputers (SCs), where a new generation of high-resolution monitoring systems is being deployed, opening new opportunities for analysis like anomaly detection and security, but introducing new challenges for handling the vast amount of data they produce. In detail, we report on a novel lightweight and scalable approach to increase the security of DCs/SCs that involves AI-powered edge computing on high-resolution power consumption. The method, called pAElla, targets real-time Malware Detection (MD). It runs on an out-of-band IoT-based monitoring system for DCs/SCs and involves the Power Spectral Density of power measurements, along with AutoEncoders. Results are promising, with an F1-score close to 1, and False Alarm and Malware Miss rates close to 0%. We compare our method with State-of-the-Art (SoA) MD techniques and show that, in the context of DCs/SCs, pAElla can cover a wider range of malware, significantly outperforming SoA approaches in terms of accuracy. Moreover, we propose a methodology for online training suitable for DCs/SCs in production, and release an open dataset and code.
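As a rough illustration of the Power Spectral Density step mentioned in the abstract, the sketch below estimates the PSD of a power trace with a naive periodogram. It is not the pAElla implementation: the function name `periodogram`, the synthetic 100 W trace with a 50 Hz component, and the 1 kHz sample rate are all assumptions chosen for the example, and the AutoEncoder that would consume these features is omitted.

```python
import cmath
import math


def periodogram(samples, sample_rate_hz):
    """Return (frequencies, psd) for a real signal using a naive O(n^2) DFT.

    The mean is removed first so the DC component does not dominate.
    The usual one-sided doubling of non-DC bins is omitted for brevity.
    """
    n = len(samples)
    mean = sum(samples) / n
    centered = [s - mean for s in samples]
    freqs, psd = [], []
    for k in range(n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(centered)
        )
        freqs.append(k * sample_rate_hz / n)
        psd.append(abs(coeff) ** 2 / (sample_rate_hz * n))
    return freqs, psd


# Synthetic trace (assumption): 100 W baseline plus a 5 W, 50 Hz periodic load.
SAMPLE_RATE_HZ = 1000.0
trace = [
    100.0 + 5.0 * math.sin(2 * math.pi * 50.0 * i / SAMPLE_RATE_HZ)
    for i in range(200)
]

freqs, psd = periodogram(trace, SAMPLE_RATE_HZ)
# Frequency bin holding the most spectral power.
peak_hz = freqs[max(range(len(psd)), key=psd.__getitem__)]
print(f"dominant frequency: {peak_hz} Hz")
```

In a real deployment an FFT library would replace the quadratic loop; the naive form is used here only to keep the example dependency-free.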
Submitted 7 April, 2020;
originally announced April 2020.
-
Online Anomaly Detection in HPC Systems
Authors:
Andrea Borghesi,
Antonio Libri,
Luca Benini,
Andrea Bartolini
Abstract:
Reliability is a cumbersome problem in the evolution of High Performance Computing systems and Data Centers. During operation, several types of fault conditions or anomalies can arise, ranging from malfunctioning hardware to improper configurations or imperfect software. Currently, system administrators and end users have to discover them manually. Clearly this approach does not scale to large-scale supercomputers and facilities: automated methods to detect faults and unhealthy conditions are needed. Our method uses a type of neural network called an autoencoder, trained to learn the normal behavior of a real, in-production HPC system and deployed on the edge of each computing node. We obtain very good accuracy (values ranging between 90% and 95%) and we also demonstrate that the approach can be deployed on the supercomputer nodes without negatively affecting the performance of the computing units.
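The detection idea described above (learn normal behavior, then flag samples that reconstruct poorly) can be sketched as follows. This is not the authors' model: for brevity it uses a linear autoencoder with a one-unit bottleneck, fitted in closed form through PCA power iteration instead of a trained neural network. The features (CPU load, node power), the synthetic healthy data, and the 1.5x threshold margin are all hypothetical.

```python
import math


def fit_linear_autoencoder(rows, iterations=50):
    """Fit mean and leading principal direction of 2-D data by power iteration.

    An optimal linear autoencoder with a 1-unit bottleneck spans this direction.
    """
    n = len(rows)
    mean = [sum(r[j] for r in rows) / n for j in range(2)]
    centered = [[r[0] - mean[0], r[1] - mean[1]] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / n for b in range(2)]
        for a in range(2)
    ]
    v = [1.0, 1.0]
    for _ in range(iterations):
        w = [
            cov[0][0] * v[0] + cov[0][1] * v[1],
            cov[1][0] * v[0] + cov[1][1] * v[1],
        ]
        norm = math.hypot(w[0], w[1])
        v = [w[0] / norm, w[1] / norm]
    return mean, v


def reconstruction_error(row, mean, v):
    """Encode a sample to one value, decode it, and return the residual norm."""
    c = [row[0] - mean[0], row[1] - mean[1]]
    code = c[0] * v[0] + c[1] * v[1]          # encoder: project onto v
    recon = [code * v[0], code * v[1]]        # decoder: map back to 2-D
    return math.hypot(c[0] - recon[0], c[1] - recon[1])


# Hypothetical healthy samples: power (W) grows linearly with CPU load.
healthy = [
    (i / 20, 50.0 + 100.0 * (i / 20) + (0.5 if i % 2 == 0 else -0.5))
    for i in range(21)
]
mean, direction = fit_linear_autoencoder(healthy)

# Threshold: worst error seen on healthy data, plus a hypothetical 50% margin.
threshold = 1.5 * max(reconstruction_error(r, mean, direction) for r in healthy)


def is_anomalous(row):
    return reconstruction_error(row, mean, direction) > threshold


print(is_anomalous((0.5, 100.0)))   # load and power agree with the learned relation
print(is_anomalous((0.1, 140.0)))   # high power at low load
```

In practice the features would be normalized before fitting, and a nonlinear autoencoder would capture relations that a single linear direction cannot.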
Submitted 22 February, 2019;
originally announced February 2019.
-
The ANTAREX Domain Specific Language for High Performance Computing
Authors:
Cristina Silvano,
Giovanni Agosta,
Andrea Bartolini,
Andrea R. Beccari,
Luca Benini,
Loïc Besnard,
João Bispo,
Radim Cmar,
João M. P. Cardoso,
Carlo Cavazzoni,
Daniele Cesarini,
Stefano Cherubin,
Federico Ficarelli,
Davide Gadioli,
Martin Golasowski,
Antonio Libri,
Jan Martinovič,
Gianluca Palermo,
Pedro Pinto,
Erven Rohou,
Kateřina Slaninová,
Emanuele Vitali
Abstract:
The ANTAREX project relies on a Domain Specific Language (DSL) based on Aspect Oriented Programming (AOP) concepts to allow applications to enforce extra-functional properties, such as energy efficiency and performance, and to optimize Quality of Service (QoS) in an adaptive way. The DSL approach allows the definition of energy-efficiency, performance, and adaptivity strategies as well as their enforcement at runtime through application autotuning and resource and power management. In this paper, we present an overview of the key outcome of the project, the ANTAREX DSL, and some of its capabilities through a number of examples, including how the DSL is applied in the context of the project use cases.
Submitted 18 January, 2019;
originally announced January 2019.
-
DiG: Enabling Out-of-Band Scalable High-Resolution Monitoring for Data-Center Analytics, Automation and Control (Extended)
Authors:
Antonio Libri,
Andrea Bartolini,
Luca Benini
Abstract:
Data centers are increasing in size and complexity, and we need scalable approaches to support their automated analysis and control. Performance counters and power consumption are their key "vital signs". State-of-the-Art (SoA) monitoring systems provide built-in tools to collect performance measurements, and custom solutions to gain insight into their power consumption. However, with the increase in measurement resolution (in time and space) and the ensuing huge amount of measurement data to handle, new challenges arise, such as bottlenecks in network bandwidth and storage, and software overhead on the monitoring units. To face these challenges, we propose a novel monitoring platform for data centers, which enables real-time high-resolution profiling (i.e., all available performance counters and the entire signal bandwidth of the power consumption at the plug, sampled as often as every 20 µs, with an error below 1%) and analytics, both at the edge (node-level analysis) and on a centralized unit (cluster-level analysis). The monitoring infrastructure is completely out-of-band, scalable, technology agnostic and low cost, and it is already installed in a SoA high-performance compute cluster (i.e., D.A.V.I.D.E., 18th in the Green500 of November 2017).
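To give a feel for why this sampling resolution stresses network and storage, the back-of-the-envelope sketch below converts the 20 µs sampling period into raw data volumes. Only the sampling period comes from the abstract; the 4-byte sample size and the 45-node cluster size are assumptions made for the arithmetic, and the function name is hypothetical.

```python
SAMPLE_PERIOD_S = 20e-6      # from the abstract: one sample every 20 µs
BYTES_PER_SAMPLE = 4         # assumption: one 32-bit power reading per sample
NODE_COUNT = 45              # assumption: hypothetical cluster size
SECONDS_PER_DAY = 86_400


def raw_power_data_rates(period_s, bytes_per_sample, node_count):
    """Return sample rate and uncompressed data volumes for power monitoring."""
    samples_per_second = round(1 / period_s)
    node_bytes_per_second = samples_per_second * bytes_per_sample
    return {
        "samples_per_second": samples_per_second,
        "node_bytes_per_day": node_bytes_per_second * SECONDS_PER_DAY,
        "cluster_bytes_per_second": node_bytes_per_second * node_count,
    }


rates = raw_power_data_rates(SAMPLE_PERIOD_S, BYTES_PER_SAMPLE, NODE_COUNT)
print(f"{rates['samples_per_second']} samples/s per node")
print(f"{rates['node_bytes_per_day'] / 1e9:.2f} GB/day per node")
print(f"{rates['cluster_bytes_per_second'] / 1e6:.1f} MB/s across the cluster")
```

Under these assumptions a single node produces about 17 GB of raw power samples per day, which suggests why node-level (edge) analysis is attractive before anything is shipped to a central unit.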
Submitted 17 July, 2019; v1 submitted 7 June, 2018;
originally announced June 2018.