Search | arXiv e-print repository

Operational Data Analytics in Practice: Experiences from Design to Deployment in Production HPC Environments

Authors: Alessio Netti, Michael Ott, Carla Guillen, Daniele Tafani, Martin Schulz

Abstract: As HPC systems grow in complexity, efficient and manageable operation is increasingly critical. Many centers are thus starting to explore the use of Operational Data Analytics (ODA) techniques, which extract knowledge from massive amounts of monitoring data and use it for control and visualization purposes. As ODA is a multi-faceted problem, much effort has gone into researching its separate aspec… ▽ More As HPC systems grow in complexity, efficient and manageable operation is increasingly critical. Many centers are thus starting to explore the use of Operational Data Analytics (ODA) techniques, which extract knowledge from massive amounts of monitoring data and use it for control and visualization purposes. As ODA is a multi-faceted problem, much effort has gone into researching its separate aspects: however, accounts of production ODA experiences are still hard to come across. In this work we aim to bridge the gap between ODA research and production use by presenting our experiences with ODA in production, involving in particular the control of cooling infrastructures and visualization of job data on two HPC systems. We cover the entire development process, from design to deployment, highlighting our insights in an effort to drive the community forward. We rely on open-source tools, which make for a generic ODA framework suitable for most scenarios. △ Less

Submitted 28 June, 2021; originally announced June 2021.

Comments: Preliminary version of the article

arXiv:2010.06186 [pdf, other]

Correlation-wise Smoothing: Lightweight Knowledge Extraction for HPC Monitoring Data

Authors: Alessio Netti, Daniele Tafani, Michael Ott, Martin Schulz

Abstract: Modern High-Performance Computing (HPC) and data center operators rely more and more on data analytics techniques to improve the efficiency and reliability of their operations. They employ models that ingest time-series monitoring sensor data and transform it into actionable knowledge for system tuning: a process known as Operational Data Analytics (ODA). However, monitoring data has a high dimens… ▽ More Modern High-Performance Computing (HPC) and data center operators rely more and more on data analytics techniques to improve the efficiency and reliability of their operations. They employ models that ingest time-series monitoring sensor data and transform it into actionable knowledge for system tuning: a process known as Operational Data Analytics (ODA). However, monitoring data has a high dimensionality, is hardware-dependent and difficult to interpret. This, coupled with the strict requirements of ODA, makes most traditional data mining methods impractical and in turn renders this type of data cumbersome to process. Most current ODA solutions use ad-hoc processing methods that are not generic, are sensible to the sensors' features and are not fit for visualization. In this paper we propose a novel method, called Correlation-wise Smoothing (CS), to extract descriptive signatures from time-series monitoring data in a generic and lightweight way. Our CS method exploits correlations between data dimensions to form groups and produces image-like signatures that can be easily manipulated, visualized and compared. We evaluate the CS method on HPC-ODA, a collection of datasets that we release with this work, and show that it leads to the same performance as most state-of-the-art methods while producing signatures that are up to ten times smaller and up to ten times faster, while gaining visualizability, portability across systems and clear scaling properties. △ Less

Submitted 19 February, 2021; v1 submitted 13 October, 2020; originally announced October 2020.

Comments: Accepted for publication at the 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021)

arXiv:2007.14241 [pdf, other]

doi 10.1016/j.future.2019.11.029

A Machine Learning Approach to Online Fault Classification in HPC Systems

Authors: Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

Abstract: As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures becomes essential for continued operation. Central to this objective is fault injection, which i… ▽ More As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures becomes essential for continued operation. Central to this objective is fault injection, which is the deliberate triggering of faults in a system so as to observe their behavior in a controlled environment. In this paper, we propose a fault classification method for HPC systems based on machine learning. The novelty of our approach rests with the fact that it can be operated on streamed data in an online manner, thus opening the possibility to devise and enact control actions on the target system in real-time. We introduce a high-level, easy-to-use fault injection tool called FINJ, with a focus on the management of complex experiments. In order to train and evaluate our machine learning classifiers, we inject faults to an in-house experimental HPC system using FINJ, and generate a fault dataset which we describe extensively. Both FINJ and the dataset are publicly available to facilitate resiliency research in the HPC systems field. Experimental results demonstrate that our approach allows almost perfect classification accuracy to be reached for different fault types with low computational overhead and minimal delay. △ Less

Submitted 27 July, 2020; originally announced July 2020.

Comments: arXiv admin note: text overlap with arXiv:1807.10056, arXiv:1810.11208

Journal ref: Future Generation Computer Systems, Volume 110, September 2020, Pages 1009-1022

arXiv:1910.06156 [pdf, other]

doi 10.1145/3369583.3392674

DCDB Wintermute: Enabling Online and Holistic Operational Data Analytics on HPC Systems

Authors: Alessio Netti, Micha Mueller, Carla Guillen, Michael Ott, Daniele Tafani, Gence Ozer, Martin Schulz

Abstract: As we approach the exascale era, the size and complexity of HPC systems continues to increase, raising concerns about their manageability and sustainability. For this reason, more and more HPC centers are experimenting with fine-grained monitoring coupled with Operational Data Analytics (ODA) to optimize efficiency and effectiveness of system operations. However, while monitoring is a common reali… ▽ More As we approach the exascale era, the size and complexity of HPC systems continues to increase, raising concerns about their manageability and sustainability. For this reason, more and more HPC centers are experimenting with fine-grained monitoring coupled with Operational Data Analytics (ODA) to optimize efficiency and effectiveness of system operations. However, while monitoring is a common reality in HPC, there is no well-stated and comprehensive list of requirements, nor matching frameworks, to support holistic and online ODA. This leads to insular ad-hoc solutions, each addressing only specific aspects of the problem. In this paper we propose Wintermute, a novel generic framework to enable online ODA on large-scale HPC installations. Its design is based on the results of a literature survey of common operational requirements. We implement Wintermute on top of the holistic DCDB monitoring system, offering a large variety of configuration options to accommodate the varying requirements of ODA applications. Moreover, Wintermute is based on a set of logical abstractions to ease the configuration of models at a large scale and maximize code re-use. We highlight Wintermute's flexibility through a series of practical case studies, each targeting a different aspect of the management of HPC systems, and then demonstrate the small resource footprint of our implementation. △ Less

Submitted 18 April, 2020; v1 submitted 14 October, 2019; originally announced October 2019.

Comments: Accepted for publication at the 29th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2020)

arXiv:1906.07509 [pdf, other]

doi 10.1145/3295500.3356191

From Facility to Application Sensor Data: Modular, Continuous and Holistic Monitoring with DCDB

Authors: Alessio Netti, Micha Mueller, Axel Auweter, Carla Guillen, Michael Ott, Daniele Tafani, Martin Schulz

Abstract: Today's HPC installations are highly-complex systems, and their complexity will only increase as we move to exascale and beyond. At each layer, from facilities to systems, from runtimes to applications, a wide range of tuning decisions must be made in order to achieve efficient operation. This, however, requires systematic and continuous monitoring of system and user data. While many insular solut… ▽ More Today's HPC installations are highly-complex systems, and their complexity will only increase as we move to exascale and beyond. At each layer, from facilities to systems, from runtimes to applications, a wide range of tuning decisions must be made in order to achieve efficient operation. This, however, requires systematic and continuous monitoring of system and user data. While many insular solutions exist, a system for holistic and facility-wide monitoring is still lacking in the current HPC ecosystem. In this paper we introduce DCDB, a comprehensive monitoring system capable of integrating data from all system levels. It is designed as a modular and highly-scalable framework based on a plugin infrastructure. All monitored data is aggregated at a distributed noSQL data store for analysis and cross-system correlation. We demonstrate the performance and scalability of DCDB, and describe two use cases in the area of energy management and characterization. △ Less

Submitted 14 August, 2019; v1 submitted 18 June, 2019; originally announced June 2019.

Comments: Accepted at the The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC) 2019

arXiv:1810.11208 [pdf, other]

Online Fault Classification in HPC Systems through Machine Learning

Authors: Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

Abstract: As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that they will experience excessive failure rates. For this reason, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures will be essential for continued operation. In this paper, we propose a fault classification method for… ▽ More As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that they will experience excessive failure rates. For this reason, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures will be essential for continued operation. In this paper, we propose a fault classification method for HPC systems based on machine learning that has been designed specifically to operate with live streamed data. We cast the problem and its solution within realistic operating constraints of online use. Our results show that almost perfect classification accuracy can be reached for different fault types with low computational overhead and minimal delay. We have based our study on a local dataset, which we make publicly available, that was acquired by injecting faults to an in-house experimental HPC system. △ Less

Submitted 11 July, 2019; v1 submitted 26 October, 2018; originally announced October 2018.

Comments: Accepted for publication at the Euro-Par 2019 conference

arXiv:1807.10056 [pdf, other]

FINJ: A Fault Injection Tool for HPC Systems

Authors: Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

Abstract: We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC) systems, with a focus on the management of complex experiments. FINJ provides support for custom workloads and allows generation of anomalous conditions through the use of fault-triggering executable programs. FINJ can also be integrated seamlessly with most other lower-level fault injection tools, allowing use… ▽ More We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC) systems, with a focus on the management of complex experiments. FINJ provides support for custom workloads and allows generation of anomalous conditions through the use of fault-triggering executable programs. FINJ can also be integrated seamlessly with most other lower-level fault injection tools, allowing users to create and monitor a variety of highly-complex and diverse fault conditions in HPC systems that would be difficult to recreate in practice. FINJ is suitable for experiments involving many, potentially interacting nodes, making it a very versatile design and evaluation tool. △ Less

Submitted 1 September, 2018; v1 submitted 26 July, 2018; originally announced July 2018.

Comments: To be presented at the 11th Resilience Workshop in the 2018 Euro-Par conference

arXiv:1806.06728 [pdf, other]

AccaSim: a Customizable Workload Management Simulator for Job Dispatching Research in HPC Systems

Authors: Cristian Galleguillos, Zeynep Kiziltan, Alessio Netti, Ricardo Soto

Abstract: We present AccaSim, a simulator for workload management in HPC systems. Thanks to AccaSim's scalability to large workload datasets, support for easy customization, and practical automated tools to aid experimentation, users can easily represent various real HPC systems, develop novel advanced dispatchers and evaluate them in a convenient way across different workload sources. AccaSim is thus an at… ▽ More We present AccaSim, a simulator for workload management in HPC systems. Thanks to AccaSim's scalability to large workload datasets, support for easy customization, and practical automated tools to aid experimentation, users can easily represent various real HPC systems, develop novel advanced dispatchers and evaluate them in a convenient way across different workload sources. AccaSim is thus an attractive tool for conducting job dispatching research in HPC systems. △ Less

Submitted 18 June, 2018; originally announced June 2018.

Comments: 27 pages

Showing 1–8 of 8 results for author: Netti, A