-
Autonomy Loops for Monitoring, Operational Data Analytics, Feedback, and Response in HPC Operations
Authors:
Francieli Boito,
Jim Brandt,
Valeria Cardellini,
Philip Carns,
Florina M. Ciorba,
Hilary Egan,
Ahmed Eleliemy,
Ann Gentile,
Thomas Gruber,
Jeff Hanson,
Utz-Uwe Haus,
Kevin Huck,
Thomas Ilsche,
Thomas Jakobsche,
Terry Jones,
Sven Karlsson,
Abdullah Mueen,
Michael Ott,
Tapasya Patki,
Ivy Peng,
Krishnan Raghavan,
Stephen Simms,
Kathleen Shoga,
Michael Showerman,
Devesh Tiwari,
et al. (2 additional authors not shown)
Abstract:
Many High Performance Computing (HPC) facilities have developed and deployed frameworks in support of continuous monitoring and operational data analytics (MODA) to help improve efficiency and throughput. Because of the complexity and scale of systems and workflows and the need for low-latency response to address dynamic circumstances, automated feedback and response have the potential to be more effective than current human-in-the-loop approaches, which are laborious and error-prone. Progress has been limited, however, by factors such as the lack of infrastructure and feedback hooks, and successful deployment is often site- and case-specific. In this position paper we report on the outcomes and plans from a recent Dagstuhl Seminar, seeking to carve a path for community progress in the development of autonomous feedback loops for MODA, based on the established formalism of similar (MAPE-K) loops in autonomous computing and self-adaptive systems. By defining and developing such loops for significant cases experienced across HPC sites, we seek to extract commonalities and develop conventions that will facilitate interoperability and interchangeability with system hardware, software, and applications across different sites, and will motivate vendors and others to provide telemetry interfaces and feedback hooks to enable community development and pervasive deployment of MODA autonomy loops.
Submitted 30 January, 2024;
originally announced January 2024.
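For readers unfamiliar with the MAPE-K formalism mentioned in the abstract above, the following is a minimal sketch of one loop iteration (Monitor, Analyze, Plan, Execute over shared Knowledge). It is not taken from the paper: the power-cap policy, the `lower_cpu_frequency` action name, the sensor callable, and the feedback-hook dictionary are all hypothetical placeholders chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Knowledge:
    """Shared state used by the phases (the 'K' in MAPE-K)."""
    power_cap_watts: float = 300.0  # hypothetical policy threshold
    history: List[float] = field(default_factory=list)


def monitor(read_sensor: Callable[[], float], k: Knowledge) -> float:
    """Monitor: collect one telemetry sample and record it."""
    sample = read_sensor()
    k.history.append(sample)
    return sample


def analyze(sample: float, k: Knowledge) -> bool:
    """Analyze: decide whether the sample violates the policy."""
    return sample > k.power_cap_watts


def plan(violation: bool) -> List[str]:
    """Plan: choose response actions (hypothetical action name)."""
    return ["lower_cpu_frequency"] if violation else []


def execute(actions: List[str], hooks: Dict[str, Callable[[], None]]) -> List[str]:
    """Execute: invoke the feedback hook registered for each action."""
    done = []
    for action in actions:
        hooks[action]()
        done.append(action)
    return done


def run_once(read_sensor: Callable[[], float],
             hooks: Dict[str, Callable[[], None]],
             k: Knowledge) -> List[str]:
    """Run a single Monitor -> Analyze -> Plan -> Execute pass."""
    sample = monitor(read_sensor, k)
    return execute(plan(analyze(sample, k)), hooks)
```

In this sketch the `hooks` dictionary stands in for the site-specific feedback hooks whose absence the abstract names as a limiting factor; a real deployment would replace the callables with whatever interfaces the site actually exposes.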
-
Challenges and Opportunities of Machine Learning for Monitoring and Operational Data Analytics in Quantitative Codesign of Supercomputers
Authors:
Thomas Jakobsche,
Nicolas Lachiche,
Florina M. Ciorba
Abstract:
This work examines the challenges and opportunities of Machine Learning (ML) for Monitoring and Operational Data Analytics (MODA) in the context of Quantitative Codesign of Supercomputers (QCS). MODA is employed to gain insights into the behavior of current High Performance Computing (HPC) systems to improve system efficiency, performance, and reliability (e.g. through optimizing cooling infrastructure, job scheduling, and application parameter tuning). In this work, we take the position that QCS in general, and MODA in particular, require close exchange with the ML community to realize the full potential of data-driven analysis for the benefit of existing and future HPC systems. This exchange will facilitate identifying the appropriate ML methods to gain insights into current HPC systems and to go beyond expert-based knowledge and rules of thumb.
Submitted 1 October, 2022; v1 submitted 15 September, 2022;
originally announced September 2022.
-
An Execution Fingerprint Dictionary for HPC Application Recognition
Authors:
Thomas Jakobsche,
Nicolas Lachiche,
Aurélien Cavelan,
Florina M. Ciorba
Abstract:
Applications running on HPC systems waste time and energy if they: (a) use resources inefficiently, (b) deviate from allocation purpose (e.g. cryptocurrency mining), or (c) encounter errors and failures. It is important to know which applications are running on the system, how they use the system, and whether they have been executed before. To recognize known applications during execution on a noisy system, we draw inspiration from the way Shazam recognizes known songs playing in a crowded bar. Our contribution is an Execution Fingerprint Dictionary (EFD) that stores execution fingerprints of system metrics (keys) linked to application and input size information (values) as key-value pairs for application recognition. Related work often relies on extensive system monitoring (many system metrics collected over large time windows) and employs machine learning methods to identify applications. Our solution only uses the first 2 minutes and a single system metric to achieve F-scores above 95 percent, providing comparable results to related work but with a fraction of the necessary data and a straightforward mechanism of recognition.
Submitted 10 September, 2021;
originally announced September 2021.
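The key-value idea behind the Execution Fingerprint Dictionary described above can be illustrated with a short sketch. This is not the authors' implementation: the quantization scheme (mean per interval, rounded to a coarse bucket), the parameter names `n_intervals` and `bucket`, and the application labels are assumptions made only to show how a fingerprint of a single system metric could serve as a dictionary key.

```python
from typing import Dict, List, Optional, Tuple

Fingerprint = Tuple[int, ...]
Label = Tuple[str, str]  # (application name, input size)


def fingerprint(samples: List[float], n_intervals: int = 4,
                bucket: float = 10.0) -> Fingerprint:
    """Summarize one metric's time series as a tuple of coarse buckets.

    Assumed scheme: split the series into equal intervals, take the mean
    of each, and round it to the nearest multiple of `bucket` so that
    small noise maps to the same key.
    """
    size = len(samples) // n_intervals
    if size == 0:
        raise ValueError("need at least n_intervals samples")
    means = [sum(samples[i * size:(i + 1) * size]) / size
             for i in range(n_intervals)]
    return tuple(round(m / bucket) for m in means)


def register(efd: Dict[Fingerprint, Label], samples: List[float],
             app: str, input_size: str) -> None:
    """Store a known execution: fingerprint (key) -> label (value)."""
    efd[fingerprint(samples)] = (app, input_size)


def recognize(efd: Dict[Fingerprint, Label],
              samples: List[float]) -> Optional[Label]:
    """Look up a running execution; None means it is not in the dictionary."""
    return efd.get(fingerprint(samples))
```

A dictionary lookup keeps recognition to a single hash probe, which matches the abstract's emphasis on a straightforward mechanism; the trade-off in this sketch is that noise near a bucket boundary produces a different key and therefore a miss.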