-
Challenges and Opportunities of Machine Learning for Monitoring and Operational Data Analytics in Quantitative Codesign of Supercomputers
Authors:
Thomas Jakobsche,
Nicolas Lachiche,
Florina M. Ciorba
Abstract:
This work examines the challenges and opportunities of Machine Learning (ML) for Monitoring and Operational Data Analytics (MODA) in the context of Quantitative Codesign of Supercomputers (QCS). MODA is employed to gain insights into the behavior of current High Performance Computing (HPC) systems to improve system efficiency, performance, and reliability (e.g. through optimizing cooling infrastructure, job scheduling, and application parameter tuning). In this work, we take the position that QCS in general, and MODA in particular, require close exchange with the ML community to realize the full potential of data-driven analysis for the benefit of existing and future HPC systems. This exchange will facilitate identifying the appropriate ML methods to gain insights into current HPC systems and to go beyond expert-based knowledge and rules of thumb.
Submitted 1 October, 2022; v1 submitted 15 September, 2022;
originally announced September 2022.
-
An Execution Fingerprint Dictionary for HPC Application Recognition
Authors:
Thomas Jakobsche,
Nicolas Lachiche,
Aurélien Cavelan,
Florina M. Ciorba
Abstract:
Applications running on HPC systems waste time and energy if they: (a) use resources inefficiently, (b) deviate from allocation purpose (e.g. cryptocurrency mining), or (c) encounter errors and failures. It is important to know which applications are running on the system, how they use the system, and whether they have been executed before. To recognize known applications during execution on a noisy system, we draw inspiration from the way Shazam recognizes known songs playing in a crowded bar. Our contribution is an Execution Fingerprint Dictionary (EFD) that stores execution fingerprints of system metrics (keys) linked to application and input size information (values) as key-value pairs for application recognition. Related work often relies on extensive system monitoring (many system metrics collected over large time windows) and employs machine learning methods to identify applications. Our solution only uses the first 2 minutes and a single system metric to achieve F-scores above 95 percent, providing comparable results to related work but with a fraction of the necessary data and a straightforward mechanism of recognition.
Submitted 10 September, 2021;
originally announced September 2021.
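The key-value recognition idea described in the Execution Fingerprint Dictionary abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: the `fingerprint` function, its `bucket` and `interval` parameters, the windowed-mean rounding scheme, and the majority vote are all assumptions chosen only to show how fingerprints of a single system metric (keys) might map to application and input size labels (values).

```python
from collections import Counter


def fingerprint(samples, bucket=10.0, interval=4):
    """Hypothetical fingerprint: one (window index, rounded mean) key per window.

    Rounding the window mean to the nearest multiple of `bucket` is an assumed
    way to tolerate noise; the abstract does not specify the actual scheme.
    """
    keys = []
    for start in range(0, len(samples) - interval + 1, interval):
        window = samples[start:start + interval]
        mean = sum(window) / len(window)
        keys.append((start // interval, round(mean / bucket) * bucket))
    return keys


class ExecutionFingerprintDictionary:
    """Toy dictionary mapping fingerprint keys to (application, input size)."""

    def __init__(self):
        self._table = {}  # key -> Counter of (application, input_size) labels

    def add(self, samples, application, input_size):
        """Store the fingerprints of a known, labeled execution."""
        for key in fingerprint(samples):
            self._table.setdefault(key, Counter())[(application, input_size)] += 1

    def recognize(self, samples):
        """Return the label with the most matching fingerprints, or None."""
        votes = Counter()
        for key in fingerprint(samples):
            votes.update(self._table.get(key, {}))
        if not votes:
            return None
        return votes.most_common(1)[0][0]


efd = ExecutionFingerprintDictionary()
# Made-up metric traces for two hypothetical applications.
efd.add([100, 102, 98, 101, 200, 198, 203, 199], "appA", "small")
efd.add([10, 12, 9, 11, 50, 52, 49, 51], "appB", "large")

# A noisy re-run of the first trace and a trace unlike anything stored.
known = efd.recognize([101, 99, 103, 100, 201, 200, 197, 202])
unknown = efd.recognize([500] * 8)
```

In this sketch, recognition is a plain dictionary lookup followed by a vote, which reflects the "straightforward mechanism of recognition" the abstract contrasts with machine learning classifiers; the bucket width trades noise tolerance against the risk of distinct applications colliding on the same key.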
-
CASP-DM: Context Aware Standard Process for Data Mining
Authors:
Fernando Martínez-Plumed,
Lidia Contreras-Ochando,
Cèsar Ferri,
Peter Flach,
José Hernández-Orallo,
Meelis Kull,
Nicolas Lachiche,
María José Ramírez-Quintana
Abstract:
We propose an extension of the Cross Industry Standard Process for Data Mining (CRISP-DM) that addresses specific challenges of machine learning and data mining for handling context and model reuse. This new general context-aware process model is mapped to the CRISP-DM reference model, proposing some new or enhanced outputs.
Submitted 19 September, 2017;
originally announced September 2017.