
Showing 1–3 of 3 results for author: Jakobsche, T

  1. arXiv:2401.16971 [pdf, other]

    cs.DC

    Autonomy Loops for Monitoring, Operational Data Analytics, Feedback, and Response in HPC Operations

    Authors: Francieli Boito, Jim Brandt, Valeria Cardellini, Philip Carns, Florina M. Ciorba, Hilary Egan, Ahmed Eleliemy, Ann Gentile, Thomas Gruber, Jeff Hanson, Utz-Uwe Haus, Kevin Huck, Thomas Ilsche, Thomas Jakobsche, Terry Jones, Sven Karlsson, Abdullah Mueen, Michael Ott, Tapasya Patki, Ivy Peng, Krishnan Raghavan, Stephen Simms, Kathleen Shoga, Michael Showerman, Devesh Tiwari, et al. (2 additional authors not shown)

    Abstract: Many High Performance Computing (HPC) facilities have developed and deployed frameworks in support of continuous monitoring and operational data analytics (MODA) to help improve efficiency and throughput. Because of the complexity and scale of systems and workflows and the need for low-latency response to address dynamic circumstances, automated feedback and response have the potential to be more…

    Submitted 30 January, 2024; originally announced January 2024.

  2. arXiv:2209.07164 [pdf]

    cs.DC

    Challenges and Opportunities of Machine Learning for Monitoring and Operational Data Analytics in Quantitative Codesign of Supercomputers

    Authors: Thomas Jakobsche, Nicolas Lachiche, Florina M. Ciorba

    Abstract: This work examines the challenges and opportunities of Machine Learning (ML) for Monitoring and Operational Data Analytics (MODA) in the context of Quantitative Codesign of Supercomputers (QCS). MODA is employed to gain insights into the behavior of current High Performance Computing (HPC) systems to improve system efficiency, performance, and reliability (e.g. through optimizing cooling infrastru…

    Submitted 1 October, 2022; v1 submitted 15 September, 2022; originally announced September 2022.

  3. arXiv:2109.04766 [pdf, other]

    cs.DC

    An Execution Fingerprint Dictionary for HPC Application Recognition

    Authors: Thomas Jakobsche, Nicolas Lachiche, Aurélien Cavelan, Florina M. Ciorba

    Abstract: Applications running on HPC systems waste time and energy if they: (a) use resources inefficiently, (b) deviate from allocation purpose (e.g. cryptocurrency mining), or (c) encounter errors and failures. It is important to know which applications are running on the system, how they use the system, and whether they have been executed before. To recognize known applications during execution on a noi…

    Submitted 10 September, 2021; originally announced September 2021.