Showing 1–8 of 8 results for author: Netti, A

  1. arXiv:2106.14423  [pdf, other]

    cs.DC eess.SY

    Operational Data Analytics in Practice: Experiences from Design to Deployment in Production HPC Environments

    Authors: Alessio Netti, Michael Ott, Carla Guillen, Daniele Tafani, Martin Schulz

    Abstract: As HPC systems grow in complexity, efficient and manageable operation is increasingly critical. Many centers are thus starting to explore the use of Operational Data Analytics (ODA) techniques, which extract knowledge from massive amounts of monitoring data and use it for control and visualization purposes. As ODA is a multi-faceted problem, much effort has gone into researching its separate aspec…

    Submitted 28 June, 2021; originally announced June 2021.

    Comments: Preliminary version of the article

  2. arXiv:2010.06186  [pdf, other]

    cs.DC cs.LG eess.SY

    Correlation-wise Smoothing: Lightweight Knowledge Extraction for HPC Monitoring Data

    Authors: Alessio Netti, Daniele Tafani, Michael Ott, Martin Schulz

    Abstract: Modern High-Performance Computing (HPC) and data center operators rely more and more on data analytics techniques to improve the efficiency and reliability of their operations. They employ models that ingest time-series monitoring sensor data and transform it into actionable knowledge for system tuning: a process known as Operational Data Analytics (ODA). However, monitoring data has a high dimens…

    Submitted 19 February, 2021; v1 submitted 13 October, 2020; originally announced October 2020.

    Comments: Accepted for publication at the 35th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2021)

  3. A Machine Learning Approach to Online Fault Classification in HPC Systems

    Authors: Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

    Abstract: As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures becomes essential for continued operation. Central to this objective is fault injection, which i…

    Submitted 27 July, 2020; originally announced July 2020.

    Comments: arXiv admin note: text overlap with arXiv:1807.10056, arXiv:1810.11208

    Journal ref: Future Generation Computer Systems, Volume 110, September 2020, Pages 1009-1022

  4. DCDB Wintermute: Enabling Online and Holistic Operational Data Analytics on HPC Systems

    Authors: Alessio Netti, Micha Mueller, Carla Guillen, Michael Ott, Daniele Tafani, Gence Ozer, Martin Schulz

    Abstract: As we approach the exascale era, the size and complexity of HPC systems continues to increase, raising concerns about their manageability and sustainability. For this reason, more and more HPC centers are experimenting with fine-grained monitoring coupled with Operational Data Analytics (ODA) to optimize efficiency and effectiveness of system operations. However, while monitoring is a common reali…

    Submitted 18 April, 2020; v1 submitted 14 October, 2019; originally announced October 2019.

    Comments: Accepted for publication at the 29th ACM International Symposium on High-Performance Parallel and Distributed Computing (HPDC 2020)

  5. From Facility to Application Sensor Data: Modular, Continuous and Holistic Monitoring with DCDB

    Authors: Alessio Netti, Micha Mueller, Axel Auweter, Carla Guillen, Michael Ott, Daniele Tafani, Martin Schulz

    Abstract: Today's HPC installations are highly-complex systems, and their complexity will only increase as we move to exascale and beyond. At each layer, from facilities to systems, from runtimes to applications, a wide range of tuning decisions must be made in order to achieve efficient operation. This, however, requires systematic and continuous monitoring of system and user data. While many insular solut…

    Submitted 14 August, 2019; v1 submitted 18 June, 2019; originally announced June 2019.

    Comments: Accepted at the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC) 2019

  6. arXiv:1810.11208  [pdf, other]

    cs.DC

    Online Fault Classification in HPC Systems through Machine Learning

    Authors: Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

    Abstract: As High-Performance Computing (HPC) systems strive towards the exascale goal, studies suggest that they will experience excessive failure rates. For this reason, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures will be essential for continued operation. In this paper, we propose a fault classification method for…

    Submitted 11 July, 2019; v1 submitted 26 October, 2018; originally announced October 2018.

    Comments: Accepted for publication at the Euro-Par 2019 conference

  7. arXiv:1807.10056  [pdf, other]

    cs.DC

    FINJ: A Fault Injection Tool for HPC Systems

    Authors: Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

    Abstract: We present FINJ, a high-level fault injection tool for High-Performance Computing (HPC) systems, with a focus on the management of complex experiments. FINJ provides support for custom workloads and allows generation of anomalous conditions through the use of fault-triggering executable programs. FINJ can also be integrated seamlessly with most other lower-level fault injection tools, allowing use…

    Submitted 1 September, 2018; v1 submitted 26 July, 2018; originally announced July 2018.

    Comments: To be presented at the 11th Resilience Workshop in the 2018 Euro-Par conference

  8. arXiv:1806.06728  [pdf, other]

    cs.DC

    AccaSim: a Customizable Workload Management Simulator for Job Dispatching Research in HPC Systems

    Authors: Cristian Galleguillos, Zeynep Kiziltan, Alessio Netti, Ricardo Soto

    Abstract: We present AccaSim, a simulator for workload management in HPC systems. Thanks to AccaSim's scalability to large workload datasets, support for easy customization, and practical automated tools to aid experimentation, users can easily represent various real HPC systems, develop novel advanced dispatchers and evaluate them in a convenient way across different workload sources. AccaSim is thus an at…

    Submitted 18 June, 2018; originally announced June 2018.

    Comments: 27 pages