A Taxonomy of Error Sources in HPC I/O Machine Learning Models

Isakov, Mihailo; Currier, Mikaela; del Rosario, Eliakin; Madireddy, Sandeep; Balaprakash, Prasanna; Carns, Philip; Ross, Robert B.; Lockwood, Glenn K.; Kinsy, Michel A.

arXiv:2204.08180 (cs)

[Submitted on 18 Apr 2022]

Abstract: I/O efficiency is crucial to productivity in scientific computing, but the increasing complexity of systems and applications makes it difficult for practitioners to understand and optimize I/O behavior at scale. Data-driven, machine learning-based I/O throughput models offer a solution: they can be used to identify bottlenecks, automate I/O tuning, or optimize job scheduling with minimal human intervention. Unfortunately, current state-of-the-art I/O models are not robust enough for production use and underperform after being deployed.
We analyze multiple years of application, scheduler, and storage system logs on two leadership-class HPC platforms to understand why I/O models underperform in practice. We propose a taxonomy consisting of five categories of I/O modeling errors: poor application modeling, poor system modeling, inadequate dataset coverage, I/O contention, and I/O noise. We develop litmus tests to quantify each category, allowing researchers to narrow down failure modes, enhance I/O throughput models, and improve future generations of HPC logging and analysis tools.

Subjects:	Distributed, Parallel, and Cluster Computing (cs.DC); Performance (cs.PF)
Report number:	STAM01
Cite as:	arXiv:2204.08180 [cs.DC]
	(or arXiv:2204.08180v1 [cs.DC] for this version)
	https://doi.org/10.48550/arXiv.2204.08180

Submission history

From: Michel Kinsy
[v1] Mon, 18 Apr 2022 06:26:28 UTC (9,380 KB)
