Search | arXiv e-print repository

arXiv:2406.19058

Understanding the Impact of openPMD on BIT1, a Particle-in-Cell Monte Carlo Code, through Instrumentation, Monitoring, and In-Situ Analysis

Authors: Jeremy J. Williams, Stefan Costea, Allen D. Malony, David Tskhakaya, Leon Kos, Ales Podolnik, Jakub Hromadka, Kevin Huck, Erwin Laure, Stefano Markidis

Abstract: Particle-in-Cell Monte Carlo simulations on large-scale systems play a fundamental role in understanding the complexities of plasma dynamics in fusion devices. Efficient handling and analysis of vast datasets are essential for advancing these simulations. Previously, we addressed this challenge by integrating openPMD with BIT1, a Particle-in-Cell Monte Carlo code, streamlining data streaming and storage. This integration not only enhanced data management but also improved write throughput and storage efficiency. In this work, we delve deeper into the impact of BIT1 openPMD BP4 instrumentation, monitoring, and in-situ analysis. Utilizing cutting-edge profiling and monitoring tools such as gprof, CrayPat, Cray Apprentice2, IPM, and Darshan, we dissect BIT1's performance post-integration, shedding light on computation, communication, and I/O operations. Fine-grained instrumentation offers insights into BIT1's runtime behavior, while immediate monitoring aids in understanding system dynamics and resource utilization patterns, facilitating proactive performance optimization. Advanced visualization techniques further enrich our understanding, enabling the optimization of BIT1 simulation workflows aimed at controlling plasma-material interfaces with improved data analysis and visualization at every checkpoint without causing any interruption to the simulation.

Submitted 27 June, 2024; originally announced June 2024.

Comments: Accepted by the Euro-Par 2024 workshops (PHYSHPC 2024); prepared in the standardized Springer LNCS format; 12 pages, including the main text, references, and figures

arXiv:2304.11205

STaKTAU: profiling HPC applications' operating system usage

Authors: Camille Coti, Kevin Huck, Allen D. Malony

Abstract: This paper presents an approach for measuring the time spent by HPC applications in the operating system's kernel. We use the SystemTap interface to insert timers before and after system calls, and take advantage of its stability to design a tool that can be used with multiple versions of the kernel. We evaluate its performance overhead, using an OS-intensive mini-benchmark and a raytracing mini app.

Submitted 21 April, 2023; originally announced April 2023.

arXiv:2210.00798

doi 10.1109/CLUSTER51413.2022.00049

HPC Storage Service Autotuning Using Variational-Autoencoder-Guided Asynchronous Bayesian Optimization

Authors: Matthieu Dorier, Romain Egele, Prasanna Balaprakash, Jaehoon Koo, Sandeep Madireddy, Srinivasan Ramesh, Allen D. Malony, Rob Ross

Abstract: Distributed data storage services tailored to specific applications have grown popular in the high-performance computing (HPC) community as a way to address I/O and storage challenges. These services offer a variety of specific interfaces, semantics, and data representations. They also expose many tuning parameters, making it difficult for their users to find the best configuration for a given workload and platform. To address this issue, we develop a novel variational-autoencoder-guided asynchronous Bayesian optimization method to tune HPC storage service parameters. Our approach uses transfer learning to leverage prior tuning results and use a dynamically updated surrogate model to explore the large parameter search space in a systematic way. We implement our approach within the DeepHyper open-source framework, and apply it to the autotuning of a high-energy physics workflow on Argonne's Theta supercomputer. We show that our transfer-learning approach enables a more than $40\times$ search speedup over random search, compared with a $2.5\times$ to $10\times$ speedup when not using transfer learning. Additionally, we show that our approach is on par with state-of-the-art autotuning frameworks in speed and outperforms them in resource utilization and parallelization capabilities.

Submitted 3 October, 2022; originally announced October 2022.

Comments: Accepted at IEEE Cluster 2022

arXiv:2202.08948

SKaMPI-OpenSHMEM: Measuring OpenSHMEM Communication Routines

Authors: Camille Coti, Allen D. Malony

Abstract: Benchmarking is an important challenge in HPC, in particular, to be able to tune the basic blocks of the software environment used by applications. The communication library and distributed run-time environment are among the most critical ones. In particular, many of the routines provided by communication libraries can be adjusted using parameters such as buffer sizes and communication algorithm. As a consequence, being able to measure accurately the time taken by these routines is crucial in order to optimize them and achieve the best performance. For instance, the SKaMPI library was designed to measure the time taken by MPI routines, relying on MPI's two-sided communication model to measure one-sided and two-sided peer-to-peer communication and collective routines. In this paper, we discuss the benchmarking challenges specific to OpenSHMEM's communication model, mainly to avoid inter-call pipelining and overlapping when measuring the time taken by its routines. We extend SKaMPI for OpenSHMEM for this purpose and demonstrate measurement algorithms that address OpenSHMEM's communication model in practice. Scaling experiments are run on the Summit platform to compare different benchmarking approaches on the SKaMPI benchmark operations. These show the advantages of our techniques for more accurate performance characterization.

Submitted 17 February, 2022; originally announced February 2022.

Comments: 17 pages, OpenSHMEM workshop 2021

arXiv:2105.13395

Measuring OpenSHMEM Communication Routines with SKaMPI-OpenSHMEM User's manual

Authors: Camille Coti, Allen D. Malony

Abstract: This document presents the OpenSHMEM extension for the Special Karlsruhe MPI benchmark and the measurement algorithms used to measure the routines.

Submitted 27 May, 2021; originally announced May 2021.

Comments: This paper is a technical report that comes with our benchmarking software. It implements distributed algorithms for the measurement of distributed operations

arXiv:2003.01081

On-the-fly Optimization of Parallel Computation of Symbolic Symplectic Invariants

Authors: Joseph Ben Geloun, Camille Coti, Allen D. Malony

Abstract: Group invariants are used in high energy physics to define quantum field theory interactions. In this paper, we are presenting the parallel algebraic computation of special invariants called symplectic and even focusing on one particular invariant that finds recent interest in physics. Our results will export to other invariants. The cost of performing basic computations on the multivariate polynomials involved evolves during the computation, as the polynomials get larger or with an increasing number of terms. However, in some cases, they stay small. Traditionally, high-performance software is optimized by running it on a smaller data set in order to use profiling information to set some tuning parameters. Since the (communication and computation) costs evolve during the computation, the first iterations of the computation might not be representative of the rest of the computation and this approach cannot be applied in this case. To cope with this evolution, we are presenting an approach to get performance data and tune the algorithm during the execution.

Submitted 2 March, 2020; originally announced March 2020.

arXiv:1906.05020

doi 10.1016/j.parco.2019.02.006

Checkpoint/restart approaches for a thread-based MPI runtime

Authors: Julien Adam, Maxime Kermarquer, Jean-Baptiste Besnard, Leonardo Bautista-Gomez, Marc Perache, Patrick Carribault, Julien Jaeger, Allen D. Malony, Sameer Shende

Abstract: Fault-tolerance has always been an important topic when it comes to running massively parallel programs at scale. Statistically, hardware and software failures are expected to occur more often on systems gathering millions of computing units. Moreover, the larger jobs are, the more computing hours would be wasted by a crash. In this paper, we describe the work done in our MPI runtime to enable both transparent and application-level checkpointing mechanisms. Unlike the MPI 4.0 User-Level Failure Mitigation (ULFM) interface, our work targets solely Checkpoint/Restart and ignores other features such as resiliency. We show how existing checkpointing methods can be practically applied to a thread-based MPI implementation given sufficient runtime collaboration. The two main contributions are the preservation of high-speed network performance during transparent C/R and the over-subscription of checkpoint data replication thanks to a dedicated user-level scheduler support. These techniques are measured on MPI benchmarks such as IMB, Lulesh and Heatdis, and associated overhead and trade-offs are discussed.

Submitted 12 June, 2019; originally announced June 2019.

Comments: This research has been partially sponsored by the European Union's Horizon 2020 Programme under the LEGaTO Project (www.legato-project.eu), grant agreement 780681 and the Mont-Blanc 2020 project, grant agreement no. 779877

arXiv:1701.08547

Autotuning GPU Kernels via Static and Predictive Analysis

Authors: Robert V. Lim, Boyana Norris, Allen D. Malony

Abstract: Optimizing the performance of GPU kernels is challenging for both human programmers and code generators. For example, CUDA programmers must set thread and block parameters for a kernel, but might not have the intuition to make a good choice. Similarly, compilers can generate working code, but may miss tuning opportunities by not targeting GPU models or performing code transformations. Although empirical autotuning addresses some of these challenges, it requires extensive experimentation and search for optimal code variants. This research presents an approach for tuning CUDA kernels based on static analysis that considers fine-grained code structure and the specific GPU architecture features. Notably, our approach does not require any program runs in order to discover near-optimal parameter settings. We demonstrate the applicability of our approach in enabling code autotuners such as Orio to produce competitive code variants comparable with empirical-based methods, without the high cost of experiments.

Submitted 29 June, 2017; v1 submitted 30 January, 2017; originally announced January 2017.

Showing 1–8 of 8 results for author: Malony, A D