-
Understanding the Impact of openPMD on BIT1, a Particle-in-Cell Monte Carlo Code, through Instrumentation, Monitoring, and In-Situ Analysis
Authors:
Jeremy J. Williams,
Stefan Costea,
Allen D. Malony,
David Tskhakaya,
Leon Kos,
Ales Podolnik,
Jakub Hromadka,
Kevin Huck,
Erwin Laure,
Stefano Markidis
Abstract:
Particle-in-Cell Monte Carlo simulations on large-scale systems play a fundamental role in understanding the complexities of plasma dynamics in fusion devices. Efficient handling and analysis of vast datasets are essential for advancing these simulations. Previously, we addressed this challenge by integrating openPMD with BIT1, a Particle-in-Cell Monte Carlo code, streamlining data streaming and s…
▽ More
Particle-in-Cell Monte Carlo simulations on large-scale systems play a fundamental role in understanding the complexities of plasma dynamics in fusion devices. Efficient handling and analysis of vast datasets are essential for advancing these simulations. Previously, we addressed this challenge by integrating openPMD with BIT1, a Particle-in-Cell Monte Carlo code, streamlining data streaming and storage. This integration not only enhanced data management but also improved write throughput and storage efficiency. In this work, we delve deeper into the impact of BIT1 openPMD BP4 instrumentation, monitoring, and in-situ analysis. Utilizing cutting-edge profiling and monitoring tools such as gprof, CrayPat, Cray Apprentice2, IPM, and Darshan, we dissect BIT1's performance post-integration, shedding light on computation, communication, and I/O operations. Fine-grained instrumentation offers insights into BIT1's runtime behavior, while immediate monitoring aids in understanding system dynamics and resource utilization patterns, facilitating proactive performance optimization. Advanced visualization techniques further enrich our understanding, enabling the optimization of BIT1 simulation workflows aimed at controlling plasma-material interfaces with improved data analysis and visualization at every checkpoint without causing any interruption to the simulation.
△ Less
Submitted 27 June, 2024;
originally announced June 2024.
-
Autonomy Loops for Monitoring, Operational Data Analytics, Feedback, and Response in HPC Operations
Authors:
Francieli Boito,
Jim Brandt,
Valeria Cardellini,
Philip Carns,
Florina M. Ciorba,
Hilary Egan,
Ahmed Eleliemy,
Ann Gentile,
Thomas Gruber,
Jeff Hanson,
Utz-Uwe Haus,
Kevin Huck,
Thomas Ilsche,
Thomas Jakobsche,
Terry Jones,
Sven Karlsson,
Abdullah Mueen,
Michael Ott,
Tapasya Patki,
Ivy Peng,
Krishnan Raghavan,
Stephen Simms,
Kathleen Shoga,
Michael Showerman,
Devesh Tiwari
, et al. (2 additional authors not shown)
Abstract:
Many High Performance Computing (HPC) facilities have developed and deployed frameworks in support of continuous monitoring and operational data analytics (MODA) to help improve efficiency and throughput. Because of the complexity and scale of systems and workflows and the need for low-latency response to address dynamic circumstances, automated feedback and response have the potential to be more…
▽ More
Many High Performance Computing (HPC) facilities have developed and deployed frameworks in support of continuous monitoring and operational data analytics (MODA) to help improve efficiency and throughput. Because of the complexity and scale of systems and workflows and the need for low-latency response to address dynamic circumstances, automated feedback and response have the potential to be more effective than current human-in-the-loop approaches which are laborious and error prone. Progress has been limited, however, by factors such as the lack of infrastructure and feedback hooks, and successful deployment is often site- and case-specific. In this position paper we report on the outcomes and plans from a recent Dagstuhl Seminar, seeking to carve a path for community progress in the development of autonomous feedback loops for MODA, based on the established formalism of similar (MAPE-K) loops in autonomous computing and self-adaptive systems. By defining and develo** such loops for significant cases experienced across HPC sites, we seek to extract commonalities and develop conventions that will facilitate interoperability and interchangeability with system hardware, software, and applications across different sites, and will motivate vendors and others to provide telemetry interfaces and feedback hooks to enable community development and pervasive deployment of MODA autonomy loops.
△ Less
Submitted 30 January, 2024;
originally announced January 2024.
-
STaKTAU: profiling HPC applications' operating system usage
Authors:
Camille Coti,
Kevin Huck,
Allen D. Malony
Abstract:
This paper presents a approach for measuring the time spent by HPC applications in the operating system's kernel. We use the SystemTap interface to insert timers before and after system calls, and take advantage of its stability to design a tool that can be used with multiple versions of the kernel. We evaluate its performance overhead, using an OS-intensive mini-benchmark and a raytracing mini ap…
▽ More
This paper presents a approach for measuring the time spent by HPC applications in the operating system's kernel. We use the SystemTap interface to insert timers before and after system calls, and take advantage of its stability to design a tool that can be used with multiple versions of the kernel. We evaluate its performance overhead, using an OS-intensive mini-benchmark and a raytracing mini app.
△ Less
Submitted 21 April, 2023;
originally announced April 2023.
-
Simulating Stellar Merger using HPX/Kokkos on A64FX on Supercomputer Fugaku
Authors:
Patrick Diehl,
Gregor Daiß,
Kevin Huck,
Dominic Marcello,
Sagiv Shiber,
Hartmut Kaiser,
Dirk Pflüger
Abstract:
The increasing availability of machines relying on non-GPU architectures, such as ARM A64FX in high-performance computing, provides a set of interesting challenges to application developers. In addition to requiring code portability across different parallelization schemes, programs targeting these architectures have to be highly adaptable in terms of compute kernel sizes to accommodate different…
▽ More
The increasing availability of machines relying on non-GPU architectures, such as ARM A64FX in high-performance computing, provides a set of interesting challenges to application developers. In addition to requiring code portability across different parallelization schemes, programs targeting these architectures have to be highly adaptable in terms of compute kernel sizes to accommodate different execution characteristics for various heterogeneous workloads. In this paper, we demonstrate an approach to code and performance portability that is based entirely on established standards in the industry. In addition to applying Kokkos as an abstraction over the execution of compute kernels on different heterogeneous execution environments, we show that the use of standard C++ constructs as exposed by the HPX runtime system enables superb portability in terms of code and performance based on the real-world Octo-Tiger astrophysics application. We report our experience with porting Octo-Tiger to the ARM A64FX architecture provided by Stony Brook's Ookami and Riken's Supercomputer Fugaku and compare the resulting performance with that achieved on well established GPU-oriented HPC machines such as ORNL's Summit, NERSC's Perlmutter and CSCS's Piz Daint systems. Octo-Tiger scaled well on Supercomputer Fugaku without any major code changes due to the abstraction levels provided by HPX and Kokkos. Adding vectorization support for ARM's SVE to Octo-Tiger was trivial thanks to using standard C++
△ Less
Submitted 15 March, 2023;
originally announced April 2023.
-
Distributed, combined CPU and GPU profiling within HPX using APEX
Authors:
Patrick Diehl,
Gregor Daiss,
Kevin Huck,
Dominic Marcello,
Sagiv Shiber,
Hartmut Kaiser,
Juhan Frank,
Geoffrey C. Clayton,
Dirk Pflueger
Abstract:
Benchmarking and comparing performance of a scientific simulation across hardware platforms is a complex task. When the simulation in question is constructed with an asynchronous, many-task (AMT) runtime offloading work to GPUs, the task becomes even more complex. In this paper, we discuss the use of a uniquely suited performance measurement library, APEX, to capture the performance behavior of a…
▽ More
Benchmarking and comparing performance of a scientific simulation across hardware platforms is a complex task. When the simulation in question is constructed with an asynchronous, many-task (AMT) runtime offloading work to GPUs, the task becomes even more complex. In this paper, we discuss the use of a uniquely suited performance measurement library, APEX, to capture the performance behavior of a simulation built on HPX, a highly scalable, distributed AMT runtime. We examine the performance of the astrophysics simulation carried-out by Octo-Tiger on two different supercomputing architectures. We analyze the results of scaling and measurement overheads. In addition, we look in-depth at two similarly configured executions on the two systems to study how architectural differences affect performance and identify opportunities for optimization. As one such opportunity, we optimize the communication for the hydro solver and investigated its performance impact.
△ Less
Submitted 21 September, 2022;
originally announced October 2022.
-
Traveler: Navigating Task Parallel Traces for Performance Analysis
Authors:
Sayef Azad Sakin,
Alex Bigelow,
R. Tohid,
Connor Scully-Allison,
Carlos Scheidegger,
Steven R. Brandt,
Christopher Taylor,
Kevin A. Huck,
Hartmut Kaiser,
Katherine E. Isaacs
Abstract:
Understanding the behavior of software in execution is a key step in identifying and fixing performance issues. This is especially important in high performance computing contexts where even minor performance tweaks can translate into large savings in terms of computational resource use. To aid performance analysis, developers may collect an execution trace - a chronological log of program activit…
▽ More
Understanding the behavior of software in execution is a key step in identifying and fixing performance issues. This is especially important in high performance computing contexts where even minor performance tweaks can translate into large savings in terms of computational resource use. To aid performance analysis, developers may collect an execution trace - a chronological log of program activity during execution. As traces represent the full history, developers can discover a wide array of possibly previously unknown performance issues, making them an important artifact for exploratory performance analysis. However, interactive trace visualization is difficult due to issues of data size and complexity of meaning. Traces represent nanosecond-level events across many parallel processes, meaning the collected data is often large and difficult to explore. The rise of asynchronous task parallel programming paradigms complicates the relation between events and their probable cause. To address these challenges, we conduct a continuing design study in collaboration with high performance computing researchers. We develop diverse and hierarchical ways to navigate and represent execution trace data in support of their trace analysis tasks. Through an iterative design process, we developed Traveler, an integrated visualization platform for task parallel traces. Traveler provides multiple linked interfaces to help navigate trace data from multiple contexts. We evaluate the utility of Traveler through feedback from users and a case study, finding that integrating multiple modes of navigation in our design supported performance analysis tasks and led to the discovery of previously unknown behavior in a distributed array library.
△ Less
Submitted 3 September, 2022; v1 submitted 29 July, 2022;
originally announced August 2022.
-
Octo-Tiger's New Hydro Module and Performance Using HPX+CUDA on ORNL's Summit
Authors:
Patrick Diehl,
Gregor Daiß,
Dominic Marcello,
Kevin Huck,
Sagiv Shiber,
Hartmut Kaiser,
Juhan Frank,
Dirk Pflüger
Abstract:
Octo-Tiger is a code for modeling three-dimensional self-gravitating astrophysical fluids. It was particularly designed for the study of dynamical mass transfer between interacting binary stars. Octo-Tiger is parallelized for distributed systems using the asynchronous many-task runtime system, the C++ standard library for parallelism and concurrency (HPX) and utilizes CUDA for its gravity solver.…
▽ More
Octo-Tiger is a code for modeling three-dimensional self-gravitating astrophysical fluids. It was particularly designed for the study of dynamical mass transfer between interacting binary stars. Octo-Tiger is parallelized for distributed systems using the asynchronous many-task runtime system, the C++ standard library for parallelism and concurrency (HPX) and utilizes CUDA for its gravity solver. Recently, we have remodeled Octo-Tiger's hydro solver to use a three-dimensional reconstruction scheme. In addition, we have ported the hydro solver to GPU using CUDA kernels. We present scaling results for the new hydro kernels on ORNL's Summit machine using a Sedov-Taylor blast wave problem. We also compare Octo-Tiger's new hydro scheme with its old hydro scheme, using a rotating star as a test problem.
△ Less
Submitted 26 July, 2021; v1 submitted 22 July, 2021;
originally announced July 2021.
-
Memory Reduction using a Ring Abstraction over GPU RDMA for Distributed Quantum Monte Carlo Solver
Authors:
Weile Wei,
Eduardo D'Azevedo,
Kevin Huck,
Arghya Chatterjee,
Oscar Hernandez,
Hartmut Kaiser
Abstract:
Scientific applications that run on leadership computing facilities often face the challenge of being unable to fit leading science cases onto accelerator devices due to memory constraints (memory-bound applications). In this work, the authors studied one such US Department of Energy mission-critical condensed matter physics application, Dynamical Cluster Approximation (DCA++), and this paper disc…
▽ More
Scientific applications that run on leadership computing facilities often face the challenge of being unable to fit leading science cases onto accelerator devices due to memory constraints (memory-bound applications). In this work, the authors studied one such US Department of Energy mission-critical condensed matter physics application, Dynamical Cluster Approximation (DCA++), and this paper discusses how device memory-bound challenges were successfully reduced by proposing an effective "all-to-all" communication method -- a ring communication algorithm. This implementation takes advantage of acceleration on GPUs and remote direct memory access (RDMA) for fast data exchange between GPUs.
Additionally, the ring algorithm was optimized with sub-ring communicators and multi-threaded support to further reduce communication overhead and expose more concurrency, respectively. The computation and communication were also analyzed by using the Autonomic Performance Environment for Exascale (APEX) profiling tool, and this paper further discusses the performance trade-off for the ring algorithm implementation. The memory analysis on the ring algorithm shows that the allocation size for the authors' most memory-intensive data structure per GPU is now reduced to 1/p of the original size, where p is the number of GPUs in the ring communicator. The communication analysis suggests that the distributed Quantum Monte Carlo execution time grows linearly as sub-ring size increases, and the cost of messages passing through the network interface connector could be a limiting factor.
△ Less
Submitted 13 May, 2021; v1 submitted 30 April, 2021;
originally announced May 2021.
-
Performance Measurements within Asynchronous Task-based Runtime Systems: A Double White Dwarf Merger as an Application
Authors:
Patrick Diehl,
Dominic Marcello,
Parsa Amini,
Hartmut Kaiser,
Sagiv Shiber,
Geoffrey C. Clayton,
Juhan Frank,
Gregor Daiß,
Dirk Pflüger,
David Eder,
Alice Koniges,
Kevin Huck
Abstract:
Analyzing performance within asynchronous many-task-based runtime systems is challenging because millions of tasks are launched concurrently. Especially for long-term runs the amount of data collected becomes overwhelming. We study HPX and its performance-counter framework and APEX to collect performance data and energy consumption. We added HPX application-specific performance counters to the Oct…
▽ More
Analyzing performance within asynchronous many-task-based runtime systems is challenging because millions of tasks are launched concurrently. Especially for long-term runs the amount of data collected becomes overwhelming. We study HPX and its performance-counter framework and APEX to collect performance data and energy consumption. We added HPX application-specific performance counters to the Octo-Tiger full 3D AMR astrophysics application. This enables the combined visualization of physical and performance data to highlight bottlenecks with respect to different solvers. We examine the overhead introduced by these measurements, which is around 1%, with respect to the overall application runtime. We perform a convergence study for four different levels of refinement and analyze the application's performance with respect to adaptive grid refinement. The measurements' overheads are small, enabling the combined use of performance data and physical properties with the goal of improving the code's performance. All of these measurements were obtained on NERSC's Cori, Louisiana Optical Network Infrastructure's QueenBee2, and Indiana University's Big Red 3.
△ Less
Submitted 9 June, 2021; v1 submitted 30 January, 2021;
originally announced February 2021.
-
Performance Analysis of a Quantum Monte Carlo Application on Multiple Hardware Architectures Using the HPX Runtime
Authors:
Weile Wei,
Arghya Chatterjee,
Kevin Huck,
Oscar Hernandez,
Hartmut Kaiser
Abstract:
This paper describes how we successfully used the HPX programming model to port the DCA++ application on multiple architectures that include POWER9, x86, ARM v8, and NVIDIA GPUs. We describe the lessons we can learn from this experience as well as the benefits of enabling the HPX in the application to improve the CPU threading part of the code, which led to an overall 21% improvement across archit…
▽ More
This paper describes how we successfully used the HPX programming model to port the DCA++ application on multiple architectures that include POWER9, x86, ARM v8, and NVIDIA GPUs. We describe the lessons we can learn from this experience as well as the benefits of enabling the HPX in the application to improve the CPU threading part of the code, which led to an overall 21% improvement across architectures. We also describe how we used HPX-APEX to raise the level of abstraction to understand performance issues and to identify tasking optimization opportunities in the code, and how these relate to CPU/GPU utilization counters, device memory allocation over time, and CPU kernel-level context switches on a given architecture.
△ Less
Submitted 19 October, 2020; v1 submitted 14 October, 2020;
originally announced October 2020.
-
Chimbuko: A Workflow-Level Scalable Performance Trace Analysis Tool
Authors:
Sungsoo Ha,
Wonyong Jeong,
Gyorgy Matyasfalvi,
Cong Xie,
Kevin Huck,
Jong Youl Choi,
Abid Malik,
Li Tang,
Hubertus Van Dam,
Line Pouchard,
Wei Xu,
Shinjae Yoo,
Nicholas D'Imperio,
Kerstin Kleese Van Dam
Abstract:
Because of the limits input/output systems currently impose on high-performance computing systems, a new generation of workflows that include online data reduction and analysis is emerging. Diagnosing their performance requires sophisticated performance analysis capabilities due to the complexity of execution patterns and underlying hardware, and no tool could handle the voluminous performance tra…
▽ More
Because of the limits input/output systems currently impose on high-performance computing systems, a new generation of workflows that include online data reduction and analysis is emerging. Diagnosing their performance requires sophisticated performance analysis capabilities due to the complexity of execution patterns and underlying hardware, and no tool could handle the voluminous performance trace data needed to detect potential problems. This work introduces Chimbuko, a performance analysis framework that provides real-time, distributed, in situ anomaly detection. Data volumes are reduced for human-level processing without losing necessary details. Chimbuko supports online performance monitoring via a visualization module that presents the overall workflow anomaly distribution, call stacks, and timelines. Chimbuko also supports the capture and reduction of performance provenance. To the best of our knowledge, Chimbuko is the first online, distributed, and scalable workflow-level performance trace analysis framework, and we demonstrate the tool's usefulness on Oak Ridge National Laboratory's Summit system.
△ Less
Submitted 31 August, 2020;
originally announced August 2020.
-
From Piz Daint to the Stars: Simulation of Stellar Mergers using High-Level Abstractions
Authors:
Gregor Daiß,
Parsa Amini,
John Biddiscombe,
Patrick Diehl,
Juhan Frank,
Kevin Huck,
Hartmut Kaiser,
Dominic Marcello,
David Pfander,
Dirk Pflüger
Abstract:
We study the simulation of stellar mergers, which requires complex simulations with high computational demands. We have developed Octo-Tiger, a finite volume grid-based hydrodynamics simulation code with Adaptive Mesh Refinement which is unique in conserving both linear and angular momentum to machine precision. To face the challenge of increasingly complex, diverse, and heterogeneous HPC systems,…
▽ More
We study the simulation of stellar mergers, which requires complex simulations with high computational demands. We have developed Octo-Tiger, a finite volume grid-based hydrodynamics simulation code with Adaptive Mesh Refinement which is unique in conserving both linear and angular momentum to machine precision. To face the challenge of increasingly complex, diverse, and heterogeneous HPC systems, Octo-Tiger relies on high-level programming abstractions.
We use HPX with its futurization capabilities to ensure scalability both between nodes and within, and present first results replacing MPI with libfabric achieving up to a 2.8x speedup. We extend Octo-Tiger to heterogeneous GPU-accelerated supercomputers, demonstrating node-level performance and portability. We show scalability up to full system runs on Piz Daint. For the scenario's maximum resolution, the compute-critical parts (hydrodynamics and gravity) achieve 68.1% parallel efficiency at 2048 nodes.
△ Less
Submitted 9 August, 2019; v1 submitted 8 August, 2019;
originally announced August 2019.
-
Asynchronous Execution of Python Code on Task Based Runtime Systems
Authors:
R. Tohid,
Bibek Wagle,
Shahrzad Shirzad,
Patrick Diehl,
Adrian Serio,
Alireza Kheirkhahan,
Parsa Amini,
Katy Williams,
Kate Isaacs,
Kevin Huck,
Steven Brandt,
Hartmut Kaiser
Abstract:
Despite advancements in the areas of parallel and distributed computing, the complexity of programming on High Performance Computing (HPC) resources has deterred many domain experts, especially in the areas of machine learning and artificial intelligence (AI), from utilizing performance benefits of such systems. Researchers and scientists favor high-productivity languages to avoid the inconvenienc…
▽ More
Despite advancements in the areas of parallel and distributed computing, the complexity of programming on High Performance Computing (HPC) resources has deterred many domain experts, especially in the areas of machine learning and artificial intelligence (AI), from utilizing performance benefits of such systems. Researchers and scientists favor high-productivity languages to avoid the inconvenience of programming in low-level languages and costs of acquiring the necessary skills required for programming at this level. In recent years, Python, with the support of linear algebra libraries like NumPy, has gained popularity despite facing limitations which prevent this code from distributed runs. Here we present a solution which maintains both high level programming abstractions as well as parallel and distributed efficiency. Phylanx, is an asynchronous array processing toolkit which transforms Python and NumPy operations into code which can be executed in parallel on HPC resources by map** Python and NumPy functions and variables into a dependency tree executed by HPX, a general purpose, parallel, task-based runtime system written in C++. Phylanx additionally provides introspection and visualization capabilities for debugging and performance analysis. We have tested the foundations of our approach by comparing our implementation of widely used machine learning algorithms to accepted NumPy standards.
△ Less
Submitted 22 October, 2018; v1 submitted 17 October, 2018;
originally announced October 2018.