-
Continual learning autoencoder training for a particle-in-cell simulation via streaming
Authors:
Patrick Stiller,
Varun Makdani,
Franz Pöschel,
Richard Pausch,
Alexander Debus,
Michael Bussmann,
Nico Hoffmann
Abstract:
The upcoming exascale era will provide a new generation of physics simulations. These simulations will have a high spatiotemporal resolution, which will impact the training of machine learning models, since storing such large amounts of simulation data on disk is nearly impossible. Therefore, we need to rethink the training of machine learning models for simulations in the upcoming exascale era. This work presents an approach that trains a neural network concurrently with a running simulation without storing data on disk. The training pipeline accesses the training data by in-memory streaming. Furthermore, we apply methods from the domain of continual learning to enhance the generalization of the model. We tested our pipeline by training a 3D autoencoder concurrently with a laser wakefield acceleration particle-in-cell simulation. We also experimented with various continual learning methods and their effect on generalization.
Submitted 9 November, 2022;
originally announced November 2022.
-
Improving I/O Performance for Exascale Applications through Online Data Layout Reorganization
Authors:
Lipeng Wan,
Axel Huebl,
Junmin Gu,
Franz Poeschel,
Ana Gainaru,
Ruonan Wang,
Jieyang Chen,
Xin Liang,
Dmitry Ganyushin,
Todd Munson,
Ian Foster,
Jean-Luc Vay,
Norbert Podhorszki,
Kesheng Wu,
Scott Klasky
Abstract:
The applications being developed within the U.S. Exascale Computing Project (ECP) to run on imminent Exascale computers will generate scientific results with unprecedented fidelity and record turn-around time. Many of these codes are based on particle-mesh methods and use advanced algorithms, especially dynamic load-balancing and mesh-refinement, to achieve high performance on Exascale machines. Yet, as such algorithms improve parallel application efficiency, they raise new challenges for I/O logic due to their irregular and dynamic data distributions. Thus, while the enormous data rates of Exascale simulations already challenge existing file system write strategies, the need to read and process the generated data efficiently introduces additional constraints on the data layout strategies that can be used when writing data to secondary storage. We review these I/O challenges and introduce two online data layout reorganization approaches for achieving good tradeoffs between read and write performance. We demonstrate the benefits of using these two approaches for the ECP particle-in-cell simulation WarpX, which serves as a motif for a large class of important Exascale applications. We show that by understanding application I/O patterns and carefully designing data layouts we can increase read performance by more than 80%.
Submitted 15 July, 2021;
originally announced July 2021.
-
Transitioning from file-based HPC workflows to streaming data pipelines with openPMD and ADIOS2
Authors:
Franz Poeschel,
Juncheng E,
William F. Godoy,
Norbert Podhorszki,
Scott Klasky,
Greg Eisenhauer,
Philip E. Davis,
Lipeng Wan,
Ana Gainaru,
Junmin Gu,
Fabian Koller,
René Widera,
Michael Bussmann,
Axel Huebl
Abstract:
This paper aims to create a transition path from file-based IO to streaming-based workflows for scientific applications in an HPC environment. By using the openPMD-api, traditional workflows limited by filesystem bottlenecks can be overcome and flexibly extended for in situ analysis. The openPMD-api is a library for the description of scientific data according to the Open Standard for Particle-Mesh Data (openPMD). Its approach towards recent challenges posed by hardware heterogeneity lies in the decoupling of data description in domain sciences, such as plasma physics simulations, from concrete implementations in hardware and IO. The streaming backend is provided by the ADIOS2 framework, developed at Oak Ridge National Laboratory. This paper surveys two openPMD-based loosely-coupled setups to demonstrate flexible applicability and to evaluate performance. In loose coupling, as opposed to tight coupling, two (or more) applications are executed separately, e.g. in individual MPI contexts, yet cooperate by exchanging data. This way, a streaming-based workflow allows for standalone codes instead of tightly-coupled plugins, using a unified streaming-aware API and leveraging high-speed communication infrastructure available in modern compute clusters for massive data exchange. We identify new challenges in resource allocation and in the need for strategies for flexible data distribution, demonstrating their influence on efficiency and scaling on the Summit compute system. The presented setups show the potential for a more flexible use of compute resources brought by streaming IO as well as the ability to increase throughput by avoiding filesystem bottlenecks.
Submitted 19 January, 2022; v1 submitted 13 July, 2021;
originally announced July 2021.