-
Scalable Training of Graph Foundation Models for Atomistic Materials Modeling: A Case Study with HydraGNN
Authors:
Massimiliano Lupo Pasini,
Jong Youl Choi,
Kshitij Mehta,
Pei Zhang,
David Rogers,
Jonghyun Bae,
Khaled Z. Ibrahim,
Ashwin M. Aji,
Karl W. Schulz,
Jorda Polo,
Prasanna Balaprakash
Abstract:
We present our work on developing and training scalable graph foundation models (GFMs) using HydraGNN, a multi-headed graph convolutional neural network architecture. HydraGNN expands the boundaries of graph neural networks (GNNs) in both training scale and data diversity. It abstracts over message passing algorithms, allowing both reproduction of and comparison across algorithmic innovations that define convolution in GNNs. This work discusses a series of optimizations that have allowed scaling up the GFM training to tens of thousands of GPUs on datasets that consist of hundreds of millions of graphs. Our GFMs use multi-task learning (MTL) to simultaneously learn graph-level and node-level properties of atomistic structures, such as the total energy and atomic forces. Using over 150 million atomistic structures for training, we illustrate the performance of our approach along with the lessons learned on two United States Department of Energy (US-DOE) supercomputers, namely the Perlmutter petascale system at the National Energy Research Scientific Computing Center and the Frontier exascale system at Oak Ridge National Laboratory. The HydraGNN architecture enables the GFM to achieve near-linear strong scaling performance using more than 2,000 GPUs on Perlmutter and 16,000 GPUs on Frontier. Hyperparameter optimization (HPO) was performed on over 64,000 GPUs on Frontier to select GFM architectures with high accuracy. Early stopping was applied on each GFM architecture for energy awareness in performing such an extreme-scale task. The training of an ensemble of highest-ranked GFM architectures continued until convergence to establish uncertainty quantification (UQ) capabilities with ensemble learning. Our contribution opens the door for rapidly developing, training, and deploying GFMs using large-scale computational resources to enable AI-accelerated materials discovery and design.
Submitted 28 June, 2024; v1 submitted 12 June, 2024;
originally announced June 2024.
-
Cost-Effective Methodology for Complex Tuning Searches in HPC: Navigating Interdependencies and Dimensionality
Authors:
Adrian Perez Dieguez,
Min Choi,
Mahmut Okyay,
Mauro Del Ben,
Bryan M. Wong,
Khaled Z. Ibrahim
Abstract:
Tuning searches are pivotal in High-Performance Computing (HPC), addressing complex optimization challenges in computational applications. The complexity arises not only from finely tuning parameters within routines but also from potential interdependencies among them, rendering traditional optimization methods inefficient. Instead of scrutinizing interdependencies among parameters and routines, practitioners often face the dilemma of conducting independent tuning searches for each routine, thereby overlooking interdependence, or pursuing a more resource-intensive joint search for all routines. This decision is driven by the consideration that some interdependence analysis and high-dimensional decomposition techniques in the literature may be prohibitively expensive in HPC tuning searches. Our methodology adapts and refines these methods to ensure computational feasibility while maximizing performance gains in real-world scenarios. Our methodology leverages a cost-effective interdependence analysis to decide whether to merge several tuning searches into a joint search or conduct orthogonal searches. Tested on synthetic functions with varying levels of parameter interdependence, our methodology efficiently explores the search space. In comparison to Bayesian-optimization-based fully independent or fully joint searches, our methodology suggested an optimized breakdown of independent and merged searches that led to final configurations up to 8% more accurate, reducing the search time by up to 95%. When applied to GPU-offloaded Real-Time Time-Dependent Density Functional Theory (RT-TDDFT), an application in computational materials science that challenges modern HPC autotuners, our methodology achieved an effective tuning search. Its adaptability and efficiency extend beyond RT-TDDFT, making it valuable for related applications in HPC.
Submitted 12 March, 2024;
originally announced March 2024.
-
Sparse-Stochastic Fragmented Exchange for Large-Scale Hybrid TDDFT Calculations
Authors:
Mykola Sereda,
Tucker Allen,
Nadine C. Bradbury,
Khaled Z. Ibrahim,
Daniel Neuhauser
Abstract:
We extend our recently developed sparse-stochastic fragmented exchange formalism for ground-state hybrid DFT (ngH-DFT) to calculate absorption spectra within linear-response time-dependent Generalized Kohn-Sham DFT (LR-GKS-TDDFT), for systems consisting of thousands of valence electrons within a grid-based/plane-wave representation. A mixed deterministic/fragmented-stochastic compression of the exchange kernel, here using long-range explicit exchange functionals, provides an efficient method for accurate optical spectra. Both real-time propagation as well as frequency-resolved Casida-equation-type approaches for spectra are presented, and the method is applied to large molecular dyes.
Submitted 25 February, 2024;
originally announced February 2024.
-
An Evaluation of Real-time Adaptive Sampling Change Point Detection Algorithm using KCUSUM
Authors:
Vijayalakshmi Saravanan,
Perry Siehien,
Shinjae Yoo,
Hubertus Van Dam,
Thomas Flynn,
Christopher Kelly,
Khaled Z Ibrahim
Abstract:
Detecting abrupt changes in real-time data streams from scientific simulations presents a challenging task, demanding the deployment of accurate and efficient algorithms. Identifying change points in a live data stream involves continuous scrutiny of incoming observations for deviations in their statistical characteristics, particularly in high-volume data scenarios. Maintaining a balance between sudden change detection and minimizing false alarms is vital. Many existing algorithms for this purpose rely on known probability distributions, limiting their feasibility. In this study, we introduce the Kernel-based Cumulative Sum (KCUSUM) algorithm, a non-parametric extension of the traditional Cumulative Sum (CUSUM) method, which has gained prominence for its efficacy in online change point detection under less restrictive conditions. KCUSUM distinguishes itself by comparing incoming samples directly with reference samples and computing a statistic grounded in the Maximum Mean Discrepancy (MMD) non-parametric framework. This approach extends KCUSUM's pertinence to scenarios where only reference samples are available, such as atomic trajectories of proteins in vacuum, facilitating the detection of deviations from the reference sample without prior knowledge of the data's underlying distribution. Furthermore, by harnessing MMD's inherent random-walk structure, we can theoretically analyze KCUSUM's performance across various use cases, including metrics like expected delay and mean runtime to false alarms. Finally, we discuss real-world use cases from scientific simulations such as NWChem CODAR and protein folding data, demonstrating KCUSUM's practical effectiveness in online change point detection.
Submitted 4 April, 2024; v1 submitted 15 February, 2024;
originally announced February 2024.
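The detection loop described in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the Gaussian kernel, the pairwise MMD estimate, the scalar samples, and the names `gaussian_kernel`, `mmd_pair`, `kcusum_detect`, `drift`, and `threshold` are all assumptions chosen for illustration. Incoming samples are consumed two at a time, compared with two reference samples through a kernel-based estimate of the squared MMD, and the result is accumulated in a CUSUM-style statistic that is clipped at zero.

```python
import math
import random


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars; the kernel choice is an assumption."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd_pair(x1, x2, y1, y2, kernel=gaussian_kernel):
    """Estimate of squared MMD from two incoming (x) and two reference (y) samples."""
    return kernel(x1, x2) + kernel(y1, y2) - kernel(x1, y2) - kernel(x2, y1)


def kcusum_detect(stream, reference, drift=0.1, threshold=5.0, seed=0):
    """Return the 0-based index of the sample at which an alarm is raised, or None.

    stream:    incoming scalar observations, consumed in consecutive pairs
    reference: at least two scalar samples from the pre-change regime
    drift:     hypothetical constant subtracted each step so the statistic
               tends to stay near zero while no change is present
    threshold: hypothetical alarm level
    """
    rng = random.Random(seed)
    statistic = 0.0
    for i in range(1, len(stream), 2):
        x1, x2 = stream[i - 1], stream[i]
        y1, y2 = rng.sample(reference, 2)  # two distinct reference positions
        increment = mmd_pair(x1, x2, y1, y2) - drift
        statistic = max(0.0, statistic + increment)  # CUSUM-style clipping
        if statistic >= threshold:
            return i
    return None
```

With a stream of 100 zeros followed by 100 fives and an all-zero reference set, the statistic stays at zero before the change and crosses the default threshold three pairs after it, so the function returns index 105. Larger `drift` or `threshold` values trade detection delay for fewer false alarms.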
-
Velocity-gauge real-time time-dependent density functional tight-binding for large-scale condensed matter systems
Authors:
Qiang Xu,
Mauro Del Ben,
Mahmut Sait Okyay,
Min Choi,
Khaled Z. Ibrahim,
Bryan M. Wong
Abstract:
We present a new velocity-gauge real-time, time-dependent density functional tight-binding (VG-rtTDDFTB) implementation in the open-source DFTB+ software package (https://dftbplus.org) for probing electronic excitations in large, condensed matter systems. Our VG-rtTDDFTB approach enables real-time electron dynamics simulations of large, periodic, condensed matter systems containing thousands of atoms with a favorable computational scaling as a function of system size. We provide computational details and benchmark calculations to demonstrate its accuracy and computational parallelizability on a variety of large material systems. As a representative example, we calculate laser-induced electron dynamics in a 512-atom amorphous silicon supercell to highlight the large periodic systems that can be examined with our implementation. Taken together, our VG-rtTDDFTB approach enables new electron dynamics simulations of complex systems that require large periodic supercells, such as crystal defects, complex surfaces, nanowires, and amorphous materials.
Submitted 21 May, 2024; v1 submitted 18 August, 2023;
originally announced August 2023.
-
Dynamic Mode Decomposition for Extrapolating Non-equilibrium Green's Functions Dynamics
Authors:
Cian C. Reeves,
Jia Yin,
Yuanran Zhu,
Khaled Z. Ibrahim,
Chao Yang,
Vojtech Vlcek
Abstract:
The HF-GKBA offers an approximate numerical procedure for propagating the two-time non-equilibrium Green's function (NEGF). Here we compare the HF-GKBA to exact results for a variety of systems with long- and short-range interactions, different two-body interaction strengths, and various non-equilibrium preparations. We find excellent agreement between the HF-GKBA and exact time evolution in models when more realistic long-range exponentially decaying interactions are considered. This agreement persists for long times and for intermediate to strong interaction strengths. In large systems, HF-GKBA becomes prohibitively expensive for long-time evolutions. For this reason, we look at the use of dynamical mode decomposition (DMD) to reconstruct long-time NEGF trajectories from a sample of the initial trajectory. Using no more than 16% of the total time evolution, we reconstruct the total trajectory with high fidelity. Our results show the potential for DMD to be used in conjunction with HF-GKBA to calculate long-time trajectories in large-scale systems.
Submitted 20 January, 2023; v1 submitted 10 November, 2022;
originally announced November 2022.
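The extrapolation idea in the abstract above, fitting a linear model to an initial window of a trajectory and then propagating it forward, can be sketched on a toy problem. This is only an illustration under stated assumptions, not the paper's procedure: the state is a two-component vector rather than a Green's function, the synthetic `damped_rotation` trajectory stands in for real NEGF data, and the sketch keeps only the least-squares propagator fit A = X' Xᵀ (X Xᵀ)⁻¹ that underlies DMD, omitting the SVD rank truncation and explicit mode decomposition used in practice. All function names are hypothetical.

```python
import math


def fit_linear_propagator(snapshots):
    """Least-squares fit of a 2x2 matrix A with snapshots[k+1] ~= A @ snapshots[k]."""
    g = [[0.0, 0.0], [0.0, 0.0]]  # accumulates X X^T
    c = [[0.0, 0.0], [0.0, 0.0]]  # accumulates X' X^T (X' = shifted snapshots)
    for x, x_next in zip(snapshots[:-1], snapshots[1:]):
        for i in range(2):
            for j in range(2):
                g[i][j] += x[i] * x[j]
                c[i][j] += x_next[i] * x[j]
    det = g[0][0] * g[1][1] - g[0][1] * g[1][0]
    if abs(det) < 1e-14:
        raise ValueError("snapshots do not span the state space")
    g_inv = [[g[1][1] / det, -g[0][1] / det],
             [-g[1][0] / det, g[0][0] / det]]
    # A = C G^{-1}
    return [[sum(c[i][k] * g_inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def extrapolate(a, x0, steps):
    """Apply the fitted propagator repeatedly, returning steps + 1 states."""
    trajectory = [list(x0)]
    for _ in range(steps):
        x = trajectory[-1]
        trajectory.append([a[0][0] * x[0] + a[0][1] * x[1],
                           a[1][0] * x[0] + a[1][1] * x[1]])
    return trajectory


def damped_rotation(steps, decay=0.99, angle=0.2):
    """Synthetic stand-in trajectory: a decaying rotation in the plane."""
    cos_a, sin_a = decay * math.cos(angle), decay * math.sin(angle)
    trajectory = [[1.0, 0.0]]
    for _ in range(steps):
        x, y = trajectory[-1]
        trajectory.append([cos_a * x - sin_a * y, sin_a * x + cos_a * y])
    return trajectory


# Fit on an initial window only, then extrapolate over the full time range.
full = damped_rotation(100)
propagator = fit_linear_propagator(full[:17])
predicted = extrapolate(propagator, full[0], 100)
```

Because the toy dynamics are exactly linear, `predicted` matches `full` to rounding error. For real data the quality of the extrapolation depends on how well a low-rank linear model captures the dynamics, which is the question the paper investigates.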
-
Enhancing Scalability of a Matrix-Free Eigensolver for Studying Many-Body Localization
Authors:
Roel Van Beeumen,
Khaled Z. Ibrahim,
Gregory D. Kahanamoku-Meyer,
Norman Y. Yao,
Chao Yang
Abstract:
In [Van Beeumen et al., HPC Asia 2020, https://www.doi.org/10.1145/3368474.3368497], a scalable and matrix-free eigensolver was proposed for studying the many-body localization (MBL) transition of two-level quantum spin chain models with nearest-neighbor $XX+YY$ interactions plus $Z$ terms. This type of problem is computationally challenging because the vector space dimension grows exponentially with the physical system size, and averaging over different configurations of the random disorder is needed to obtain relevant statistical behavior. For each eigenvalue problem, eigenvalues from different regions of the spectrum and their corresponding eigenvectors need to be computed. Traditionally, the interior eigenstates for a single eigenvalue problem are computed via the shift-and-invert Lanczos algorithm. Due to the extremely high memory footprint of the LU factorizations, this technique is not well suited for a large number of spins $L$, e.g., one needs thousands of compute nodes on modern high performance computing infrastructures to go beyond $L = 24$. The matrix-free approach does not suffer from this memory bottleneck; however, its scalability is limited by a computation and communication imbalance. We present a few strategies to reduce this imbalance and to significantly enhance the scalability of the matrix-free eigensolver. To optimize the communication performance, we leverage the consistent space runtime, CSPACER, and show its efficiency in accelerating the MBL irregular communication patterns at scale compared to optimized MPI non-blocking two-sided and one-sided RMA implementation variants. The efficiency and effectiveness of the proposed algorithm are demonstrated by computing eigenstates on a massively parallel many-core high performance computer.
Submitted 30 November, 2020;
originally announced December 2020.
-
An efficient basis set representation for calculating electrons in molecules
Authors:
Jeremiah R. Jones,
Francois-Henry Rouet,
Keith V. Lawler,
Eugene Vecharynski,
Khaled Z. Ibrahim,
Samuel Williams,
Brant Abeln,
Chao Yang,
Daniel J. Haxton,
C. William McCurdy,
Xiaoye S. Li,
Thomas N. Rescigno
Abstract:
The method of McCurdy, Baertschy, and Rescigno, J. Phys. B, 37, R137 (2004) is generalized to obtain a straightforward, surprisingly accurate, and scalable numerical representation for calculating the electronic wave functions of molecules. It uses a basis set of product sinc functions arrayed on a Cartesian grid, and yields 1 kcal/mol precision for valence transition energies with a grid resolution of approximately 0.1 bohr. The Coulomb matrix elements are replaced with matrix elements obtained from the kinetic energy operator. A resolution-of-the-identity approximation renders the primitive one- and two-electron matrix elements diagonal; in other words, the Coulomb operator is local with respect to the grid indices. The calculation of contracted two-electron matrix elements among orbitals requires only O(N log(N)) multiplication operations, not O(N^4), where N is the number of basis functions; N = n^3 on cubic grids. The representation not only is numerically expedient, but also produces energies and properties superior to those calculated variationally. Absolute energies, absorption cross sections, transition energies, and ionization potentials are reported for one- (He^+, H_2^+ ), two- (H_2, He), ten- (CH_4) and 56-electron (C_8H_8) systems.
Submitted 22 January, 2016; v1 submitted 13 July, 2015;
originally announced July 2015.