Scalable Training of Graph Foundation Models for Atomistic Materials Modeling: A Case Study with HydraGNN

Massimiliano Lupo Pasini1, Jong Youl Choi2, Kshitij Mehta2, Pei Zhang1,
David Rogers4, Jonghyun Bae5, Khaled Z. Ibrahim5, Ashwin M. Aji6, Karl W. Schulz6, Jordà Polo6,
Prasanna Balaprakash3
4Oak Ridge National Laboratory, National Center for Computational Sciences Division, Oak Ridge, TN, USA
5Lawrence Berkeley National Laboratory, Computer Science Department, Berkeley, CA, USA
6AMD Research, Advanced Micro Devices, USA
Email: 1[email protected], 2[email protected], 3[email protected], 4[email protected]
1Oak Ridge National Laboratory, Computational Sciences and Engineering Division, Oak Ridge, TN, USA
2Oak Ridge National Laboratory, Computer Science and Mathematics Division, Oak Ridge, TN, USA
3Oak Ridge National Laboratory, Computer and Computational Sciences Directorate, Oak Ridge, TN, USA
5[email protected], 6[email protected], 7[email protected], 8[email protected],
9[email protected], 10[email protected], 11[email protected]
Abstract

We present our work on develo** and training scalable graph foundation models (GFM) using HydraGNN, a multi-headed graph convolutional neural network architecture. HydraGNN expands the boundaries of graph neural network computations in both training scale and data diversity. It abstracts over message passing algorithms, allowing both reproduction of and comparison across algorithmic innovations that define nearest-neighbor convolution in graph neural networks. This work discusses a series of optimizations that have allowed scaling up the GFM training to tens of thousands of GPUs on datasets that consist of hundreds of millions of graphs. Our GFMs use multi-task learning (MTL) to simultaneously learn graph-level and node-level properties of atomistic structures, such as the total energy and atomic forces. Using over 150 million atomistic structures for training, we illustrate the performance of our approach along with the lessons learned on two state-of-the-art United States Department of Energy (US-DOE) supercomputers, namely the Perlmutter petascale system at the National Energy Research Scientific Computing Center and the Frontier exascale system at Oak Ridge National Laboratory. The HydraGNN architecture enables the GFM to achieve near-linear strong scaling performance using more than 2,000 GPUs on Perlmutter and 16,000 GPUs on Frontier. Hyperparameter optimization (HPO) was performed on over 64,000 GPUs on Frontier to select GFM architectures with high accuracy. Early stop** was applied on each GFM architecture for energy awareness in performing such an extreme-scale task. The training of an ensemble of highest-ranked GFM architectures continued until convergence to establish uncertainty quantification (UQ) capabilities with ensemble learning. Our contribution establishes core capabilities for rapidly develo**, training, and deploying further GFMs using large-scale computational resources to enable AI-accelerated materials discovery and design.

Index Terms:
ML data parallelism, graph neural networks, message passing, large-scale data processing for ML, atomistic materials modeling

This manuscript has been authored in part by UT-Battelle, LLC, under contract DE-AC05-00OR22725 with the US Department of Energy (DOE). The US government retains and the publisher, by accepting the article for publication, acknowledges that the US government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this manuscript, or allow others to do so, for US government purposes. DOE will provide public access to these results of federally sponsored research in accordance with the DOE Public Access Plan (http://energy.gov/downloads/doe-public-access-plan).

I Introduction

Discovery of new materials with desired properties, accurate predictions of a material’s behavior throughout its entire lifespan, and new chemical processes enabling unparalleled control of chemical transformations and transport are crucial to fundamental scientific progress in energy generation, transportation, electronics, and information technology [1]. Machine learning (ML) has shown great potential in accelerating the screening and pre-selection of materials for further experimental testing. In particular, deep learning (DL) models have shown the ability to effectively capture relevant underlying relationships due to the arrangement of atoms of different constituents within an atomistic structure [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]. DL models can be trained on the data generated from experiments and/or first-principles calculations and then used to predict the properties of interest for new inputs. The inference takes only a fraction of the time it would otherwise take to run an experiment or a full first-principles calculation while still producing sufficiently accurate results. This drastic reduction in time to predict material properties using atomistic information results in a promising path towards accelerating material discovery and design [13, 14].

However, generating vast volumes of experimental and/or first-principles data is impractical even with sophisticated experimental facilities and powerful supercomputers. Recently, foundation models (FMs) have demonstrated the capability to navigate around the challenge: once pre-trained over a large volume of available open-source data [15], an FM holds the promise to overcome this limitation by providing a jump-start to refined models by fine-tuning on smaller amounts of data for customized applications (also called downstream tasks). Reducing the number of simulations and/or experiments for generating domain-specific training data also drastically reduces the energy costs of develo** domain-specific DL models.

While state-of-the-art language-based FMs with a transformer architecture have reached outstanding results in several domains [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31], they fail to capture important topological aspects of the atomistic structures. Therefore, alternative DL architectures that are better suited to retain important topological aspects at the atomistic scale need to be be considered for the development of trustworthy FMs for materials using atomistic scale information.

Since atomistic material structures for a generic type of compound can be mapped onto a graph (where atoms can be treated as nodes and interatomic bonds as edges), graph foundation models (GFMs), which are FMs that operate on data structures as graphs, are the candidate of choice for these applications. Currently, GFMs proposed in the literature are developed by training graph neural networks (GNNs) architectures on a sufficiently large and comprehensive dataset for the domain of interest. While a few efforts have already been undertaken to develop GFMs for atomistic materials modeling applications [32, 33, 34, 35], the existing work is still at an incipient stage. Current efforts do not yet ensure their proposed approach achieves trustworthiness (interpreted as simultaneous achievement of accuracy and high confidence).

The work described in this manuscript is the first-of-its- kind large-scale training of GFMs for atomistic materials modeling. We describe our approach to develo** trustworthy supervised GFMs for atomistic materials modeling for the simultaneous prediction of energies and atomic forces. Trustworthiness is obtained (i) by performing HPO at extreme scale to identify a sufficiently broad set of architectures capable of reaching the desired accuracy and (ii) by fully training these HPO candidates to obtain high confidence over the predictions using ensemble UQ. The GFMs have been constructed using HydraGNN [36], a fully scalable GNN architecture developed at ORNL. In addition to full scalability [37], HydraGNN offers several other important capabilities: (i) multi-task learning (MTL) for simultaneously predicting multiple properties and stabilizing the training [38], which we leverage for the simultaneous prediction of energies and atomic forces; (ii) an object-oriented design for message passing neural network (MPNN) layers that allows for automated search of the best performing MPNN by treating the choice of the MPNN as a hyperparameter [39]; (iii) invariant and equivariant features that reduce the computational redundancy and time-to-solution [40], therefore saving energy; and (iv) scalable input/output (I/O) data management for efficient DDP on supercomputing facilities [41]. In addition, for this work we have added important capabilities to HydraGNN for the (a) integration of scalable hyperparameter optimization (HPO) to identify the best-performing configurations of hyperparameters in a computationally efficient manner and (b) scalable ensemble UQ to assess the confidence level of the GFM predictions.

Our focus in this paper is more on the high performance computing (HPC) aspect of the study; we illustrate our approach toward scalable data management, scaling the training process, using HPO at scale, and using ensemble UQ techniques. For training and applying such extremely large GFMs, energy consumption is a paramount concern. During the pre-training of the GFM on large volumes of open-source data, we reduced redundant computations and drastically saved energy by using (1) equivariant features to reduce the computational redundancy and (2) early stop** to select the most promising HydraGNN architectures already at very early initial stages of HPO without fully training architectures that are underperforming. Experiments were conducted on two large US-DOE supercomputers: the Perlmutter petascale machine at National Energy Research Scientific Computing Center (NERSC) and the Frontier exascale system at Oak Ridge National Laboratory (ORNL).

The rest of the paper is organized as follows. We discuss the current state of the art and introduce HydraGNN in Section II. In Section III, we discuss our approach toward develo** a scalable framework and list the different optimizations for scalable training. We discuss our use of large-scale HPO to develop a trustworthy GFM. Section IV shows the performance of different components of this work: reading large data, scaling the training process, and performing HPO at large scale. We conclude our study and discuss future work in Section V.

II Current State of the Art

II-A GNN training open-source atomistic data

To date, there have been a few approaches proposed in the literature to develop GFMs for atomistic materials modeling. In [32], the authors proposed a multi-modal approach where multiple encoding DL architectures are trained on different types of data representations and describing different types of quantities. The models are aligned to each other through a penalization term in the training loss function that forces latent vectors from each embedding space to coincide. Even if the approach is proposed to develop FMs to accelerate materials design, the datasets used comprise only organic molecules, which allows to cover only a relatively narrow set of natural elements on the periodic table.

In [34], the authors collected open-source datasets that provide labels for different properties of organic molecules. Using such a diverse collection of datasets, a GNN architecture is used for MTL in order to identify embedding spaces that capture meaningful correlations between the different labeled properties, with the promise that such an embedding space would reduce the data requirement on downstream tasks specific to organic chemistry. Since this approach is trained on open-source datasets that describe only organic molecules, this approach is not transferable to inorganic compounds. Moreover, the authors compare the performance of different MPNN layers to construct the GNN architecture by performing computationally inexpensive hyperparameter tuning on small models with few parameters and transfer the use of such hyperparameters to models of much larger scale. Albeit this approach helps limit the computational burden of HPO on large scale GFMs, the best performing configuration of hyperparameters at small scale is not guaranteed to be the best performing configuration of hyperparameters at a larger scale and on a larger set of data, because the conclusions drawn from the HPO study are model and data dependent.

In [42], the authors developed a GFM trained on the Materials Project Trajectories (MPTrj) dataset [43], using an MPNN layer that is capable to model 4-body interactions. As the authors themselves recognize in their conclusions, albeit their approach sheds light onto a promising path towards building effective GFMs for atomistic materials modeling, the impact of their work is limited by the fact that the GFM has a very small number parameters that was deliberately maintained low due to computational limitations, and this limits the expressivity of the GFM.

While not explicitly presented by their developers as GFMs, there have been other models that cover broader sets of elements of the periodic table compared to the approaches mentioned in the previous paragraph. In [33], the authors built a GNN model using MTL for simultaneously predictions of several material properties by training the GNN model on multiple datasets, including Open Catalyst 2020 (OC2020) [44] and Open Catalyst 2022 (OC2022) [45]. However, the approach considers only a single GNN architecture without performing HPO. Moreover, the set of parameters in the GNN model is relatively small, in the order of few millions of parameters, which limits the attainable accuracy on large volumes of data.

In [35], the authors studied the scaling behavior of 2D molecular GNNs under varied settings of depth, width, number of molecules, number of labels, the diversity in dataset, and the architectural choice. The authors showed that supervised pretraining of large GNNs on molecular datasets provides a rich fingerprint embedding, which is useful for 38 downstream tasks. Even if this work very systematically studied the effect of GNN model size over the predictive performance in the pre-training and fine-tuning stage with many and diverse downstream tasks, the work has two important limitations: it only considers 2D graphs and it addressed only organic compounds.

Several uncertainty quantification (UQ) methods have been applied to GNNs [46], including Bayesian GNNs [47], prediction interval methods [48], and deep ensemble methods [49]. Bayesian methods are theoretically rigorous but challenging to scale to high-dimensional data. Prediction interval methods are cost-effective but often require tedious tuning of heuristic parameters. We leverage deep ensemble methods as a compromise between cost and performance to quantify uncertainty in our GFMs.

Compared to the scientific contributions mentioned above, our work distinguishes itself by leveraging extreme scale supercomputing resources to ensure trustworthiness of the GFMs by performing (i) a systematic large scale HPO across a broad set of GNN architectures and (ii) a large scale ensemble learning (EL) for UQ.

II-B Scalability and GPU optimization for GNN training

The effect of the specific algorithmic characteristics of GNNs on performance benchmarking has been carried out on GPUs by [50], where the authors noted that GNN training differs significantly from conventional convolutional networks (CNNs) in that only 25% of the execution time is spent on dense and sparse matrix multiplications compared to 50% in CNNs. Moreover, the execution time to process graph samples in GNNs was noted to vary greatly according to the size of the graph (number of nodes and number of edges) of the input data. The studies conducted in this work showed that the majority of the time during GNN training was spent in integer operations, sorting, index selection, reductions, and scatter-gather operations needed for nodal and edge feature updates with message passing. Multi-GPU scaling was reported using up to 4 GPUs, showing about 20-50% strong scaling efficiency between 1 and 4 GPUs. Similar remarks apply to refs. [51, 52, 53, 54], which characterize subdivision of large graphs among processors and parallel aggregation during convolution steps.

These are useful conclusions for optimization of GNN training on large graphs (i.e., with millions of nodes), but need to be re-evaluated for our datasets. Training on large graphs can be highly sensitive to the splitting scheme used to partition the graph into subgraphs and to distribute them among processors. For atomistic materials modeling applications addressed in our work, the graph samples are small (with at most a few hundreds of nodes). For the GNN convolutions specifically, convolution on a batch of samples will have a much more local, block diagonal structure. Throughput should be less sensitive to the choice of molecules per batch.

Using a larger number of GPUs, the developers of the PyTorch framework for DDP showed the benefit of overlap** computation with communication, showing near-linear scaling using up to 256 NVIDIA Tesla V100 GPUs [55]. These preliminary scaling results focused on DDP for training of DL model using a moderate volume of data. Compared to this preliminary studies, our work shows near-linear scaling using 10x more GPUs and using much larger volumes of data, which introduces important challenges in I/O that we addressed to reduce computational bottlenecks and minimize communication overheads. Moreover, compared to this work, our results are generated using GPUs of newer generations, namely NVIDIA A100 installed on NERSC-Perlmutter and AMD MI250x installed on OLCF-Frontier, thereby showing that our scaling efficiency is also transferable across technologies manufactured by different vendors.

II-C HydraGNN

The complexity of the physics and the scale at which atomistic structures must be studied in response to US-DOE needs in materials science makes it compelling to develop GNN capabilities that simultaneously satisfy several important algorithmic and computer science requirements. To effectively respond to the scientific needs of the US-DOE, a GNN architecture must provide (1) capabilities to read and process data from multiple sources simultaneously, (2) flexibility to support diverse DOE-relevant scientific applications, (3) capabilities to scale the training on leadership class supercomputing facilities, (4) portability across heterogeneous computing environments, (5) continuous software maintenance by ensuring support and compatibility with upgraded software dependencies, (6) maintained documentation to support new users across a broad set of international institutions.

While several GNN architectures have been made available as open-source tools to the scientific community in the last few years [56, 57, 58, 59], none of these tools completely satisfies the above requirements. Moreover, including missing capabilities on these well-established GNN libraries requires invasive and laborious modifications for software re-design. These challenges arising from existing GNN implementations motivated our effort in develo** HydraGNN [36, 39], our ORNL-branded, scalable, multi-tasking graph neural network architecture. In response to the US-DOE scientific needs, HydraGNN provides:

  • multi-task learning (MTL) capabilities to process multi-source, multi-fidelity data [38]

  • object-oriented programming capabilities to use different MPNN layers [60], which allows flexible switching between different message policies based on the scientific needs of the specific application at hand, as well treating the MPNN layer as a tunable categorical hyperparameter with HPO

  • distributed data management techniques to efficiently scale the training of GNN models on millions of data samples using thousands of GPUs

  • portable capabilities that allow conveniently running the GNN training on diverse computing platforms with different hardware and software specifications

The HydraGNN library uses the Pytorch [61, 62] software for automatic differentiation and the Pytorch Geometric [63, 64] software for message passing. The architectural hyperparameters that determine the HydraGNN model size and complexity can be set in a configuration file to tune the model training and inference process easily. Overall, HydraGNN is developed and maintained as a high-quality software product for large scale training and development of machine learning models [36].

III Our Contribution

The work described in this manuscript is the first-of-its-kind large-scale training of GFMs for atomistic materials modeling. We have employed three key techniques for develo** a scalable and trustworthy GFM: 1) scalable data management using a scientific data management library and an in-memory data store, 2) scalable HPO that uses asynchronous Bayesian optimization for efficiently managing computing resources, and 3) ensemble methods for uncertainty quantification that allows model generalization and concurrently training multiple models. These three advancements collectively enhance the robustness, efficiency, and scalability of the GNN training process.

III-A Data Aggregation

Dataset Number of data samples Size
ANI1x [65] 4,956,005 24 GB
QM7-X [66] 4,195,237 23 GB
OC2020 [44] 134,929,018 4.3 TB
OC2022 [45] 8,847,031 648 GB
MPTrj [43] 1,580,395 17 GB
Total 154,507,686 5.2 TB
TABLE I: Overview of Datasets used for training HydraGNN

Using large datasets for graph foundation model training can enhance generalizability and ensure resilience to data variance issues that typically arise during downstream tasks. To this end, we aggregated five open-source atomistic materials modeling datasets that are extremely diverse in terms of chemical composition, atomistic configurations, and number of atoms in the system. These datasets, as listed in Table I, are: ANI1x, QM7x, OC2020, OC2022, and MPTrj.

  • ANI1x [65] consists of over 4,956,005 conformations derived from up to 57 thousand distinct molecular configurations containing the C, H, N, and O chemical elements

  • QM7x [66] is a comprehensive dataset of 42 physicochemical properties for approximately 4.2 million equilibrium and non-equilibrium structures of small organic molecules with up to seven non-hydrogen atoms from the C, N, O, S, Cl chemical elements

  • OC2020 [44] provides 1,281,040 Density Functional Theory (DFT) relaxations (134,890,000 single point calculations) across a range of oxide materials, coverages, and adsorbates.

  • OC2022 [45] provides 62,331 Density Functional Theory (DFT) relaxations (9,854,504 single point calculations) across a range of oxide materials, coverages, and adsorbates.

  • MPTrj [43]: the version of the dataset from 2020 provides DFT calculations for 83,988 atomistic structures of inorganic materials.

Each dataset is unique for the chemical compositions and the number of atoms in the atomistic structures of the compounds described. Fig. 1 shows the distribution of the number of atoms and bonds per molecule for each dataset. For the MPTrj dataset, approximately half of the molecules are relatively small in size. On the other hand, the OC2020 and OC2022 datasets consist of a more even distribution of molecules with different sizes and edge counts, with the larger molecules consisting of over 400 atoms and over 12,500 edges. In total, the data used for training our GFM consisted of 155 million molecules that consume 5.3 Terabytes of storage space. These datasets were pre-processed using a scientific data management library into a common format for efficient storage and I/O, as discussed in Section III-C.

Refer to caption
Refer to caption
Refer to caption
Refer to caption
Figure 1: The frequency distribution over number of atoms (upper) and edges (lower) for each dataset.

III-B Data Cleaning

Some of the atomistic structures were determined to have unrealistic values for atomic forces, in the order of 20,000 eV/angstrom. Thus, we first applied a data cleaning task in which we discarded all atomistic structures with an L2𝐿2L2italic_L 2-norm (also known as spectral norm) of the force tensor above 100 eV/angstrom to ensure that these data samples did not affect the training of our GFMs. The number of data samples removed from each dataset by this filtering operation is reported in Table II.

Dataset Number of data samples removed
ANI1x [65] 0
QM7-X [66] 0
OC2020 [44] 1
OC2022 [45] 12,270
MPTrj [43] 151
Total 12,421
TABLE II: Number of data samples discarded from each dataset in which the L2𝐿2L2italic_L 2-norm (also known as spectral norm) of the force tensor was over 100 eV/angstrom.

III-C Scalable Data Management

HydraGNN implements two optimization strategies that address scalability issues due to the large volume of data used for training. These strategies aim for 1) efficient storage and performant reading of large training data, and 2) fast reading of batch data during the training process. As molecular datasets are typically exported as collections of large numbers of files, storing datasets on a shared file system and then reading data from the large number of files during the training process causes a severe I/O bottleneck for GNN training. Multiple datasets cumulatively containing tens of thousands of small files put significant pressure on the filesystem’s metadata service, further slowing data access. Additionally, frequent data fetching by multiple GPUs from the file system during training loops results in a substantial slowdown in the training process. We adopted a two-pronged approach to managing large data and reducing the I/O overhead for training the GNN model. First, we pre-process the various input datasets and store their graph representation using a scientific data management library. Secondly, we use a distributed in-memory data store to load data into memory for fast shuffling of data objects during the training process.

III-C1 ADIOS for High Performance I/O

Several publicly available molecular datasets are stored using bespoke schemas and exported as large collections of files. For example, the OC2020 dataset [44] consists of over 50,000 files. Storing multiple such datasets adds prohibitively high metadata overhead on the parallel file system and leads to slow data ingestion during the training process. For efficiently storing and performant reading of large training data, we use the ADIOS  [67] scientific data management library, which provides a state-of-the-art solution for managing extreme-scale data. ADIOS is designed to provide scalable I/O on the largest supercomputers in the world and has been successfully used in science applications that write and read several petabytes in a single simulation run.

An ADIOS file is stored in a hierarchical, self-documenting format that consists of a directory with sub-files and metadata files. Data is stored in ADIOS variables and is automatically distributed across several files called ADIOS ‘sub-files.’ Users only focus on creating variables and issuing read/write calls, leaving the storage format and organization to ADIOS. For example, we store graph node features in a large array which is automatically distributed amongst several sub-files when it is written to the ADIOS file. ADIOS internally maintains metadata to track the structure and organization of data.

The number of sub-files controls the concurrency level while reading data in parallel. This n:m:𝑛𝑚n:mitalic_n : italic_m pattern in which n𝑛nitalic_n processes concurrently read data from m𝑚mitalic_m sub-files is pivotal to obtaining high reading performance using ADIOS. ADIOS provides several options to tune I/O performance, including configuring the number of sub-files. We create the graph structures from input data and store them in ADIOS as a separate pre-processing step. We have developed a data writer and reader in HydraGNN for writing and reading graph data, respectively, from ADIOS files during the training process. When an ADIOS file is created, we split molecules into three groups - ‘trainset’ representing training data, ‘valset’ for data used for validation, and ‘testset’ data for testing the model performance. This logical grou** of molecules helps us read different groups of molecules for different tasks during the training process.

III-C2 DDStore

Refer to caption
Figure 2: Different approaches for shuffling data during the training process. In a), data is read from the shared file system in which each graph object is stored in its own separate file (high file system metadata overhead). In b), data is read from an ADIOS file (low metadata overhead, high I/O bandwidth). In c) all data is read once into DDStore [41], an in-memory data store which uses MPI one-sided RMA operations to obtain data from remote processes (best performance).

Distributed data parallelism (DDP) [68, 69, 70, 71, 72] involves distributing training data amongst the available compute resources. Data is grouped into batches, and GPUs train on one batch at a time before fetching the next batch until all batches are processed in an epoch. Frequently reading data from the file system, even via a high-performance library such as ADIOS, is an expensive operation because I/O over the shared filesystem is the slowest operation in a computing system.

To provide fast data retrieval during training, we use DDStore [41], a distributed data store that provides in-memory data transfer between processes. When training begins, processes read data from ADIOS files and load into the node’s memory, which maintains a global map of data samples on each process. When a GPU requests a new batch of data, DDStore fetches the data from remote processes using low latency, fast communication techniques instead of reading data from the filesystem. By restricting access to the filesystem to the initial bootup phase, DDStore ensures that obtaining a batch is a fast, in-memory operation. Experiments described in [41] show that it leads to a 6×\times× speedup in overall training time.

DDStore provides options to tune the size of data chunks stored on each process (chunking), replicating a dataset on internal sub-groups of processes (replication), and the communication mechanism selected for fetching data. For our experiments, data is split evenly amongst all processes, and a single replica of the dataset is maintained across all processes. For efficient data retrieval, the low latency MPI one-sided remote memory access (RMA) operations were used. Fig. 2 shows the data loading and caching approach used by DDStore compared to traditional approaches that read data directly from the file system. Section IV shows the time taken to obtain a batch of data samples for different model sizes and node counts.

III-D Scalable HPO

GNNs are known for their exceptional performance in learning from graph-structured molecular datasets. However, their development and broader application are hindered by the need for meticulous tuning of the network architecture. To achieve high predictive accuracy across chemically diverse datasets, it is essential to fine-tune the hyperparameters of HydraGNN. The task of identifying optimal hyperparameter settings is daunting and has been extensively documented in existing literature [73, 74, 75, 76, 77, 78]. Manual tuning requires extensive experimentation and often results in suboptimal performance.

To perform HPO at large scale, we used DeepHyper [79], an open-source Python package designed for optimizing hyperparameters, searching for optimal neural architectures, and quantifying uncertainty through the deep ensembles. Specifically, we used asynchronous Bayesian optimization that continuously refines a surrogate model by sampling hyperparameter configurations. The efficacy of DeepHyper’s asynchronous Bayesian optimization has been demonstrated across various deep learning benchmarks, outperforming methods such as random search, genetic algorithms, and Hyperband in environments equipped with CPUs and GPUs. In the DeepHyper setup, a manager node refines the surrogate model and suggests promising configurations while worker nodes perform the evaluations. Our approach uses a centralized architecture with process-based parallelism, optimizing the allocation of tasks across computing nodes to avoid bottlenecks.

Message passing is the core methodology of GNN models since it prescribes how features of nodes and edges are updated using information contained in neighboring nodes and edges. Various MPNNs have been developed and tailored for different atomistic systems, such as SchNet [80] for organic molecules and CGCNN [81] for solid state crystals. However, when considering foundation models applicable to a broad range of systems in atomistic materials, it is not practical to confine ourselves to a specific MPNN method. In HydraGNN, the choice of MPNN is configurable through a hyperparameter, allowing the users to select the optimal model that best suits their applications. We include MPNN as a categorical hyperparameter in the HPO runs to allow for the identification of the best performing MPNN layers for the assigned training data.

In HPO, early termination strategies are vital for improving the utilization of computational resources by discarding unpromising candidates based on their performance trends. This decision has proven effective early in the training process [82]. Deephyper provides three early discarding techniques suited for asynchronous and parallel environments: (1) asynchronous successive halving, which progressively eliminates candidates based on their interim performance; (2) learning curve extrapolation, which predicts future performance from early data and facilitates early termination; and (3) constant fidelity, which sets a fixed resource allocation for each candidate before deciding whether to continue. For our tests, we used constant fidelity as it enables efficient reallocation of resources towards more promising configurations and significantly enhances operational efficiency in large-scale, distributed computing environments. We used 10 epochs as a stop** criterion for each model training in the HPO phase.

While HPO has been previously explored for GNNs, our approach uses HPO on a scale previously unattempted.

III-E Scalable Uncertainty Quantification with GNN Ensembles

Ensemble methods are widely utilized in uncertainty quantification (UQ) to compile predictions from various models, termed ensemble members, into a unified forecast. The goal of these methods is to enhance model generalization by drawing on the diverse capabilities of each individual model.[83] To promote a varied set of predictions, practices such as different model initializations, techniques like Bagging and Boosting, and the integration of diverse network architectures are employed. Research conducted by Egele et al. [84] demonstrated that expanding the variety of network architectures within an ensemble can improve the diversity, thereby increasing the precision of uncertainty assessments. They also developed a technique for concurrently training multiple candidate models, which optimizes the use of computational resources. Ensemble methods are acknowledged for their ability to deliver reliable uncertainty estimates and their ease of implementation and scalability, making them practical for various UQ applications.

To account for model (epistemic) uncertainty, we employ ensembles consisting of multiple neural networks (NNs). Our approach involves considering a collection of GNN models generated by DeepHyper, denoted by 𝒞={θi,i=1,2,,c}\mathcal{C}=\{\theta_{i},i=1,2,\cdots,c\}caligraphic_C = { italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_i = 1 , 2 , ⋯ , italic_c }. We then select K𝐾Kitalic_K models from this collection to form the ensemble, where ={θi,i=1,2,,K}\mathcal{E}=\{\theta_{i},i=1,2,\cdots,K\}caligraphic_E = { italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_i = 1 , 2 , ⋯ , italic_K } and K𝐾Kitalic_K denotes the ensemble size. For an input graph G𝐺Gitalic_G, the ensemble’s prediction is the average of prediction from all model members fθisubscript𝑓subscript𝜃𝑖f_{\theta_{i}}italic_f start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT,

y~=1Ki=1kfθi(G),~𝑦1𝐾superscriptsubscript𝑖1𝑘subscript𝑓subscript𝜃𝑖𝐺\tilde{y}=\frac{1}{K}\sum_{i=1}^{k}f_{\theta_{i}}(G),over~ start_ARG italic_y end_ARG = divide start_ARG 1 end_ARG start_ARG italic_K end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT italic_f start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_G ) , (1)

and the uncertainty is measured as the standard deviation,

σy~=1Ki=1k(fθi(G)y~)2.subscript𝜎~𝑦1𝐾superscriptsubscript𝑖1𝑘superscriptsubscript𝑓subscript𝜃𝑖𝐺~𝑦2\sigma_{\tilde{y}}=\sqrt{\frac{1}{K}\sum_{i=1}^{k}\left(f_{\theta_{i}}(G)-% \tilde{y}\right)^{2}}.italic_σ start_POSTSUBSCRIPT over~ start_ARG italic_y end_ARG end_POSTSUBSCRIPT = square-root start_ARG divide start_ARG 1 end_ARG start_ARG italic_K end_ARG ∑ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_k end_POSTSUPERSCRIPT ( italic_f start_POSTSUBSCRIPT italic_θ start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT end_POSTSUBSCRIPT ( italic_G ) - over~ start_ARG italic_y end_ARG ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG . (2)

Our method offers notable advantages in terms of generality and scalability. Central to our approach is the construction of model ensembles, which relies on scalable HPO. This methodology can be applied to any type of neural network model. The process begins with using a standard neural network architecture, conducting HPO, selecting the most suitable models, and subsequently producing uncertainty estimates. The scalability of our method is anchored in both the scalable nature of the hyperparameter search and the ability to train ensembles efficiently. Working with an ensemble of models enables many options for building consensus models, uncertainty estimation, and active learning.[83]

IV Performance Measurements

IV-A Setup

We utilize three different sizes of foundation models for scaling measurements, denoted as SMALL, MEDIUM, and LARGE. They differ in the total number of parameters, ranging from approximately 60,000 to 163 million. Table III provides details about the three model sizes.

Experiments were conducted on two DOE supercomputers: Frontier at ORNL and Perlmutter at NERSC. Both systems provide state-of-the-art GPU-based heterogeneous architectures. Frontier, located at the Oak Ridge Leadership Computing Facility at ORNL, is one of the world’s most powerful supercomputers. It comprises a total of 9,408 compute nodes, each featuring a single 64-core AMD EPYC 7763 (Milan) CPU and four AMD MI250X GPU accelerators, effectively providing eight GPU units per node. Running with one rank per GPU unit, each rank has 64 GB of DDR4 (CPU) and 64 GB of HBM2e (GPU) memory.

Perlmutter, a supercomputer at National Energy Research Scientific Computing Center (NERSC), features approximately 3000 CPU-only nodes and 1800 GPU-accelerated nodes. Our work utilizes only the GPU-accelerated nodes. Each node is equipped with an AMD EPYC 7763 (Milan) CPU and four NVIDIA Ampere A100 GPUs interconnected via NVLink-3. Running with one rank per GPU unit, each rank has 64 GB of DDR4 (CPU) and 40 GB of HBM2 (GPU) memory. Both Frontier and Perlmutter use HPE Cray Slingshot(TM) interconnects.

To aid in monitoring HydraGNN execution in real-time for a subset of the analysis carried out on Frontier, an AMD research utility, omnistat, was employed to sample a variety of GPU telemetry metrics including occupancy, high-bandwidth memory (HBM) usage, power, temperature, and frequency on a per GCD basis across all nodes assigned to an individual run. This Python-based utility was executed entirely in user-space implemented as a prometheus client on each assigned compute node and combines low-overhead sampling via AMD’s system management interface (SMI) at fixed intervals with a temporary prometheus server [85] instantiated on one CPU core of the master compute host per batch job. Minimal job overhead (less than 0.5%) was observed when running HydraGNN training with this approach for sampling intervals down to one second.

Model size SMALL MEDIUM LARGE
Type of MPNN layer EGNN EGNN EGNN
# MPNN layers 3 6 6
# neurons in MPNN layers 50 500 2,000
# FC layers 2 2 3
# neurons in FC layers 50 1,000 1,000
Number of parameters 58,404 14,539,004 163,129,004
TABLE III: GNN model sizes used for strong and weak scaling tests on NERSC-Perlmutter and OLCF-Frontier.

IV-B I/O Performance for Reading Large Data

In Section III-C, we described using the ADIOS scientific data management library for fast storage and retrieval of large training data. In this section, we show the performance of reading large data in HydraGNN for training models.

Of all the datasets used in this study, the Open Catalyst 2020 dataset is the largest in terms of the number of molecules, the storage size of the dataset, and the number of files across which data is stored. The original dataset consists of over 50,000 files. The dataset was pre-processed into ADIOS and was configured to use just over 50 ADIOS sub-files, which led to a 1000×\times× reduction in the metadata footprint.

When training begins, HydraGNN reads ADIOS data in parallel on all processes. This read operation is a two-step process in which the root process first obtains the number of graphs (molecules) followed by the size (number of atoms) and the feature metadata for each graph. This information is broadcast to all other processes that implicitly distribute the graphs evenly amongst themselves and concurrently read their assigned graphs from the ADIOS file. This set of operations is performed for all molecule groups - trainset, valset, and testset.

Fig. 3 shows the reading performance of trainset data on Frontier when all processes read their assigned graphs in parallel. We obtain over 8 Terabytes/second for higher node counts and almost 2 Terabytes/second on 128 nodes. The high I/O bandwidth is a characteristic feature of the ADIOS library as it permits multiple processes to read data spread over multiple ADIOS sub-files efficiently. A similar run on Perlmutter was not possible due to filesystem issues encountered on the system during our study.

Refer to caption
Figure 3: Read performance for the ADIOS ‘trainset’ data of the OC2020 dataset on Frontier. We obtain over 8 Terabytes/second for almost all node counts for reading 3.8 Terabytes of trainset data.

The initial step in which the root process reads several small portions of the dataset and broadcasts them is an inherently sequential set of operations. As this slows the overall I/O, we obtain lower I/O bandwidth as the root process performs these tasks for the trainset, valset, and testset data groups to read a total of approximately 500 Gigabytes of initial data. Fig. 4 shows the sustained I/O bandwidth achieved when HydraGNN reads the entire OC2020 dataset, which is 4.3 Terabytes in size. We obtain a net bandwidth of over 120 Gigabytes/second on Frontier, which allows HydraGNN to ingest the full collection of 120 million graphs in just over 30 seconds.

Refer to caption
Figure 4: Read performance for the entire OC2020 dataset on Frontier that includes training, validation, and testing data. We obtain over 120 Gigabytes/second (approximately 35 seconds) for reading 4.3 Terabytes of data.

IV-C HydraGNN Training Scaling Results

We now analyze the scaling performance of HydraGNN on Frontier and Perlmutter. We present weak and strong scaling trends, along with a breakdown of component operations in HydraGNN as we scale it up. Experiments were performed with up to 2,048 nodes on Frontier and 256 nodes on Perlmutter using the three model sizes discussed in Table III.

IV-C1 Weak Scaling

For the weak scaling runs, we configured each GPU to process 3,500 molecules equally. Fig. 5 shows the weak scaling performance on Perlmutter and Frontier as we vary in the number of GPUs used for the training. The reported time represents the average training time per epoch. We conducted experiments with up to 2,048 GPUs on Frontier and 1,024 GPUs on Perlmutter. The limited number of GPUs on Perlmutter was due to constraints on available node hours. We observe that the parallel efficiency of weak scaling experiments drops as we increase the number of GPUs beyond 256 for both Perlmutter and Frontier. This is attributed to increased communication costs as we scale the number of GPUs and the overhead associated with using varying graph sizes.

Refer to caption
Refer to caption
Figure 5: Weak scaling of HydraGNN multitasking pretraining on a problem of size 3,500 molecules per GPU for (top) Frontier and (bottom) Perlmutter.

Fig.  6 provides a breakdown of the overhead of different components of HydraGNN used in the weak scaling experiments. ‘forward’ and ‘backward’ represent the forward and backward phases of the DL model training, respectively, and ‘dataload’ denotes the cost of obtaining the next batch of data samples from DDStore after a GPU finishes processing its current batch. We notice that ‘dataload’ has a fixed cost, which expectedly becomes more prominent for the small model size and is only a fraction of the runtime as the model size increases. The forward and backward phases show an increase in runtime as we scale up the workflow as synchronization and communication operations become more expensive with increasing GPU counts.

Refer to caption
Figure 6: Weak scaling. AMD is better for large models.

IV-C2 Strong Scaling

For strong scaling runs, we trained HydraGNN on 120 million molecular graphs (approximately 4TB in size) on Frontier and 2 million molecular graphs on Perlmutter for the three model sizes. Fig. 7 shows the scaling results for 512 to 16,384 GPUs on Frontier and from 64 up to 2,048 GPUs on Perlmutter. The reported time is the average training time per epoch, similar to the weak scaling measurements. While the SMALL model’s performance deviates from the optimal linear dotted line after 2,048 GPUs on Frontier, the MEDIUM and LARGE models maintain close to linear scaling up to 16,384 GPUs on Frontier. We notice a similar trend on Perlmutter where we observe near-linear scaling upto 2048 GPUs for all model sizes.

Refer to caption
Refer to caption
Figure 7: Strong scaling of HydraGNN multitasking pretraining on a problem of 120 million graphs on Frontier and 2 million graphs on Perlmutter with three GNN model sizes.

The drop in scaling performance is attributed to load imbalance - an artifact of varying graph sizes. As shown in Fig. 1, we use a diverse dataset where molecule sizes vary by up to 400, and the number of edges in the larger molecules exceeds 12,500. This results in an imbalanced workload among GPUs in each batch, causing some GPUs to finish training before others. As GPUs must synchronize for exchanging model weights, the runtime is dominated by the GPUs that must train on larger graphs. Effectively, this leads to sub-standard utilization of compute resources and posts a challenge towards achieving high-performant, scalable training. Training on large volumes of data can help develop robust models because of the diverse nature of data, but it affects the computational performance as the workload can vary greatly.

Refer to caption
Refer to caption
Figure 8: The distribution of forward time during the training of three models with respect to the graph size, measured in the number of edges. It illustreate a linear relationship between forward time and graph size.

The EGNN model we used is particularly vulnerable to this problem. The time required for forward calculations in EGNN is directly proportional to the number of edges in the molecules. For datasets with highly variable edge counts between molecules, the likelihood of load imbalance between GPUs increases. Fig. 8 illustrates the time spent on the forward task in the EGNN model with different model sizes. We observe an almost linear relationship between forward time and graph size at each batch (measured by the number of edges). The SMALL models show large variances on both machines, which is expected due to system noise being more apparent with smaller model sizes. Significant performance differences (e.g., the difference between minimum and maximum time) are observed due to the varying graph sizes in our datasets. However, for other tasks (data loading and backward), we do not observe a similar correlation, as they are agnostic of the graph size. Fig. 9 illustrates the average percentage of time spent waiting during three tasks: data loading, forward pass, and backward pass. It highlights a significant waiting period during the forward pass, primarily due to varying graph sizes. This waiting time increases as the disparity in graph sizes among GPUs grows. Other tasks, such as data loading and backward pass, also involve waiting time, but to a much lesser extent.

Refer to caption
Figure 9: Average percentage of waiting time during three parallel tasks – data loading, forward pass, and backward pass.

To quantify the degree of load imbalance between GPUs, we compute the Load Imbalance Factor (LIF). The LIF is determined by the ratio of the maximum load observed on a computing resource to the average load across all resources. We define LIF as

LIF=Tmax/Tavg𝐿𝐼𝐹subscript𝑇𝑚𝑎𝑥subscript𝑇𝑎𝑣𝑔LIF=T_{max}/T_{avg}italic_L italic_I italic_F = italic_T start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT / italic_T start_POSTSUBSCRIPT italic_a italic_v italic_g end_POSTSUBSCRIPT (3)

where Tmaxsubscript𝑇𝑚𝑎𝑥T_{max}italic_T start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT and Tavgsubscript𝑇𝑎𝑣𝑔T_{avg}italic_T start_POSTSUBSCRIPT italic_a italic_v italic_g end_POSTSUBSCRIPT represent the maximum runtime and the average runtime for training an epoch, respectively, among all computing resources (GPUs in our case). These timings represent the time to perform training (forward and backward calculations) and do not include wait times during synchronization. For a well-balanced workload, LIF approaches 1.0 from above, whereas it increases as the workload imbalance increases. Fig. 10 presents the LIF scores to demonstrate the imbalance among processes. The trend remains consistent: while data loading and backward pass exhibit nearly balanced workloads (with scores close to 1.0), the forward pass shows imbalanced workload characteristics as it deviates from 1.0.

Refer to caption
Figure 10: Load imbalance factor.

To address the performance penalties caused by workload imbalance, one potential solution is to implement binning or sharing approaches based on graph sizes. This would help ensure balanced workloads across multiple GPUs during each batch processing. However, there is a concern that this method might negatively impact the quality of training or the training losses during the optimization phase by reducing the stochastic effect, which is crucial for effective training. Given this potential trade-off, it is crucial to explore and develop more sophisticated strategies to mitigate load imbalance while maintaining training quality. This will be a key focus for future research and development efforts.

IV-D Scalable HPO

The HPO process is performed using DeepHyper [79], a Python package that provides a common interface for the implementation and study of scalable hyperparameter search methods. DeepHyper has been specifically designed to perform efficient and scalable HPO on integrated extreme scale HPC and leadership class supercomputing facilities, and it thus suits very well for our purpose. Among the various hyperparameter search algorithms implemented in DeepHyper, we used the Centralized Bayesian Optimisation Search [86, 87, 88, 89, 90, 91], previously named as “Asynchronous Model-Based Search” (AMBS) [92]. It follows a manager-workers architecture where the manager runs the Bayesian optimization loop and workers execute parallel evaluations of the black-box function.

The hyperparameter tuning has spanned important architectural hyperparameters described in Table IV. The range of architectural hyperparameters covers regions of the hyperparameter space that allow to construct HydraGNN models of extremely diverse size, which include the SMALL model and the LARGE models described in Table III as extremes.

Hyperparameter Type Admissible values
Type of MPNN layer Categorical {PNA, EGNN, SchNet}
# MPNN layers Integer {1,…,6}
# neurons in MPNN layers Integer {100, …, 2,000}
# FC layers Integer {2,3}
# neurons in FC layers Integer {300, …, 1,000}
TABLE IV: Set of architectural HydraGNN hyperparameters tuned by scalable HPO.

We have successfully executed HPO using 8,192 Frontier nodes in parallel. Each HPO trial is associated with an independent ‘srun’ execution of the SLURM scheduler and occupies 256 nodes (i.e., 2,048 AMD Mi250x GCDs) for distributed training using DDP, thereby allowing 32 distinct HydraGNN architectures to be concurrently trained. Concurrent HPO trials are executed asynchronously, and the termination of an HPO trial is immediately followed by the start of a new one on the same set of compute nodes. This extensive scale not only tests the limits of scalability and efficiency in computational resources, but also addresses the challenges associated with the high dimensionality of the hyperparameter space that needs to be explored, ensuring that a proper balance between exploitation and exploration is maintained during the entire HPO execution.

In order to ensure that the HPO process is performed in an energy-efficient way on OLCF-Frontier, we early stop the training of HydraGNN models for each HPO trial after 10 epochs. This number of epochs allows to early stop the HPO trials that are clearly underperforming in a timely manner, without wasteful energy consumption caused by further training epochs that would not likely improve their accuracy, while still ensuring that promising HPO trials are distinguishable and selected for the next computational tasks. This approach results into impactful energy savings. Our use of DeepHyper for asynchronous Bayesian optimization, combined with a strategic deployment of early termination strategies, showcases a significant advancement in the field, optimizing GNN training in ways that have not been documented prior to this.

Fig. 11 reports the validation mean absolute error (MAE) as a function of wall-clock time. The scattered distribution of blue dots (corresponding to values of the validation MAE for different HPO trials) shows that the HPO maintains a good degree of exploration throughout the entire execution. The red solid line indicates the minimum validation MAE obtained at a given time during the HPO run. The fact that the red line progressively lowers as time progresses confirms that new HPO trials progressively selected and evaluate help identify GFM architecture with better accuracy.

Refer to caption
Figure 11: Value of validation MAE for different HPO trials as a function of wall-clock time expressed in seconds.

We found that the 10 best performing HydraGNN models identified by HPO are relatively small in size, between 4 and 6 millions parameters. The fact that HPO proposes models with small number of parameters seems to disagree with other results presented in the literature [35] for GFM applied to atomistic materials modeling, where models with increasing numbers of parameters (up to a few billions of parameters) seem to increasingly reach higher accuracy, even when they are trained on smaller volumes of data than what we use for our study. One possible explanation for this disagreement is that smaller models tend to learn faster, thus inducing HPO to favor smaller models when early stop** is applied to each HPO trial.

We also found that all the 10 best HydraGNN models selected by HPO use the PNA as MPNN layer. Among the different types of MPNN lyers tested, the PNA (albeit non-invariant and non-equivariant) was already shown to reach higher accuracy on alloy systems [38, 93, 73]. Since the OC2022 and the OC2022 datasets, obtained by slicing 3D bulk alloy systems in 2D slabs and modelling the interaction with catalysts on the 2D alloy surface, constitute the majority of the training data, the automated HPO analysis performed in this study seems to reconfirm what was already empirically observed at smaller scale in previous studies conducted by the authors.

To characterize the dynamic resource behavior of HPO, the user-space telemetry tool highlighted in Section IV-A was enabled to capture a variety of GPU metrics during a small HPO exercise utilizing 320 GPUs. Fig. 12 highlights the memory subsystem showing individual GPU memory usage on each assigned compute node with a dynamic high water mark peaking at 99% of available memory. The variability of memory utilization across different GPUs is due to the fact that different groups of GPUs (associated with different HPO trials) train HydraGNN models of different size, which affects the amount of GPU memory engaged at different stages of the model training.

Refer to caption
Figure 12: GPU HBM memory consumption traces sampled via omnistat telemetry harness (per GCD on each assigned compute node) during HPO exercise executed on OLCF-Frontier.

IV-E Energy Profiling

To quantify the energy usage as a function of different model sizes, three training epochs of the SMALL, MEDIUM, and LARGE model configurations listed in Table III were completed with the omnistat telemetry tool sampling at one second intervals. Measurements consider the entire run duration including I/O for the initial data loading process. Each model executed using 1,024 GPUs on 128 compute nodes which is the minimum node count needed to accommodate memory requirements for the LARGE model configuration. The resulting GPU energy measurements as a function of model size are summarized in Table V. While total energy scales with the execution time, note that GPU utilization (occupancy) also influences the energy consumed. Table V includes mean utilization observed across all 1,024 GPUs. The LARGE case showed the highest GPU utilization–around 89%. The underling power histories used to compute total energy consumed for each model configuration are shown in Fig. 13. From these plots, we see evidence of the underlying training process with three epoch cycles visible in the power response. Furthermore, the increased GPU utilization for the larger models leads to increased GPU power demand with the LARGE model encountering peak power measurements in excess of 520 W (the peak TDP power for the AMD MI250 socket is 560W).

Model size Duration Mean GPU Utilization GPU Energy Consumed
SMALL 17 mins 12.5 % 14.0 kWh
MEDIUM 25 mins 46.0 % 42.7 kWh
LARGE 133 mins 88.9 % 366.6 kWh
TABLE V: Energy usage during training on OLCF-Frontier.
Refer to caption
Refer to caption
Refer to caption
Figure 13: GPU Energy use over time for three models – SMALL (top), MEDIUM (middle), and LARGE (bottom). Each line represents one AMD MI250x.

IV-F Full training of best performing HydraGNN models identified by scalable HPO

The 10 best performing HydraGNN models identified by HPO have been fully trained for 40 epochs to reach convergence of the training. We report the trend of the training loss for all 10 models in Fig. 14. The training loss flattens at the end of the training history, indicating that the models have reached their maximum predictive capacity. The trained 10 models will be used to provide ensemble predictions in future tasks with uncertainty measurement.

Refer to caption
Figure 14: Full training of 10 selected models from HPO.

V Conclusion and Future Work

In this work we described our approach towards develo** and training predictive GFMs by scaling the HydraGNN architecture on hundreds of millions of atomistic materials modeling data using two DOE leadership class supercomputers, viz. NERSC-Perlmutter and OLCF-Frontier. We discussed optimizations and tools used for develo** a GFM and running hyperparameter optimization at large scale.

We used distributed data management capabilities to partition large volumes of data across distributed computing resources and efficiently exchange data samples across devices using low-latency communication methods. This helped preserve global data shuffling, which is crucial for maintaining good convergence of the GFM training. By scaling HPO on over 87% of the exascale OLCF-Frontier supercomputer, we have assessed the importance of thoroughly exploring a large set of hyperparameter configurations to identify HydraGNN architectures with high predictive accuracy. Moreover, access to exceptionally performing large scale computing facilities allowed us to develop and test ensemble UQ capabilities to measure the degree of confidence associated with the HydraGNN predictions. Performing HPO and ensemble UQ at unprecedented scale on supercomputing facilities confirms our computational readiness in using HydraGNN to develop trustworthy (i.e, accurate and confident) GFMs to support the US-DOE materials science needs by providing robust and transferable computational capabilities for AI-accelerated materials discovery and design.

Future work will be devoted to deploying the pre-trained GFMs to downstream tasks for fine-tuning, where we will illustrate the efficacy of our GFMs in reducing the amount of training data and computational resources needed to develop robust and transferable DL models for domain-specific applications.

Acknowledgements

Massimiliano Lupo Pasini would like to thank Dr. Vladimir Protopopescu for his valuable feedback in the preparation of the manuscript. This research is sponsored by the Artificial Intelligence Initiative as part of the Laboratory Directed Research and Development (LDRD) Program of Oak Ridge National Laboratory, managed by UT-Battelle, LLC, for the US Department of Energy under contract DE-AC05-00OR22725. This work used resources of the Oak Ridge Leadership Computing Facility, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR22725, under INCITE award CPH161. This work also used resources of the National Energy Research Scientific Computing Center, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231, under award ERCAP0025216 and ERCAP0027259.

References

  • [1] US Department of Energy - Frontiers in Energy Research Newsletter, “Scientific playgrounds accelerating clean energy research,” https://www.energyfrontier.us/content/scientific-playgrounds-accelerating-clean-energy-research.
  • [2] R. M. Balabin and I. Lomakina, “Neural network approach to quantum-chemistry data: Accurate prediction of density functional theory energies,” J. Chem. Phys., vol. 131, no. 7, p. 074104, 2009.
  • [3] A. Chandrasekaran, K. Deepak, E. Batra, C. Kim, L. Chen, and R. Ramprasad, “Solving the electronic structure problem with machine learning,” NPJ Comput. Mater., vol. 5, no. 22, 2019.
  • [4] F. Brockherde, L. Vogt, M. E. Tuckerman, K. Burke, and K. R. Müller, “Bypassing the Kohn-Sham equations with machine learning,” Nat. Commun., vol. 8, no. 872, 2017.
  • [5] A. V. Sinitskiy and V. S. Pande, “Deep neural network computes electron densities and energies of a large set of organic molecules faster than density functional theory (DFT),” https://arxiv.longhoe.net/abs/1809.02723.
  • [6] C. A. Custódio, E. R. Filletti, and V. V. F. ca, “Artificial neural networks for density-functional optimizations in fermionic systems,” Sci. Rep., vol. 9, no. 1886, 2019.
  • [7] G. R. Schleder, A. C. M. Padhila, C. M. Acosta, M. Costa, and A. Fazzio, “From DFT to machine learning: recent approaches to materials science–a review,” JPhys. Materials, vol. 2, no. 3, 2019.
  • [8] R. K. Vasudevan, K. Choudhary, A. Mehta, R. Smith, G. Kusne, F. Tavazza, L. Vlcek, M. Ziatdinov, S. V. Kalinin, and J. Hattrick-Simpers, “Materials science in the AI age: high-throughput library generation, machine learning and a pathway from correlations to the underpinning physics,” MRS Commun., vol. 9, no. 3, p. 10.1557/mrc.2019.95, 2019.
  • [9] J. Maguire, M. Benedict, L. Woodcock, and S. LeClair, “Artificial intelligence in materials science: Application to molecular and particulate simulations,” MRS Commun., vol. 700, no. S8.1., p. 10.1557/mrc.2019.95, 2001.
  • [10] C. Wang, A. Tharval, and J. R. Kitchin, “A density functional theory parameterised neural network model of zirconia,” Mol. Simul., vol. 44, no. 8, pp. 623–630, 2018.
  • [11] M. Lupo Pasini, Y. W. Li, J. Yin, J. Zhang, K. Barros, and M. Eisenbach, “Fast and stable deep-learning predictions of material properties for solid solution alloys,” J. Phys.: Condens. Matter, vol. 33, no. 8, p. 084005, Dec. 2020, publisher: IOP Publishing. [Online]. Available: https://doi.org/10.1088/1361-648x/abcb10
  • [12] C. W. Park, M. Kornbluth, J. Vandermause, C. Wolverton, B. Kozinsky, and J. P. Mailoa, “Accurate and scalable graph neural network force field and molecular dynamics with direct force architecture,” npj Computational Materials, vol. 7, p. 73, 2021. [Online]. Available: https://doi.org/10.1038/s41524-021-00543-3
  • [13] S. Axelrod, D. Schwalbe-Koda, S. Mohapatra, J. Damewood, K. P. Greenman, and R. Gómez-Bombarelli, “Learning matter: Materials design with machine learning and atomistic simulations,” Acc. Mater. Res., vol. 3, pp. 343–357, February 2022. [Online]. Available: https://doi.org/10.1021/accountsmr.1c00238
  • [14] P. Yoo, D. Bhowmik, K. Mehta, P. Zhang, F. Liu, M. Lupo Pasini, and S. Irle, “Deep learning workflow for the inverse design of molecules with specific optoelectronic properties,” Scientific Reports, vol. 13, no. 1, p. 20031, 2023. [Online]. Available: https://doi.org/10.1038/s41598-023-45385-9
  • [15] Q. Zhang, K. Ding, T. Lyv, X. Wang, Q. Yin, Y. Zhang, J. Yu, Y. Wang, X. Li, Z. Xiang, X. Zhuang, Z. Wang, M. Qin, M. Zhang, J. Zhang, J. Cui, R. Xu, H. Chen, X. Fan, H. Xing, and H. Chen, “Scientific large language models: A survey on biological & chemical domains,” ArXiv, vol. abs/2401.14656, 2024.
  • [16] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
  • [17] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186.
  • [18] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” arXiv preprint arXiv:1907.11692, 2019.
  • [19] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” Journal of Machine Learning Research, vol. 21, no. 140, pp. 1–67, 2020.
  • [20] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” arXiv preprint arXiv:2005.14165, 2020.
  • [21] B. Mueller, C. Bertram, M. Lott, V. Krey, and M. Meinshausen, “Enhancing climate change text analysis with pre-trained language models,” Environmental Research Letters, vol. 15, no. 10, p. 104005, 2020.
  • [22] J. Dodge, M. Gardner, S. Iyer, W.-t. Yih, and Y. Choi, “Climate-qa: A dataset for climate change question answering,” arXiv preprint arXiv:2005.08808, 2020.
  • [23] X. Chen, Z. Hu, Y. Huo, and C. Qiu, “Leveraging bert for extracting information from climate change articles,” Sustainability, vol. 12, no. 7, p. 2801, 2020.
  • [24] M. Webersinke, P. Hofer, L. Wanner, M. Basaldella, and C. Pattichis, “Climatebert: A pretrained language model for climate-related text,” arXiv preprint arXiv:2110.12010, 2021.
  • [25] X. Liu, J. Gao, X. Wang, and X. Ren, “Climatebert: Adapting transformer architectures for climate policy research,” in Proceedings of the 2022 Annual Meeting of the Association for Computational Linguistics, 2022, pp. 540–550.
  • [26] V. Tshitoyan, J. Dagdelen, L. Weston, A. Dunn, Z. Rong, O. Kononova, K. A. Persson, G. Ceder, and A. Jain, “Matbert: A materials domain language model for text mining and information extraction,” Nature, vol. 571, no. 7763, pp. 95–98, 2019.
  • [27] I. Beltagy, K. Lo, and A. Cohan, “Scibert: A pretrained language model for scientific text,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 3615–3620.
  • [28] E. Kim, Y. Liu, W. J. Choi, W. **, J. Park, and J. Kang, “Polymerbert: A language model for predicting polymer properties from text,” arXiv preprint arXiv:2011.05691, 2020.
  • [29] S. Gupta and J. M. Cole, “Application of bert-based transformers in extracting information from materials science literature,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 456–465.
  • [30] S. Chithrananda, G. Grand, and B. Ramsundar, “Chemberta: Large-scale self-supervised pretraining for molecular property prediction,” arXiv preprint arXiv:2010.09885, 2020.
  • [31] J. Lee, H. Kim, H. J. Kim, and N.-G. Park, “Materialsbert: A pretrained language model for materials science,” arXiv preprint arXiv:2106.11953, 2021.
  • [32] S. Takeda, I. Priyadarsini, A. Kishimoto, H. Shinohara, L. Hamada, H. Masataka, J. Fuchiwaki, and D. Nakano, “Multi-modal foundation model for material design,” in AI for Accelerated Materials Design - NeurIPS 2023 Workshop, 2023. [Online]. Available: https://openreview.net/forum?id=EiT2bLsfM9
  • [33] K. L. K. Lee, C. Gonzales, M. Spellings, M. Galkin, S. Miret, and N. Kumar, “Towards foundation models for materials science: The open matsci ml toolkit,” in Proceedings of the SC ’23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, ser. SC-W ’23.   New York, NY, USA: Association for Computing Machinery, 2023, p. 51–59. [Online]. Available: https://doi.org/10.1145/3624062.3626081
  • [34] D. Beaini, S. Huang, J. A. Cunha, Z. Li, G. Moisescu-Pareja, O. Dymov, S. Maddrell-Mander, C. McLean, F. Wenkel, L. Müller, J. H. Mohamud, A. Parviz, M. Craig, M. Koziarski, J. Lu, Z. Zhu, C. Gabellini, K. Klaser, J. Dean, C. Wognum, M. Sypetkowski, G. Rabusseau, R. Rabbany, J. Tang, C. Morris, M. Ravanelli, G. Wolf, P. Tossou, H. Mary, T. Bois, A. W. Fitzgibbon, B. Banaszewski, C. Martin, and D. Masters, “Towards foundational models for molecular learning on large-scale multi-task datasets,” in The Twelfth International Conference on Learning Representations, 2024. [Online]. Available: https://openreview.net/forum?id=Zc2aIcucwc
  • [35] M. Sypetkowski, F. Wenkel, F. Poursafaei, N. Dickson, K. Suri, P. Fradkin, and D. Beaini, “On the Scalability of GNNs for Molecular Graphs,” ArXiv, vol. abs/2404.11568v1, 2024.
  • [36] M. Lupo Pasini, J. Y. Choi, P. Zhang, J. Baker, and U. O. of Science, “Hydragnn v3.0, version v3.0,” 2 2024. [Online]. Available: https://www.osti.gov//servlets/purl/2283293
  • [37] J. Y. Choi, P. Zhang, K. Mehta, B. A., and M. Lupo Pasini, “Scalable training of graph convolutional neural networks for fast and accurate predictions of HOMO-LUMO gap in molecules,” J Cheminform, vol. 14, p. 70, 2022. [Online]. Available: https://doi.org/10.1186/s13321-022-00652-1
  • [38] M. Lupo Pasini, P. Zhang, S. T. Reeve, and J. Y. Choi, “Multi-task graph neural networks for simultaneous prediction of global and atomic properties in ferromagnetic systems,” Mach. learn.: sci. technol., vol. 3, no. 2, p. 025007, may 2022. [Online]. Available: https://doi.org/10.1088/2632-2153/ac6a51
  • [39] M. Lupo Pasini, J. Y. Choi, P. Zhang, and J. Baker, “User Manual - HydraGNN: Distributed PyTorch Implementation of Multi-Headed Graph Convolutional Neural Networks.” [Online]. Available: https://www.osti.gov/biblio/2224153
  • [40] J. Baker, M. Lupo Pasini, and C. Hauck, “Invariant features for accurate predictions of quantum chemical uv-vis spectra of organic molecules,” ChemRxiv, 2023, this content is a preprint and has not been peer-reviewed.
  • [41] J. Y. Choi, M. Lupo Pasini, P. Zhang, K. Mehta, F. Liu, J. Bae, and K. Ibrahim, “Ddstore: Distributed data store for scalable training of graph neural networks on large atomistic modeling datasets,” in Proceedings of the SC ’23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, ser. SC-W ’23.   New York, NY, USA: Association for Computing Machinery, 2023, p. 941–950. [Online]. Available: https://doi.org/10.1145/3624062.3624171
  • [42] I. Batatia, P. Benner, Y. Chiang, A. M. Elena, D. P. Kovács, J. Riebesell, X. R. Advincula, M. Asta, M. Avaylon, W. J. Baldwin, F. Berger, N. Bernstein, A. Bhowmik, S. M. Blau, V. Cărare, J. P. Darby, S. De, F. D. Pia, V. L. Deringer, R. Elijošius, Z. El-Machachi, F. Falcioni, E. Fako, A. C. Ferrari, A. Genreith-Schriever, J. George, R. E. A. Goodall, C. P. Grey, P. Grigorev, S. Han, W. Handley, H. H. Heenen, K. Hermansson, C. Holm, J. Jaafar, S. Hofmann, K. S. Jakob, H. Jung, V. Kapil, A. D. Kaplan, N. Karimitari, J. R. Kermode, N. Kroupa, J. Kullgren, M. C. Kuner, D. Kuryla, G. Liepuoniute, J. T. Margraf, I.-B. Magdău, A. Michaelides, J. H. Moore, A. A. Naik, S. P. Niblett, S. W. Norwood, N. O’Neill, C. Ortner, K. A. Persson, K. Reuter, A. S. Rosen, L. L. Schaaf, C. Schran, B. X. Shi, E. Sivonxay, T. K. Stenczel, V. Svahn, C. Sutton, T. D. Swinburne, J. Tilly, C. van der Oord, E. Varga-Umbrich, T. Vegge, M. Vondrák, Y. Wang, W. C. Witt, F. Zills, and G. Csányi, “A foundation model for atomistic materials chemistry,” 2024.
  • [43] A. Jain, S. P. Ong, G. Hautier, W. Chen, W. D. Richards, S. Dacek, S. Cholia, D. Gunter, D. Skinner, G. Ceder, and K. A. Persson, “The Materials Project: A materials genome approach to accelerating materials innovation,” APL Materials, vol. 1, no. 1, p. 011002, 2013.
  • [44] L. Chanussot, A. Das, S. Goyal, T. Lavril, M. Shuaibi, M. Riviere, K. Tran, J. Heras-Domingo, C. Ho, W. Hu, A. Palizhati, A. Sriram, B. Wood, J. Yoon, D. Parikh, C. L. Zitnick, and Z. Ulissi, “Open Catalyst 2020 (OC20) Dataset and Community Challenges,” ACS Catalysis, vol. 11, no. 10, p. 6059–6072, 2021. [Online]. Available: https://doi.org/10.1021/acscatal.0c04525
  • [45] R. Tran, J. Lan, M. Shuaibi, B. M. Wood, S. Goyal, A. Das, J. Heras-Domingo, A. Kolluru, A. Rizvi, N. Shoghi, A. Sriram, F. Therrien, J. Abed, O. Voznyy, E. H. Sargent, Z. Ulissi, and C. L. Zitnick, “The Open Catalyst 2022 (OC22) Dataset and Challenges for Oxide Electrocatalysts,” ACS Catalysis, vol. 13, no. 5, p. 3066–3084, 2023. [Online]. Available: https://doi.org/10.1021/acscatal.2c05426
  • [46] L. Hirschfeld, K. Swanson, K. Yang, R. Barzilay, and C. W. Coley, “Uncertainty quantification using neural networks for molecular property prediction,” Journal of Chemical Information and Modeling, vol. 60, no. 8, pp. 3770–3780, 2020.
  • [47] S. Ryu, Y. Kwon, and W. Y. Kim, “A bayesian graph convolutional network for reliable prediction of molecular properties with uncertainty quantification,” Chemical science, vol. 10, no. 36, pp. 8438–8446, 2019.
  • [48] K. Huang, Y. **, E. Candes, and J. Leskovec, “Uncertainty quantification over graph with conformalized graph neural networks,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [49] S. Jiang, S. Qin, R. C. Van Lehn, P. Balaprakash, and V. M. Zavala, “Uncertainty quantification for molecular property predictions with graph neural architecture search,” arXiv preprint arXiv:2307.10438, 2023.
  • [50] T. Baruah, K. Shivdikar, S. Dong, Y. Sun, S. A. Mojumder, K. Jung, J. L. Abellán, Y. Ukidave, A. Joshi, J. Kim, and D. Kaeli, “Gnnmark: A benchmark suite to characterize graph neural network training on gpus,” in 2021 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2021, pp. 13–23.
  • [51] Z. Lin, C. Li, Y. Miao, Y. Liu, and Y. Xu, “Pagraph: Scaling gnn training on large graphs via computation-aware caching,” in Proceedings of the 11th ACM Symposium on Cloud Computing, ser. SoCC ’20.   New York, NY, USA: Association for Computing Machinery, 2020, p. 401–415. [Online]. Available: https://doi.org/10.1145/3419111.3421281
  • [52] D. Yang, J. Liu, J. Qi, and J. Lai, “Wholegraph: A fast graph neural network training framework with multi-gpu distributed shared memory architecture,” in SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, 2022, pp. 1–14.
  • [53] M. F. Balin, K. Sancak, and U. V. Catalyurek, “Mg-gcn: A scalable multi-gpu gcn training framework,” in Proceedings of the 51st International Conference on Parallel Processing, ser. ICPP ’22.   New York, NY, USA: Association for Computing Machinery, 2023. [Online]. Available: https://doi.org/10.1145/3545008.3545082
  • [54] V. Md, S. Misra, G. Ma, R. Mohanty, E. Georganas, A. Heinecke, D. Kalamkar, N. K. Ahmed, and S. Avancha, “Distgnn: scalable distributed training for large-scale graph neural networks,” in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’21.   New York, NY, USA: Association for Computing Machinery, 2021. [Online]. Available: https://doi.org/10.1145/3458817.3480856
  • [55] S. Li, Y. Zhao, R. Varma, O. Salpekar, P. Noordhuis, T. Li, A. Paszke, J. Smith, B. Vaughan, P. Damania, and S. Chintala, “Pytorch distributed: experiences on accelerating data parallel training,” Proc. VLDB Endow., vol. 13, no. 12, p. 3005–3018, aug 2020. [Online]. Available: https://doi.org/10.14778/3415478.3415530
  • [56] O. Viniavskyi, M. Dobko, D. Mishkin, and O. Dobosevych, “Openglue: Open source graph neural net based pipeline for image matching,” 2022.
  • [57] S. Miret, K. L. K. Lee, C. Gonzales, M. Nassar, and M. Spellings, “The open MatSci ML toolkit: A flexible framework for machine learning in materials science,” Transactions on Machine Learning Research, 2023.
  • [58] K. L. K. Lee, C. Gonzales, M. Nassar, M. Spellings, M. Galkin, and S. Miret, “Matsciml: A broad, multi-task benchmark for solid-state materials modeling,” arXiv preprint arXiv:2309.05934, 2023.
  • [59] M. Li, J. Zhou, J. Hu, W. Fan, Y. Zhang, Y. Gu, and G. Karypis, “Dgl-lifesci: An open-source toolkit for deep learning on graphs in life science,” ACS Omega, vol. 6, no. 41, pp. 27 233–27 238, 2021. [Online]. Available: https://doi.org/10.1021/acsomega.1c04017
  • [60] J. Baker, M. L. Pasini, and C. Hauck, “Invariant features for accurate predictions of quantum chemical uv-vis spectra of organic molecules,” in SoutheastCon 2024, 2024, pp. 311–320.
  • [61] “PyTorch,” https://pytorch.org/docs/stable/index.html.
  • [62] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds.   Vancouver Convention Centre, Vancouver, Canada: Curran Associates, Inc., 2019, pp. 8024–8035. [Online]. Available: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
  • [63] M. Fey and J. E. Lenssen, “Fast graph representation learning with PyTorch Geometric,” in ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019.
  • [64] “PyTorch Geometric,” https://pytorch-geometric.readthedocs.io/en/latest/.
  • [65] J. S. Smith, R. Zubatyuk, B. Nebgen, N. Lubbers, K. Barros, A. E. Roitberg, O. Isayev, and S. Tretiak, “The ani-1ccx and ani-1x data sets, coupled-cluster and density functional theory properties for molecules,” Scientific Data, vol. 7, p. 134, 2020.
  • [66] J. Hoja, L. M. Sandonas, B. G. Ernst, A. Vazquez-Mayagoitia, R. A. DiStasio Jr., and A. Tkatchenko, “Qm7-x, a comprehensive dataset of quantum-mechanical properties spanning the chemical space of small organic molecules,” Scientific Data, vol. 8, p. 43, 2021.
  • [67] W. F. Godoy, N. Podhorszki, R. Wang, C. Atkins, G. Eisenhauer, J. Gu, P. Davis, J. Choi, K. Germaschewski, K. Huck et al., “Adios 2: The adaptable input output system. a framework for high-performance data management,” SoftwareX, vol. 12, p. 100561, 2020.
  • [68] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
  • [69] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang et al., “Large scale distributed deep networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1223–1231.
  • [70] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
  • [71] M. Li, D. G. Andersen, J. W. Park, A. J. Smola, A. Ahmed, V. Josifovski, J. Long, E. J. Shekita, and B.-Y. Su, “Scaling distributed machine learning with the parameter server,” in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), 2014, pp. 583–598.
  • [72] P. Goyal, P. Dollár, R. Girshick, L. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch sgd: Training imagenet in 1 hour,” arXiv preprint arXiv:1706.02677, 2017.
  • [73] M. Lupo Pasini, J. Yin, Y. W. Li, and M. Eisenbach, “A scalable algorithm for the optimization of neural network architectures,” Parallel Computing, vol. 104-105, p. 102788, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0167819121000430
  • [74] T. Yu and H. Zhu, “Hyper-parameter optimization: A review of algorithms and applications,” ArXiv, vol. abs/2003.05689, 2020. [Online]. Available: https://api.semanticscholar.org/CorpusID:212675087
  • [75] A. Muyskens, B. W. Priest, I. R. Goumiri, and M. D. Schneider, “Muygps: Scalable gaussian process hyperparameter estimation using local cross-validation,” arXiv: Computation, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID:233476590
  • [76] A. Kadra, M. Janowski, M. Wistuba, and J. Grabocka, “Scaling laws for hyperparameter optimization,” in Neural Information Processing Systems, 2023. [Online]. Available: https://api.semanticscholar.org/CorpusID:258887933
  • [77] A. J. Fetterman, E. Kitanidis, J. Albrecht, Z. Polizzi, B. Fogelman, M. Knutins, B. Wróblewski, J. B. Simon, and K. Qiu, “Tune as you scale: Hyperparameter optimization for compute efficient training,” ArXiv, vol. abs/2306.08055, 2023.
  • [78] J. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyper-parameter optimization,” in Advances in Neural Information Processing Systems, J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, Eds., vol. 24.   Curran Associates, Inc., 2011. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/2011/file/86e8f7ab32cfd12577bc2619bc635690-Paper.pdf
  • [79] P. Balaprakash, M. Salim, T. D. Uram, V. Vishwanath, and S. M. Wild, “Deephyper: Asynchronous hyperparameter search for deep neural networks,” in 2018 IEEE 25th International Conference on High Performance Computing (HiPC), 2018, pp. 42–51.
  • [80] K. T. Schütt, P.-J. Kindermans, H. E. Sauceda, S. Chmiela, A. Tkatchenko, and K.-R. Müller, “Schnet: a continuous-filter convolutional neural network for modeling quantum interactions,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, ser. NIPS’17.   Red Hook, NY, USA: Curran Associates Inc., 2017, p. 992–1002.
  • [81] T. Xie and J. C. Grossman, “Crystal graph convolutional neural networks for an accurate and interpretable prediction of material properties,” Phys. Rev. Lett., vol. 120, p. 145301, Apr 2018. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevLett.120.145301
  • [82] R. Egele, F. Mohr, T. Viering, and P. Balaprakash, “The unreasonable effectiveness of early discarding after one epoch in neural network hyperparameter optimization,” Neurocomputing, vol. 597, p. 127964, 2024. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0925231224007355
  • [83] A. Krogh and J. Vedelsby, “Neural network ensembles, cross validation, and active learning,” in Advances in Neural Information Processing Systems, G. Tesauro, D. Touretzky, and T. Leen, Eds., vol. 7.   MIT Press, 1994. [Online]. Available: https://proceedings.neurips.cc/paper_files/paper/1994/file/b8c37e33defde51cf91e1e03e51657da-Paper.pdf
  • [84] R. Egele, R. Maulik, K. Raghavan, B. Lusch, I. Guyon, and P. Balaprakash, “Autodeuq: Automated deep ensemble with uncertainty quantification,” in 2022 26th International Conference on Pattern Recognition (ICPR).   IEEE, 2022, pp. 1908–1914.
  • [85] B. Rabenstein and J. Volz, “Prometheus: A Next-Generation monitoring system (talk).”   Dublin: USENIX Association, May 2015.
  • [86] D. R. Jones, M. Schonlau, and W. J. Welch, “Efficient global optimization of expensive black-box functions,” Journal of Global Optimization, vol. 13, pp. 455–492, 1998.
  • [87] K. Swersky, J. Snoek, and R. P. Adams, “Multi-task bayesian optimization,” in Advances in Neural Information Processing Systems (NIPS), 2013, pp. 2004–2012.
  • [88] J. Snoek, H. Larochelle, and R. P. Adams, “Practical bayesian optimization of machine learning algorithms,” in Advances in Neural Information Processing Systems (NIPS), 2012, pp. 2960–2968.
  • [89] J. Snoek, K. Swersky, R. P. Adams, D. W. Turner, H. Larochelle, and K. P. Murphy, “Bayesian optimization and semiparametric models with applications to assistive technology,” arXiv preprint arXiv:1402.7182, 2014.
  • [90] J. Gonzalez, Z. Dai, P. Hennig, and N. D. Lawrence, “Batch bayesian optimization via local penalization,” in Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), 2016, pp. 648–657.
  • [91] P. I. Frazier, “Bayesian optimization,” arXiv preprint arXiv:1807.02811, 2018.
  • [92] F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Parallel algorithm configuration,” in Learning and Intelligent Optimization (LION), 2012, pp. 55–70.
  • [93] M. Lupo Pasini, M. Burĉul, S. T. Reeve, M. Eisenbach, and S. Perotto, “Fast and accurate predictions of total energy for solid solution alloys with graph convolutional neural networks,” Springer Journal of Communications in Computer and Information Science, vol. 1512, Sep. 2021.