Search | arXiv e-print repository

Scalable Training of Graph Foundation Models for Atomistic Materials Modeling: A Case Study with HydraGNN

Authors: Massimiliano Lupo Pasini, Jong Youl Choi, Kshitij Mehta, Pei Zhang, David Rogers, Jonghyun Bae, Khaled Z. Ibrahim, Ashwin M. Aji, Karl W. Schulz, Jorda Polo, Prasanna Balaprakash

Abstract: We present our work on developing and training scalable graph foundation models (GFM) using HydraGNN, a multi-headed graph convolutional neural network architecture. HydraGNN expands the boundaries of graph neural network (GNN) in both training scale and data diversity. It abstracts over message passing algorithms, allowing both reproduction of and comparison across algorithmic innovations that define convolution in GNNs. This work discusses a series of optimizations that have allowed scaling up the GFM training to tens of thousands of GPUs on datasets that consist of hundreds of millions of graphs. Our GFMs use multi-task learning (MTL) to simultaneously learn graph-level and node-level properties of atomistic structures, such as the total energy and atomic forces. Using over 150 million atomistic structures for training, we illustrate the performance of our approach along with the lessons learned on two United States Department of Energy (US-DOE) supercomputers, namely the Perlmutter petascale system at the National Energy Research Scientific Computing Center and the Frontier exascale system at Oak Ridge National Laboratory. The HydraGNN architecture enables the GFM to achieve near-linear strong scaling performance using more than 2,000 GPUs on Perlmutter and 16,000 GPUs on Frontier. Hyperparameter optimization (HPO) was performed on over 64,000 GPUs on Frontier to select GFM architectures with high accuracy. Early stopping was applied on each GFM architecture for energy awareness in performing such an extreme-scale task. The training of an ensemble of highest-ranked GFM architectures continued until convergence to establish uncertainty quantification (UQ) capabilities with ensemble learning. Our contribution opens the door for rapidly developing, training, and deploying GFMs using large-scale computational resources to enable AI-accelerated materials discovery and design.

Submitted 28 June, 2024; v1 submitted 12 June, 2024; originally announced June 2024.

Comments: 16 pages, 13 figures

MSC Class: 68T07; 68T09 ACM Class: C.2.4; I.2.11

arXiv:2112.04494 [pdf, other]

doi 10.1145/3490354.3494448

Deep Q-Learning Market Makers in a Multi-Agent Simulated Stock Market

Authors: Oscar Fernández Vicente, Fernando Fernández Rebollo, Francisco Javier García Polo

Abstract: Market makers play a key role in financial markets by providing liquidity. They usually fill order books with buy and sell limit orders in order to provide traders with alternative price levels at which to operate. This paper focuses precisely on the study of these market makers' strategies from an agent-based perspective. In particular, we propose the application of Reinforcement Learning (RL) for the creation of intelligent market makers in simulated stock markets. This research analyzes how RL market maker agents behave in non-competitive (only one RL market maker learning at the same time) and competitive scenarios (multiple RL market makers learning at the same time), and how they adapt their strategies in a Sim2Real scope with interesting results. Furthermore, it covers the application of policy transfer between different experiments, describing the impact of competing environments on RL agents' performance. RL and deep RL techniques are proven to be profitable market maker approaches, leading to a better understanding of their behavior in stock markets.

Submitted 8 December, 2021; originally announced December 2021.

Comments: Presented at 2nd ACM International Conference on AI in Finance

arXiv:2111.15451 [pdf, other]

Large-Scale Video Analytics through Object-Level Consolidation

Authors: Daniel Rivas, Francesc Guim, Jordà Polo, David Carrera

Abstract: As the number of installed cameras grows, so do the compute resources required to process and analyze all the images captured by these cameras. Video analytics enables new use cases, such as smart cities or autonomous driving. At the same time, it urges service providers to install additional compute resources to cope with the demand while the strict latency requirements push compute towards the end of the network, forming a geographically distributed and heterogeneous set of compute locations, shared and resource-constrained. Such a landscape (shared and distributed locations) forces us to design new techniques that can optimize and distribute work among all available locations and, ideally, make compute requirements grow sublinearly with respect to the number of cameras installed. In this paper, we present FoMO (Focus on Moving Objects). This method effectively optimizes multi-camera deployments by preprocessing images for scenes, filtering the empty regions out, and composing regions of interest from multiple cameras into a single image that serves as input for a pre-trained object detection model. Results show that overall system performance can be increased by 8x while accuracy improves by 40% as a by-product of the methodology, all using an off-the-shelf pre-trained model with no additional training or fine-tuning.

Submitted 30 November, 2021; originally announced November 2021.

arXiv:2104.06826 [pdf, other]

Towards Automatic Model Specialization for Edge Video Analytics

Authors: Daniel Rivas, Francesc Guim, Jordà Polo, Pubudu M. Silva, Josep Ll. Berral, David Carrera

Abstract: Judging by popular and generic computer vision challenges, such as ImageNet or PASCAL VOC, neural networks have proven to be exceptionally accurate in recognition tasks. However, state-of-the-art accuracy often comes at a high computational price, requiring hardware acceleration to achieve real-time performance, while use cases, such as smart cities, require images from fixed cameras to be analyzed in real-time. Due to the amount of network bandwidth these streams would generate, we cannot rely on offloading compute to a centralized cloud. Thus, a distributed edge cloud is expected to process images locally. However, the edge is, by nature, resource-constrained, which puts a limit on the computational complexity that it can execute. Yet, there is a need for a meeting point between the edge and accurate real-time video analytics. Specializing lightweight models on a per-camera basis may help, but it quickly becomes unfeasible as the number of cameras grows unless the process is automated. In this paper, we present and evaluate COVA (Contextually Optimized Video Analytics), a framework to assist in the automatic specialization of models for video analytics in edge cameras. COVA automatically improves the accuracy of lightweight models through their specialization. Moreover, we discuss and review each step involved in the process to understand the different trade-offs that each one entails. Additionally, we show how the sole assumption of static cameras allows us to make a series of considerations that greatly simplify the scope of the problem. Finally, experiments show that state-of-the-art models, i.e., those able to generalize to unseen environments, can be effectively used as teachers to tailor smaller networks to a specific context, boosting accuracy at a constant computational cost. Results show that COVA can automatically improve the accuracy of pre-trained models by an average of 21%.

Submitted 13 December, 2021; v1 submitted 14 April, 2021; originally announced April 2021.

arXiv:2007.02813 [pdf, other]

doi 10.1007/978-3-030-10549-5_48

Disaggregating Non-Volatile Memory for Throughput-Oriented Genomics Workloads

Authors: Aaron Call, Jordà Polo, David Carrera, Francesc Guim, Sujoy Sen

Abstract: Massive exploitation of next-generation sequencing technologies requires dealing with both huge amounts of data and complex bioinformatics pipelines. Computing architectures have evolved to deal with these problems, enabling approaches that were unfeasible years ago: accelerators and Non-Volatile Memories (NVM) are becoming widely used to enhance the most demanding workloads. However, bioinformatics workloads are usually part of bigger pipelines with different and dynamic needs in terms of resources. The introduction of Software Defined Infrastructures (SDI) for data centers lays the groundwork for dramatically increasing the efficiency of infrastructure management. SDI enables new ways to structure hardware resources through disaggregation, and provides new hardware composability and sharing mechanisms to deploy workloads in more flexible ways. In this paper we study a state-of-the-art genomics application, SMUFIN, aiming to address the challenges of future HPC facilities.

Submitted 6 July, 2020; originally announced July 2020.

Comments: Partially funded by European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 639595) - HiEST Project. Published in Euro-Par 2018: Parallel Processing Workshops, pp. 613-625

Journal ref: Euro-Par 2018: Parallel Processing Workshops

arXiv:1712.03254 [pdf, other]

Accelerating K-mer Frequency Counting with GPU and Non-Volatile Memory

Authors: Nicola Cadenelli, Jorda Polo, David Carrera

Abstract: The emergence of Next Generation Sequencing (NGS) platforms has increased the throughput of genomic sequencing and in turn the amount of data that needs to be processed, requiring highly efficient computation for its analysis. In this context, modern architectures including accelerators and non-volatile memory are essential to enable the mass exploitation of these bioinformatics workloads. This paper presents a redesign of the main component of a state-of-the-art reference-free method for variant calling, SMUFIN, which has been adapted to make the most of GPUs and NVM devices. SMUFIN relies on counting the frequency of k-mers (substrings of length $k$) in DNA sequences, which also constitutes a well-known problem for many bioinformatics workloads, such as genome assembly. We propose techniques to improve the efficiency of k-mer counting and to scale up workloads like SMUFIN that used to require 16 nodes of \mn to a single machine with a GPU and NVM drives. Results show that although the single machine is not able to improve the time to solution of 16 nodes, its CPU time is 7.5x shorter than the aggregate CPU time of the 16 nodes, with a reduction in energy consumption of 5.5x.

Submitted 21 November, 2017; originally announced December 2017.

Comments: Submitted to the 19th IEEE International Conference on High Performance Computing and Communication (HPC 2017). Partially funded by European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 639595) - HiEST Project

arXiv:1511.02043 [pdf, other]

doi 10.1109/NCA.2015.49

Performance Evaluation of Microservices Architectures using Containers

Authors: Marcelo Amaral, Jordà Polo, David Carrera, Iqbal Mohomed, Merve Unuvar, Malgorzata Steinder

Abstract: Microservices architecture has started a new trend for application development for a number of reasons: (1) to reduce complexity by using tiny services; (2) to scale, remove and deploy parts of the system easily; (3) to improve flexibility to use different frameworks and tools; (4) to increase the overall scalability; and (5) to improve the resilience of the system. Containers have empowered the usage of microservices architectures by being lightweight, providing fast start-up times, and having a low overhead. Containers can be used to develop applications based on monolithic architectures, where the whole system runs inside a single container, or on a microservices architecture, where one or a few processes run inside each container. Two models can be used to implement a microservices architecture using containers: master-slave or nested-container. The goal of this work is to compare the CPU and network performance of benchmarks running in the two aforementioned models of microservices architecture, and hence to provide benchmark analysis guidance for system designers.

Submitted 6 November, 2015; originally announced November 2015.

Comments: Submitted to the 14th IEEE International Symposium on Network Computing and Applications (IEEE NCA15). Partially funded by European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 639595) - HiEST Project

Showing 1–7 of 7 results for author: Polo, J