-
Simulation-Based Performance Prediction of HPC Applications: A Case Study of HPL
Authors:
Gen Xu,
Huda Ibeid,
Xin Jiang,
Vjekoslav Svilan,
Zhaojuan Bian
Abstract:
We propose a simulation-based approach for performance modeling of parallel applications on high-performance computing platforms. Our approach enables full-system performance modeling: (1) the hardware platform is represented by an abstract yet high-fidelity model; (2) the computation and communication components are simulated at a functional level, where the simulator allows the use of the compon…
▽ More
We propose a simulation-based approach for performance modeling of parallel applications on high-performance computing platforms. Our approach enables full-system performance modeling: (1) the hardware platform is represented by an abstract yet high-fidelity model; (2) the computation and communication components are simulated at a functional level, where the simulator allows the use of the components native interface; this results in a (3) fast and accurate simulation of full HPC applications with minimal modifications to the application source code. This hardware/software hybrid modeling methodology allows for low overhead, fast, and accurate exascale simulation and can be easily carried out on a standard client platform (desktop or laptop). We demonstrate the capability and scalability of our approach with High Performance LINPACK (HPL), the benchmark used to rank supercomputers in the TOP500 list. Our results show that our modeling approach can accurately and efficiently predict the performance of HPL at the scale of the TOP500 list supercomputers. For instance, the simulation of HPL on Frontera takes less than five hours with an error rate of four percent.
△ Less
Submitted 4 November, 2020;
originally announced November 2020.
-
FFT, FMM, and Multigrid on the Road to Exascale: performance challenges and opportunities
Authors:
Huda Ibeid,
Luke Olson,
William Gropp
Abstract:
FFT, FMM, and multigrid methods are widely used fast and highly scalable solvers for elliptic PDEs. However, emerging large-scale computing systems are introducing challenges in comparison to current petascale computers. Recent efforts (Dongarra et al. 2011) have identified several constraints in the design of exascale software that includes massive concurrency, resilience management, exploiting t…
▽ More
FFT, FMM, and multigrid methods are widely used fast and highly scalable solvers for elliptic PDEs. However, emerging large-scale computing systems are introducing challenges in comparison to current petascale computers. Recent efforts (Dongarra et al. 2011) have identified several constraints in the design of exascale software that includes massive concurrency, resilience management, exploiting the high performance of heterogeneous systems, energy efficiency, and utilizing the deeper and more complex memory hierarchy expected at exascale. In this paper, we perform a model-based comparison of the FFT, FMM, and multigrid methods in the context of these projected constraints. In addition, we use performance models to offer predictions about the expected performance on upcoming exascale system configurations based on current technology trends.
△ Less
Submitted 27 March, 2020; v1 submitted 28 October, 2018;
originally announced October 2018.
-
Learning with Analytical Models
Authors:
Huda Ibeid,
Si** Meng,
Oliver Dobon,
Luke Olson,
William Gropp
Abstract:
To understand and predict the performance of scientific applications, several analytical and machine learning approaches have been proposed, each having its advantages and disadvantages. In this paper, we propose and validate a hybrid approach for performance modeling and prediction, which combines analytical and machine learning models. The proposed hybrid model aims to minimize prediction cost w…
▽ More
To understand and predict the performance of scientific applications, several analytical and machine learning approaches have been proposed, each having its advantages and disadvantages. In this paper, we propose and validate a hybrid approach for performance modeling and prediction, which combines analytical and machine learning models. The proposed hybrid model aims to minimize prediction cost while providing reasonable prediction accuracy. Our validation results show that the hybrid model is able to learn and correct the analytical models to better match the actual performance. Furthermore, the proposed hybrid model improves the prediction accuracy in comparison to pure machine learning techniques while using small training datasets, thus making it suitable for hardware and workload changes.
△ Less
Submitted 25 February, 2019; v1 submitted 28 October, 2018;
originally announced October 2018.
-
Communication Reducing Algorithms for Distributed Hierarchical N-Body Problems with Boundary Distributions
Authors:
Mustafa Abduljabbar,
George Markomanolis,
Huda Ibeid,
Rio Yokota,
David Keyes
Abstract:
Reduction of communication and efficient partitioning are key issues for achieving scalability in hierarchical $N$-Body algorithms like FMM. In the present work, we propose four independent strategies to improve partitioning and reduce communication. First of all, we show that the conventional wisdom of using space-filling curve partitioning may not work well for boundary integral problems, which…
▽ More
Reduction of communication and efficient partitioning are key issues for achieving scalability in hierarchical $N$-Body algorithms like FMM. In the present work, we propose four independent strategies to improve partitioning and reduce communication. First of all, we show that the conventional wisdom of using space-filling curve partitioning may not work well for boundary integral problems, which constitute about 50% of FMM's application user base. We propose an alternative method which modifies orthogonal recursive bisection to solve the cell-partition misalignment that has kept it from scaling previously. Secondly, we optimize the granularity of communication to find the optimal balance between a bulk-synchronous collective communication of the local essential tree and an RDMA per task per cell. Finally, we take the dynamic sparse data exchange proposed by Hoefler et al. and extend it to a hierarchical sparse data exchange, which is demonstrated at scale to be faster than the MPI library's MPI_Alltoallv that is commonly used.
△ Less
Submitted 17 February, 2017;
originally announced February 2017.
-
A Performance Model for the Communication in Fast Multipole Methods on HPC Platforms
Authors:
Huda Ibeid,
Rio Yokota,
David Keyes
Abstract:
Exascale systems are predicted to have approximately one billion cores, assuming Gigahertz cores. Limitations on affordable network topologies for distributed memory systems of such massive scale bring new challenges to the current parallel programing model. Currently, there are many efforts to evaluate the hardware and software bottlenecks of exascale designs. There is therefore an urgent need to…
▽ More
Exascale systems are predicted to have approximately one billion cores, assuming Gigahertz cores. Limitations on affordable network topologies for distributed memory systems of such massive scale bring new challenges to the current parallel programing model. Currently, there are many efforts to evaluate the hardware and software bottlenecks of exascale designs. There is therefore an urgent need to model application performance and to understand what changes need to be made to ensure extrapolated scalability. The fast multipole method (FMM) was originally developed for accelerating N-body problems in astrophysics and molecular dynamics, but has recently been extended to a wider range of problems, including preconditioners for sparse linear solvers. It's high arithmetic intensity combined with its linear complexity and asynchronous communication patterns makes it a promising algorithm for exascale systems. In this paper, we discuss the challenges for FMM on current parallel computers and future exascale architectures, with a focus on inter-node communication. We develop a performance model that considers the communication patterns of the FMM, and observe a good match between our model and the actual communication time, when latency, bandwidth, network topology, and multi-core penalties are all taken into account. To our knowledge, this is the first formal characterization of inter-node communication in FMM, which validates the model against actual measurements of communication time.
△ Less
Submitted 25 May, 2014;
originally announced May 2014.