-
Experiences Readying Applications for Exascale
Authors:
Paul T. Bauman,
Reuben D. Budiardja,
Dmytro Bykov,
Noel Chalmers,
Jacqueline Chen,
Nicholas Curtis,
Marc Day,
Markus Eisenbach,
Lucas Esclapez,
Alessandro Fanfarillo,
William Freitag,
Nicholas Frontiere,
Antigoni Georgiadou,
Joseph Glenski,
Kalyana Gottiparthi,
Marc T. Henry de Frahan,
Gustav R. Jansen,
Wayne Joubert,
Justin G. Lietz,
Jakub Kurzak,
Nicholas Malaya,
Bronson Messer,
Damon McDougall,
Paul Mullowney,
Stephen Nichols, et al. (7 additional authors not shown)
Abstract:
The advent of exascale computing invites an assessment of existing best practices for developing application readiness on the world's largest supercomputers. This work details observations from the last four years in preparing scientific applications to run on the Oak Ridge Leadership Computing Facility's (OLCF) Frontier system. This paper addresses a range of topics in software including programmability, tuning, and portability considerations that are key to moving applications from existing systems to future installations. A set of representative workloads provides case studies for general system and software testing. We evaluate the use of early access systems for development across several generations of hardware. Finally, we discuss how best practices were identified and disseminated to the community through a wide range of activities including user guides and training sessions. We conclude with recommendations for ensuring application readiness on future leadership computing systems.
Submitted 2 October, 2023;
originally announced October 2023.
-
Optimizing High-Performance Linpack for Exascale Accelerated Architectures
Authors:
Noel Chalmers,
Jakub Kurzak,
Damon McDougall,
Paul T. Bauman
Abstract:
We detail the performance optimizations made in rocHPL, AMD's open-source implementation of the High-Performance Linpack (HPL) benchmark targeting accelerated node architectures designed for exascale systems such as the Frontier supercomputer. The implementation leverages the high-throughput GPU accelerators on the node via highly optimized linear algebra libraries, as well as the entire CPU socket to perform latency-sensitive factorization phases. We detail novel performance improvements such as a multi-threaded approach to computing the panel factorization phase on the CPU, time-sharing of CPU cores between processes on the node, as well as several optimizations which hide MPI communication. We present some performance results of this implementation of the HPL benchmark on a single node of the Frontier early access cluster at Oak Ridge National Laboratory, as well as scaling to multiple nodes.
Submitted 20 April, 2023;
originally announced April 2023.
-
Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures
Authors:
Emmanuel Agullo,
Henricus Bouwmeester,
Jack Dongarra,
Jakub Kurzak,
Julien Langou,
Lee Rosenberg
Abstract:
The algorithms in the current sequential numerical linear algebra libraries (e.g. LAPACK) do not parallelize well on multicore architectures. A new family of algorithms, the tile algorithms, has recently been introduced. Previous research has shown that it is possible to write efficient and scalable tile algorithms for performing a Cholesky factorization, a (pseudo) LU factorization, and a QR factorization. In this extended abstract, we attack the problem of computing the inverse of a symmetric positive definite matrix. We observe that, using a dynamic task scheduler, it is relatively painless to translate existing LAPACK code to obtain a ready-to-be-executed tile algorithm. However, we demonstrate that nontrivial compiler techniques (array renaming, loop reversal, and pipelining) must then be applied to further increase the parallelism of our application. We present preliminary experimental results.
Submitted 22 February, 2010;
originally announced February 2010.
-
Accelerating Scientific Computations with Mixed Precision Algorithms
Authors:
Marc Baboulin,
Alfredo Buttari,
Jack Dongarra,
Jakub Kurzak,
Julie Langou,
Julien Langou,
Piotr Luszczek,
Stanimire Tomov
Abstract:
On modern architectures, 32-bit operations are often at least twice as fast as 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented.
Submitted 20 August, 2008;
originally announced August 2008.
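The combination of 32-bit and 64-bit arithmetic described in the abstract above is commonly realized as iterative refinement: factor and solve in low precision, then correct the solution using residuals computed in high precision. The following is a minimal sketch of that idea and not the authors' implementation. It assumes a small dense system, emulates 32-bit arithmetic by rounding through `struct`, and uses hypothetical helper names (`f32`, `lu_factor_f32`, `lu_solve_f32`, `solve_mixed`) chosen only for illustration.

```python
import struct


def f32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def lu_factor_f32(a):
    """LU factorization with partial pivoting; every result is rounded to 32-bit."""
    n = len(a)
    lu = [[f32(v) for v in row] for row in a]
    piv = list(range(n))  # piv[k] = original index of the row now at position k
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ValueError("matrix is singular in 32-bit arithmetic")
        lu[k], lu[p] = lu[p], lu[k]
        piv[k], piv[p] = piv[p], piv[k]
        for i in range(k + 1, n):
            lu[i][k] = f32(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = f32(lu[i][j] - f32(lu[i][k] * lu[k][j]))
    return lu, piv


def lu_solve_f32(lu, piv, b):
    """Forward and backward substitution, again rounded to 32-bit."""
    n = len(lu)
    y = [f32(b[piv[i]]) for i in range(n)]
    for i in range(n):  # solve L y = P b (unit lower triangular)
        for j in range(i):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
    for i in reversed(range(n)):  # solve U x = y
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
        y[i] = f32(y[i] / lu[i][i])
    return y


def solve_mixed(a, b, tol=1e-14, max_iter=20):
    """Solve a x = b: 32-bit factorization, 64-bit residuals and updates."""
    n = len(a)
    lu, piv = lu_factor_f32(a)        # expensive O(n^3) step in low precision
    x = lu_solve_f32(lu, piv, b)      # initial low-precision solution
    b_norm = max(abs(v) for v in b)
    for _ in range(max_iter):
        # Residual is computed in full 64-bit precision.
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) <= tol * b_norm:
            break
        d = lu_solve_f32(lu, piv, r)  # cheap O(n^2) correction, low precision
        x = [xi + di for xi, di in zip(x, d)]  # update accumulated in 64-bit
    return x
```

The design point is that the cubic-cost factorization runs entirely in the faster precision, while only the quadratic-cost residual and update steps use 64-bit arithmetic. Convergence to 64-bit accuracy is expected only when the matrix is reasonably well conditioned relative to 32-bit precision.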
-
A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
Authors:
Alfredo Buttari,
Julien Langou,
Jakub Kurzak,
Jack Dongarra
Abstract:
As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine-grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the Cholesky, LU and QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out-of-order execution of the tasks, which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithms, where parallelism can only be exploited at the level of the BLAS operations, and with vendor implementations.
Submitted 12 June, 2008; v1 submitted 9 September, 2007;
originally announced September 2007.
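The idea in the abstract above, a factorization expressed as small tasks on square tiles that are scheduled by their dependencies, can be illustrated for the Cholesky case. The sketch below performs no numerical work: it only enumerates the tasks of a right-looking tile Cholesky factorization, derives dependencies from the tiles each task reads and writes, and groups tasks into waves that could run concurrently. The function names and the task-naming scheme are assumptions made for illustration, and the kernel labels (POTRF, TRSM, SYRK, GEMM) are the conventional LAPACK/BLAS names rather than anything stated in the abstract.

```python
from collections import defaultdict


def tile_cholesky_tasks(nt):
    """List (name, tiles_read, tile_written) in sequential order for nt x nt tiles."""
    tasks = []
    for k in range(nt):
        # Factor the diagonal tile.
        tasks.append((f"POTRF({k},{k})", [], (k, k)))
        # Triangular solves for the panel below the diagonal tile.
        for i in range(k + 1, nt):
            tasks.append((f"TRSM({i},{k})", [(k, k)], (i, k)))
        # Update the trailing submatrix (lower triangle only).
        for i in range(k + 1, nt):
            tasks.append((f"SYRK({i},{i};{k})", [(i, k)], (i, i)))
            for j in range(k + 1, i):
                tasks.append((f"GEMM({i},{j};{k})", [(i, k), (j, k)], (i, j)))
    return tasks


def build_dependencies(tasks):
    """For each task, the set of earlier tasks that last wrote a tile it touches.

    Only read-after-write and write-after-write hazards are tracked. In this
    particular loop nest, tiles are read as inputs only once they are final,
    so write-after-read hazards do not arise.
    """
    last_writer = {}
    deps = []
    for t, (_name, reads, write) in enumerate(tasks):
        touched = reads + [write]  # the written tile is also read (in-place update)
        deps.append({last_writer[tile] for tile in touched if tile in last_writer})
        last_writer[write] = t
    return deps


def schedule_waves(tasks, deps):
    """Group tasks into waves; tasks within one wave have no mutual dependencies."""
    level = []
    for t in range(len(tasks)):
        # A task can run one wave after the latest of its predecessors.
        level.append(1 + max((level[d] for d in deps[t]), default=0))
    waves = defaultdict(list)
    for t, lv in enumerate(level):
        waves[lv].append(tasks[t][0])
    return [waves[lv] for lv in sorted(waves)]


if __name__ == "__main__":
    tasks = tile_cholesky_tasks(4)
    for n, wave in enumerate(schedule_waves(tasks, build_dependencies(tasks)), 1):
        print(n, wave)
```

A dynamic runtime would not wait for whole waves: it would start any task as soon as its own predecessors finish, which is what allows updates from different steps `k` to overlap with the sequential diagonal factorizations. The wave grouping is used here only because it makes the available parallelism easy to inspect.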