Search | arXiv e-print repository

Experiences Readying Applications for Exascale

Authors: Paul T. Bauman, Reuben D. Budiardja, Dmytro Bykov, Noel Chalmers, Jacqueline Chen, Nicholas Curtis, Marc Day, Markus Eisenbach, Lucas Esclapez, Alessandro Fanfarillo, William Freitag, Nicholas Frontiere, Antigoni Georgiadou, Joseph Glenski, Kalyana Gottiparthi, Marc T. Henry de Frahan, Gustav R. Jansen, Wayne Joubert, Justin G. Lietz, Jakub Kurzak, Nicholas Malaya, Bronson Messer, Damon McDougall, Paul Mullowney, Stephen Nichols, et al. (7 additional authors not shown)

Abstract: The advent of exascale computing invites an assessment of existing best practices for developing application readiness on the world's largest supercomputers. This work details observations from the last four years in preparing scientific applications to run on the Oak Ridge Leadership Computing Facility's (OLCF) Frontier system. This paper addresses a range of topics in software including programmability, tuning, and portability considerations that are key to moving applications from existing systems to future installations. A set of representative workloads provides case studies for general system and software testing. We evaluate the use of early access systems for development across several generations of hardware. Finally, we discuss how best practices were identified and disseminated to the community through a wide range of activities including user-guides and trainings. We conclude with recommendations for ensuring application readiness on future leadership computing systems.

Submitted 2 October, 2023; originally announced October 2023.

Comments: Accepted at SC23

arXiv:2304.10397

Optimizing High-Performance Linpack for Exascale Accelerated Architectures

Authors: Noel Chalmers, Jakub Kurzak, Damon McDougall, Paul T. Bauman

Abstract: We detail the performance optimizations made in rocHPL, AMD's open-source implementation of the High-Performance Linpack (HPL) benchmark targeting accelerated node architectures designed for exascale systems such as the Frontier supercomputer. The implementation leverages the high-throughput GPU accelerators on the node via highly optimized linear algebra libraries, as well as the entire CPU socket to perform latency-sensitive factorization phases. We detail novel performance improvements such as a multi-threaded approach to computing the panel factorization phase on the CPU, time-sharing of CPU cores between processes on the node, as well as several optimizations which hide MPI communication. We present some performance results of this implementation of the HPL benchmark on a single node of the Frontier early access cluster at Oak Ridge National Laboratory, as well as scaling to multiple nodes.

Submitted 20 April, 2023; originally announced April 2023.

arXiv:1002.4057

Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures

Authors: Emmanuel Agullo, Henricus Bouwmeester, Jack Dongarra, Jakub Kurzak, Julien Langou, Lee Rosenberg

Abstract: The algorithms in the current sequential numerical linear algebra libraries (e.g. LAPACK) do not parallelize well on multicore architectures. A new family of algorithms, the tile algorithms, has recently been introduced. Previous research has shown that it is possible to write efficient and scalable tile algorithms for performing a Cholesky factorization, a (pseudo) LU factorization, and a QR factorization. In this extended abstract, we attack the problem of the computation of the inverse of a symmetric positive definite matrix. We observe that, using a dynamic task scheduler, it is relatively painless to translate existing LAPACK code to obtain a ready-to-be-executed tile algorithm. However, we demonstrate that non-trivial compiler techniques (array renaming, loop reversal and pipelining) then need to be applied to further increase the parallelism of our application. We present preliminary experimental results.

Submitted 22 February, 2010; originally announced February 2010.

Comments: 8 pages, extended abstract submitted to VecPar10 on 12/11/09, notification of acceptance received on 02/05/10. See: http://vecpar.fe.up.pt/2010/

arXiv:0808.2794

doi: 10.1016/j.cpc.2008.11.005

Accelerating Scientific Computations with Mixed Precision Algorithms

Authors: Marc Baboulin, Alfredo Buttari, Jack Dongarra, Jakub Kurzak, Julie Langou, Julien Langou, Piotr Luszczek, Stanimire Tomov

Abstract: On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented.

Submitted 20 August, 2008; originally announced August 2008.

arXiv:0709.1272

A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures

Authors: Alfredo Buttari, Julien Langou, Jakub Kurzak, Jack Dongarra

Abstract: As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the Cholesky, LU and QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out of order execution of the tasks which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithms where parallelism can only be exploited at the level of the BLAS operations and vendor implementations.

Submitted 12 June, 2008; v1 submitted 9 September, 2007; originally announced September 2007.

Report number: LAPACK Working Note 191

Showing 1–5 of 5 results for author: Kurzak, J