-
Tightening I/O Lower Bounds through the Hourglass Dependency Pattern
Authors:
Lionel Eyraud-Dubois,
Guillaume Iooss,
Julien Langou,
Fabrice Rastello
Abstract:
When designing an algorithm, one cares about arithmetic/computational complexity, but data movement (I/O) complexity plays an increasingly important role that highly impacts performance and energy consumption. For a given algorithm and a given I/O model, scheduling strategies such as loop tiling can reduce the required I/O down to a limit, called the I/O complexity, inherent to the algorithm itsel…
▽ More
When designing an algorithm, one cares about arithmetic/computational complexity, but data movement (I/O) complexity plays an increasingly important role that highly impacts performance and energy consumption. For a given algorithm and a given I/O model, scheduling strategies such as loop tiling can reduce the required I/O down to a limit, called the I/O complexity, inherent to the algorithm itself. The objective of I/O complexity analysis is to compute, for a given program, its minimal I/O requirement among all valid schedules. We consider a sequential execution model with two memories, an infinite one, and a small one of size S on which the computations retrieve and produce data. The I/O is the number of reads and writes between the two memories. We identify a common "hourglass pattern" in the dependency graphs of several common linear algebra kernels. Using the properties of this pattern, we mathematically prove tighter lower bounds on their I/O complexity, which improves the previous state-of-the-art bound by a parametric ratio. This proof was integrated inside the IOLB automatic lower bound derivation tool.
△ Less
Submitted 25 April, 2024;
originally announced April 2024.
-
Proposed Consistent Exception Handling for the BLAS and LAPACK
Authors:
James Demmel,
Jack Dongarra,
Mark Gates,
Greg Henry,
Julien Langou,
Xiaoye Li,
Piotr Luszczek,
Weslley Pereira,
Jason Riedy,
Cindy Rubio-González
Abstract:
Numerical exceptions, which may be caused by overflow, operations like division by 0 or sqrt(-1), or convergence failures, are unavoidable in many cases, in particular when software is used on unforeseen and difficult inputs. As more aspects of society become automated, e.g., self-driving cars, health monitors, and cyber-physical systems more generally, it is becoming increasingly important to des…
▽ More
Numerical exceptions, which may be caused by overflow, operations like division by 0 or sqrt(-1), or convergence failures, are unavoidable in many cases, in particular when software is used on unforeseen and difficult inputs. As more aspects of society become automated, e.g., self-driving cars, health monitors, and cyber-physical systems more generally, it is becoming increasingly important to design software that is resilient to exceptions, and that responds to them in a consistent way. Consistency is needed to allow users to build higher-level software that is also resilient and consistent (and so on recursively). In this paper we explore the design space of consistent exception handling for the widely used BLAS and LAPACK linear algebra libraries, pointing out a variety of instances of inconsistent exception handling in the current versions, and propose a new design that balances consistency, complexity, ease of use, and performance. Some compromises are needed, because there are preexisting inconsistencies that are outside our control, including in or between existing vendor BLAS implementations, different programming languages, and even compilers for the same programming language. And user requests from our surveys are quite diverse. We also propose our design as a possible model for other numerical software, and welcome comments on our design choices.
△ Less
Submitted 19 July, 2022;
originally announced July 2022.
-
I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels
Authors:
Olivier Beaumont,
Lionel Eyraud-Dubois,
Mathieu Vérité,
Julien Langou
Abstract:
In this paper, we consider two fundamental symmetric kernels in linear algebra: the Cholesky factorization and the symmetric rank-$k$ update (SYRK), with the classical three nested loops algorithms for these kernels. In addition, we consider a machine model with a fast memory of size $S$ and an unbounded slow memory. In this model, all computations must be performed on operands in fast memory, and…
▽ More
In this paper, we consider two fundamental symmetric kernels in linear algebra: the Cholesky factorization and the symmetric rank-$k$ update (SYRK), with the classical three nested loops algorithms for these kernels. In addition, we consider a machine model with a fast memory of size $S$ and an unbounded slow memory. In this model, all computations must be performed on operands in fast memory, and the goal is to minimize the amount of communication between slow and fast memories. As the set of computations is fixed by the choice of the algorithm, only the ordering of the computations (the schedule) directly influences the volume of communications.We prove lower bounds of $\frac{1}{3\sqrt{2}}\frac{N^3}{\sqrt{S}}$ for the communication volume of the Cholesky factorization of an $N\times N$ symmetric positive definite matrix, and of $\frac{1}{\sqrt{2}}\frac{N^2M}{\sqrt{S}}$ for the SYRK computation of $\mat{A}\cdot\transpose{\mat{A}}$, where $\mathbf{A}$ is an $N\times M$ matrix. Both bounds improve the best known lower bounds from the literature by a factor $\sqrt{2}$.In addition, we present two out-of-core, sequential algorithms with matching communication volume: \TBS for SYRK, with a volume of $\frac{1}{\sqrt{2}}\frac{N^2M}{\sqrt{S}} + \bigo{NM\log N}$, and \LBC for Cholesky, with a volume of $\frac{1}{3\sqrt{2}}\frac{N^3}{\sqrt{S}} + \bigo{N^{5/2}}$. Both algorithms improve over the best known algorithms from the literature by a factor $\sqrt{2}$, and prove that the leading terms in our lower bounds cannot be improved further. This work shows that the operational intensity of symmetric kernels like SYRK or Cholesky is intrinsically higher (by a factor $\sqrt{2}$) than that of corresponding non-symmetric kernels (GEMM and LU factorization).
△ Less
Submitted 21 February, 2022;
originally announced February 2022.
-
Automated Derivation of Parametric Data Movement Lower Bounds for Affine Programs
Authors:
Auguste Olivry,
Julien Langou,
Louis-Noël Pouchet,
P. Sadayappan,
Fabrice Rastello
Abstract:
For most relevant computation, the energy and time needed for data movement dominates that for performing arithmetic operations on all computing systems today. Hence it is of critical importance to understand the minimal total data movement achievable during the execution of an algorithm. The achieved total data movement for different schedules of an algorithm can vary widely depending on how effi…
▽ More
For most relevant computation, the energy and time needed for data movement dominates that for performing arithmetic operations on all computing systems today. Hence it is of critical importance to understand the minimal total data movement achievable during the execution of an algorithm. The achieved total data movement for different schedules of an algorithm can vary widely depending on how efficiently the cache is used, e.g., untiled versus effectively tiled matrix-matrix multiplication. A significant current challenge is that no existing tool is able to meaningfully quantify the potential reduction to the data movement of a computation that can be achieved by more effective use of the cache through operation rescheduling. Asymptotic parametric expressions of data movement lower bounds have previously been manually derived for a limited number of algorithms, often without scaling constants. In this paper, we present the first compile-time approach for deriving non-asymptotic parametric expressions of data movement lower bounds for arbitrary affine computations. The approach has been implemented in a fully automatic tool (IOLB) that can generate these lower bounds for input affine programs.
IOLB's use is demonstrated by exercising it on all the benchmarks of the PolyBench suite. The advantages of IOLB are many: (1) IOLB enables us to derive bounds for few dozens of algorithms for which these lower bounds have never been derived. This reflects an increase of productivity by automation. (2) Anyone is able to obtain these lower bounds through IOLB, no expertise is required. (3) For some of the most well-studied algorithms, the lower bounds obtained by \tool are higher than any previously reported manually derived lower bounds.
△ Less
Submitted 15 November, 2019;
originally announced November 2019.
-
A Tight I/O Lower Bound for Matrix Multiplication
Authors:
Tyler Michael Smith,
Bradley Lowery,
Julien Langou,
Robert A. van de Geijn
Abstract:
A tight lower bound for required I/O when computing an ordinary matrix-matrix multiplication on a processor with two layers of memory is established. Prior work obtained weaker lower bounds by reasoning about the number of segments needed to perform $C:=AB$, for distinct matrices $A$, $B$, and $C$, where each segment is a series of operations involving $M$ reads and writes to and from fast memory,…
▽ More
A tight lower bound for required I/O when computing an ordinary matrix-matrix multiplication on a processor with two layers of memory is established. Prior work obtained weaker lower bounds by reasoning about the number of segments needed to perform $C:=AB$, for distinct matrices $A$, $B$, and $C$, where each segment is a series of operations involving $M$ reads and writes to and from fast memory, and $M$ is the size of fast memory. A lower bound on the number of segments was then determined by obtaining an upper bound on the number of elementary multiplications performed per segment. This paper follows the same high level approach, but improves the lower bound by (1) transforming algorithms for MMM so that they perform all computation via fused multiply-add instructions (FMAs) and using this to reason about only the cost associated with reading the matrices, and (2) decoupling the per-segment I/O cost from the size of fast memory. For $n \times n$ matrices, the lower bound's leading-order term is $2n^3/\sqrt{M}$. A theoretical algorithm whose leading terms attains this is introduced. To what extent the state-of-the-art Goto's Algorithm attains the lower bound is discussed.
△ Less
Submitted 6 February, 2019; v1 submitted 3 February, 2017;
originally announced February 2017.
-
Bidiagonalization with Parallel Tiled Algorithms
Authors:
Mathieu Faverge,
Julien Langou,
Yves Robert,
Jack Dongarra
Abstract:
We consider algorithms for going from a "full" matrix to a condensed "band bidiagonal" form using orthogonal transformations. We use the framework of "algorithms by tiles". Within this framework, we study: (i) the tiled bidiagonalization algorithm BiDiag, which is a tiled version of the standard scalar bidiagonalization algorithm; and (ii) the R-bidiagonalization algorithm R-BiDiag, which is a til…
▽ More
We consider algorithms for going from a "full" matrix to a condensed "band bidiagonal" form using orthogonal transformations. We use the framework of "algorithms by tiles". Within this framework, we study: (i) the tiled bidiagonalization algorithm BiDiag, which is a tiled version of the standard scalar bidiagonalization algorithm; and (ii) the R-bidiagonalization algorithm R-BiDiag, which is a tiled version of the algorithm which consists in first performing the QR factorization of the initial matrix, then performing the band-bidiagonalization of the R-factor. For both bidiagonalization algorithms BiDiag and R-BiDiag, we use four main types of reduction trees, namely FlatTS, FlatTT, Greedy, and a newly introduced auto-adaptive tree, Auto. We provide a study of critical path lengths for these tiled algorithms, which shows that (i) R-BiDiag has a shorter critical path length than BiDiag for tall and skinny matrices, and (ii) Greedy based schemes are much better than earlier proposed variants with unbounded resources. We provide experiments on a single multicore node, and on a few multicore nodes of a parallel distributed shared-memory system, to show the superiority of the new algorithms on a variety of matrix sizes, matrix shapes and core counts.
△ Less
Submitted 18 November, 2016;
originally announced November 2016.
-
A Backward/Forward Recovery Approach for the Preconditioned Conjugate Gradient Method
Authors:
Massimiliano Fasi,
Julien Langou,
Yves Robert,
Bora Ucar
Abstract:
Several recent papers have introduced a periodic verification mechanism to detect silent errors in iterative solvers. Chen [PPoPP'13, pp. 167--176] has shown how to combine such a verification mechanism (a stability test checking the orthogonality of two vectors and recomputing the residual) with checkpointing: the idea is to verify every $d$ iterations, and to checkpoint every $c \times d$ iterat…
▽ More
Several recent papers have introduced a periodic verification mechanism to detect silent errors in iterative solvers. Chen [PPoPP'13, pp. 167--176] has shown how to combine such a verification mechanism (a stability test checking the orthogonality of two vectors and recomputing the residual) with checkpointing: the idea is to verify every $d$ iterations, and to checkpoint every $c \times d$ iterations. When a silent error is detected by the verification mechanism, one can rollback to and re-execute from the last checkpoint. In this paper, we also propose to combine checkpointing and verification, but we use algorithm-based fault tolerance (ABFT) rather than stability tests. ABFT can be used for error detection, but also for error detection and correction, allowing a forward recovery (and no rollback nor re-execution) when a single error is detected. We introduce an abstract performance model to compute the performance of all schemes, and we instantiate it using the preconditioned conjugate gradient algorithm. Finally, we validate our new approach through a set of simulations.
△ Less
Submitted 13 November, 2015;
originally announced November 2015.
-
A Makespan Lower Bound for the Scheduling of the Tiled Cholesky Factorization based on ALAP scheduling
Authors:
Willy Quach,
Julien Langou
Abstract:
Due to the advent of multicore architectures and massive parallelism, the tiled Cholesky factorization algorithm has recently received plenty of attention and is often referenced by practitioners as a case study. It is also implemented in mainstream dense linear algebra libraries. However, we note that theoretical study of the parallelism of this algorithm is currently lacking. In this paper, we p…
▽ More
Due to the advent of multicore architectures and massive parallelism, the tiled Cholesky factorization algorithm has recently received plenty of attention and is often referenced by practitioners as a case study. It is also implemented in mainstream dense linear algebra libraries. However, we note that theoretical study of the parallelism of this algorithm is currently lacking. In this paper, we present new theoretical results about the tiled Cholesky factorization in the context of a parallel homogeneous model without communication costs. We use standard flop-based weights for the tasks. For a $t$-by-$t$ matrix, we know that the critical path of the tiled Cholesky algorithm is $9t-10$ and that the weight of all tasks is $t^3$. In this context, we prove that no schedule with less than $0.185 t^2$ processing units can finish in a time less than the critical path. In perspective, a naive bound gives $0.11 t^2.$ We then give a schedule which needs less than $0.25 t^2+0.16t+3$ processing units to complete in the time of the critical path. In perspective, a naive schedule gives $0.50 t^2.$ In addition, given a fixed number of processing units, $p$, we give a lower bound on the execution time as follows: $$\max( \frac{t^{3}}{p}, \frac{t^{3}}{p} - 3\frac{t^2}{p} + 6\sqrt{2p} - 7 , 9t-10).$$ The interest of the latter formula lies in the middle term. Our results stem from the observation that the tiled Cholesky factorization is much better behaved when we schedule it with an ALAP (As Late As Possible) heuristic than an ASAP (As Soon As Possible) heuristic. We also provide scheduling heuristics which match closely the lower bound on execution time. We believe that our theoretical results will help practical scheduling studies. Indeed, our results enable to better characterize the quality of a practical schedule with respect to an optimal schedule.
△ Less
Submitted 17 October, 2015;
originally announced October 2015.
-
A Greedy Algorithm for Optimally Pipelining a Reduction
Authors:
Bradley R. Lowery,
Julien Langou
Abstract:
Collective communications are ubiquitous in parallel applications. We present two new algorithms for performing a reduction. The operation associated with our reduction needs to be associative and commutative. The two algorithms are developed under two different communication models (unidirectional and bidirectional). Both algorithms use a greedy scheduling scheme. For a unidirectional, fully conn…
▽ More
Collective communications are ubiquitous in parallel applications. We present two new algorithms for performing a reduction. The operation associated with our reduction needs to be associative and commutative. The two algorithms are developed under two different communication models (unidirectional and bidirectional). Both algorithms use a greedy scheduling scheme. For a unidirectional, fully connected network, we prove that our greedy algorithm is optimal when some realistic assumptions are respected. Previous algorithms fit the same assumptions and are only appropriate for some given configurations. Our algorithm is optimal for all configurations. We note that there are some configuration where our greedy algorithm significantly outperform any existing algorithms. This result represents a contribution to the state-of-the art. For a bidirectional, fully connected network, we present a different greedy algorithm. We verify by experimental simulations that our algorithm matches the time complexity of an optimal broadcast (with addition of the computation). Beside reversing an optimal broadcast algorithm, the greedy algorithm is the first known reduction algorithm to experimentally attain this time complexity. Simulations show that this greedy algorithm performs well in practice, outperforming any state-of-the-art reduction algorithms. Positive experiments on a parallel distributed machine are also presented.
△ Less
Submitted 17 October, 2013;
originally announced October 2013.
-
Hierarchical QR factorization algorithms for multi-core cluster systems
Authors:
Jack Dongarra,
Mathieu Faverge,
Thomas Herault,
Julien Langou,
and Yves Robert
Abstract:
This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed multi-core nodes. These platforms make the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential ke…
▽ More
This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed multi-core nodes. These platforms make the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential kernels executed by the cores (high sequential performance), low number of messages in a parallel distributed setting (small latency term), and fine granularity (high parallelism).
△ Less
Submitted 7 October, 2011;
originally announced October 2011.
-
Tiled QR factorization algorithms
Authors:
Henricus Bouwmeester,
Mathias Jacquelin,
Julien Langou,
Yves Robert
Abstract:
This work revisits existing algorithms for the QR factorization of rectangular matrices composed of p-by-q tiles, where p >= q. Within this framework, we study the critical paths and performance of algorithms such as Sameh and Kuck, Modi and Clarke, Greedy, and those found within PLASMA. Although neither Modi and Clarke nor Greedy is optimal, both are shown to be asymptotically optimal for all mat…
▽ More
This work revisits existing algorithms for the QR factorization of rectangular matrices composed of p-by-q tiles, where p >= q. Within this framework, we study the critical paths and performance of algorithms such as Sameh and Kuck, Modi and Clarke, Greedy, and those found within PLASMA. Although neither Modi and Clarke nor Greedy is optimal, both are shown to be asymptotically optimal for all matrices of size p = q^2 f(q), where f is any function such that \lim_{+\infty} f= 0. This novel and important complexity result applies to all matrices where p and q are proportional, p = λq, with λ>= 1, thereby encompassing many important situations in practice (least squares). We provide an extensive set of experiments that show the superiority of the new algorithms for tall matrices.
△ Less
Submitted 22 April, 2011;
originally announced April 2011.
-
A Critical Path Approach to Analyzing Parallelism of Algorithmic Variants. Application to Cholesky Inversion
Authors:
Henricus Bouwmeester,
Julien Langou
Abstract:
Algorithms come with multiple variants which are obtained by changing the mathematical approach from which the algorithm is derived. These variants offer a wide spectrum of performance when implemented on a multicore platform and we seek to understand these differences in performances from a theoretical point of view. To that aim, we derive and present the critical path lengths of each algorithmic…
▽ More
Algorithms come with multiple variants which are obtained by changing the mathematical approach from which the algorithm is derived. These variants offer a wide spectrum of performance when implemented on a multicore platform and we seek to understand these differences in performances from a theoretical point of view. To that aim, we derive and present the critical path lengths of each algorithmic variant for our application problem which enables us to determine a lower bound on the time to solution. This metric provides an intuitive grasp of the performance of a variant and we present numerical experiments to validate the tightness of our lower bounds on practical applications. Our case study is the Cholesky inversion and its use in computing the inverse of a symmetric positive definite matrix.
△ Less
Submitted 11 October, 2010;
originally announced October 2010.
-
Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures
Authors:
Emmanuel Agullo,
Henricus Bouwmeester,
Jack Dongarra,
Jakub Kurzak,
Julien Langou,
Lee Rosenberg
Abstract:
The algorithms in the current sequential numerical linear algebra libraries (e.g. LAPACK) do not parallelize well on multicore architectures. A new family of algorithms, the tile algorithms, has recently been introduced. Previous research has shown that it is possible to write efficient and scalable tile algorithms for performing a Cholesky factorization, a (pseudo) LU factorization, and a QR fa…
▽ More
The algorithms in the current sequential numerical linear algebra libraries (e.g. LAPACK) do not parallelize well on multicore architectures. A new family of algorithms, the tile algorithms, has recently been introduced. Previous research has shown that it is possible to write efficient and scalable tile algorithms for performing a Cholesky factorization, a (pseudo) LU factorization, and a QR factorization. In this extended abstract, we attack the problem of the computation of the inverse of a symmetric positive definite matrix. We observe that, using a dynamic task scheduler, it is relatively painless to translate existing LAPACK code to obtain a ready-to-be-executed tile algorithm. However we demonstrate that non trivial compiler techniques (array renaming, loop reversal and pipelining) need then to be applied to further increase the parallelism of our application. We present preliminary experimental results.
△ Less
Submitted 22 February, 2010;
originally announced February 2010.
-
QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment
Authors:
Emmanuel Agullo,
Camille Coti,
Jack Dongarra,
Thomas Herault,
Julien Langou
Abstract:
Previous studies have reported that common dense linear algebra operations do not achieve speed up by using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers are still strongly predominant in high-performance computing and the use of grids for speeding up large-scale scientific problem…
▽ More
Previous studies have reported that common dense linear algebra operations do not achieve speed up by using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers are still strongly predominant in high-performance computing and the use of grids for speeding up large-scale scientific problems is limited to applications exhibiting parallelism at a higher level. We have identified two performance bottlenecks in the distributed memory algorithms implemented in ScaLAPACK, a state-of-the-art dense linear algebra library. First, because ScaLAPACK assumes a homogeneous communication network, the implementations of ScaLAPACK algorithms lack locality in their communication pattern. Second, the number of messages sent in the ScaLAPACK algorithms is significantly greater than other algorithms that trade flops for communication. In this paper, we present a new approach for computing a QR factorization -- one of the main dense linear algebra kernels -- of tall and skinny matrices in a grid computing environment that overcomes these two bottlenecks. Our contribution is to articulate a recently proposed algorithm (Communication Avoiding QR) with a topology-aware middleware (QCG-OMPI) in order to confine intensive communications (ScaLAPACK calls) within the different geographical sites. An experimental study conducted on the Grid'5000 platform shows that the resulting performance increases linearly with the number of geographical sites on large-scale problems (and is in particular consistently higher than ScaLAPACK's).
△ Less
Submitted 13 December, 2009;
originally announced December 2009.
-
Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion
Authors:
Fred G. Gustavson,
Jerzy Wasniewski,
Jack J. Dongarra,
Julien Langou
Abstract:
We describe a new data format for storing triangular, symmetric, and Hermitian matrices called RFPF (Rectangular Full Packed Format). The standard two dimensional arrays of Fortran and C (also known as full format) that are used to represent triangular and symmetric matrices waste nearly half of the storage space but provide high performance via the use of Level 3 BLAS. Standard packed format ar…
▽ More
We describe a new data format for storing triangular, symmetric, and Hermitian matrices called RFPF (Rectangular Full Packed Format). The standard two dimensional arrays of Fortran and C (also known as full format) that are used to represent triangular and symmetric matrices waste nearly half of the storage space but provide high performance via the use of Level 3 BLAS. Standard packed format arrays fully utilize storage (array space) but provide low performance as there is no Level 3 packed BLAS. We combine the good features of packed and full storage using RFPF to obtain high performance via using Level 3 BLAS as RFPF is a standard full format representation. Also, RFPF requires exactly the same minimal storage as packed format. Each LAPACK full and/or packed triangular, symmetric, and Hermitian routine becomes a single new RFPF routine based on eight possible data layouts of RFPF. This new RFPF routine usually consists of two calls to the corresponding LAPACK full format routine and two calls to Level 3 BLAS routines. This means {\it no} new software is required. As examples, we present LAPACK routines for Cholesky factorization, Cholesky solution and Cholesky inverse computation in RFPF to illustrate this new work and to describe its performance on several commonly used computer platforms. Performance of LAPACK full routines using RFPF versus LAPACK full routines using standard format for both serial and SMP parallel processing is about the same while using half the storage. Performance gains are roughly one to a factor of 43 for serial and one to a factor of 97 for SMP parallel times faster using vendor LAPACK full routines with RFPF than with using vendor and/or reference packed routines.
△ Less
Submitted 12 January, 2009;
originally announced January 2009.
-
Accelerating Scientific Computations with Mixed Precision Algorithms
Authors:
Marc Baboulin,
Alfredo Buttari,
Jack Dongarra,
Jakub Kurzak,
Julie Langou,
Julien Langou,
Piotr Luszczek,
Stanimire Tomov
Abstract:
On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here ca…
▽ More
On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here can apply not only to conventional processors but also to other technologies such as Field Programmable Gate Arrays (FPGA), Graphical Processing Units (GPU), and the STI Cell BE processor. Results on modern processor architectures and the STI Cell BE are presented.
△ Less
Submitted 20 August, 2008;
originally announced August 2008.
-
Algorithmic Based Fault Tolerance Applied to High Performance Computing
Authors:
George Bosilca,
Remi Delmas,
Jack Dongarra,
Julien Langou
Abstract:
We present a new approach to fault tolerance for High Performance Computing system. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the need of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flip) on the fly of a computation. To…
▽ More
We present a new approach to fault tolerance for High Performance Computing system. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the need of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flip) on the fly of a computation. To assess the viability of our approach, we have developed a fault tolerant matrix-matrix multiplication subroutine and we propose some models to predict its running time. Our parallel fault-tolerant matrix-matrix multiplication scores 1.4 TFLOPS on 484 processors (cluster jacquard.nersc.gov) and returns a correct result while one process failure has happened. This represents 65% of the machine peak efficiency and less than 12% overhead with respect to the fastest failure-free implementation. We predict (and have observed) that, as we increase the processor count, the overhead of the fault tolerance drops significantly.
△ Less
Submitted 18 June, 2008;
originally announced June 2008.
-
A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures
Authors:
Alfredo Buttari,
Julien Langou,
Jakub Kurzak,
Jack Dongarra
Abstract:
As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an oper…
▽ More
As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an operation. This paper presents an algorithm for the Cholesky, LU and QR factorization where the operations can be represented as a sequence of small tasks that operate on square blocks of data. These tasks can be dynamically scheduled for execution based on the dependencies among them and on the availability of computational resources. This may result in an out of order execution of the tasks which will completely hide the presence of intrinsically sequential tasks in the factorization. Performance comparisons are presented with the LAPACK algorithms where parallelism can only be exploited at the level of the BLAS operations and vendor implementations.
△ Less
Submitted 12 June, 2008; v1 submitted 9 September, 2007;
originally announced September 2007.