Showing 1–18 of 18 results for author: van de Geijn, R

  1. arXiv:2311.10700  [pdf, ps, other]

    cs.MS

    Deriving Algorithms for Triangular Tridiagonalization a (Skew-)Symmetric Matrix

    Authors: Robert van de Geijn, Maggie Myers, RuQing G. Xu, Devin Matthews

    Abstract: We apply the FLAME methodology to derive algorithms hand in hand with their proofs of correctness for the computation of the $ L T L^T $ decomposition (with and without pivoting) of a skew-symmetric matrix. The approach yields known as well as new algorithms, presented using the FLAME notation. A number of BLAS-like primitives are exposed at the core of blocked algorithms that can attain high perf…

    Submitted 17 November, 2023; originally announced November 2023.

    Comments: 28 pages

  2. arXiv:2304.03068  [pdf, ps, other]

    cs.MS

    Formal Derivation of LU Factorization with Pivoting

    Authors: Robert van de Geijn, Maggie Myers

    Abstract: The FLAME methodology for deriving linear algebra algorithms from specification, first introduced around 2000, has been successfully applied to a broad cross section of operations. An open question has been whether it can yield algorithms for the best-known operation in linear algebra, LU factorization with partial pivoting (Gaussian elimination with row swapping). This paper shows that it can.

    Submitted 6 April, 2023; originally announced April 2023.

    Comments: 30 pages

    ACM Class: G.4

  3. arXiv:2303.04353  [pdf, other]

    cs.MS

    Cascading GEMM: High Precision from Low Precision

    Authors: Devangi N. Parikh, Robert A. van de Geijn, Greg M. Henry

    Abstract: This paper lays out insights and opportunities for implementing higher-precision matrix-matrix multiplication (GEMM) from (in terms of) lower-precision high-performance GEMM. The driving case study approximates double-double precision (FP64x2) GEMM in terms of double precision (FP64) GEMM, leveraging how the BLAS-like Library Instantiation Software (BLIS) framework refactors the Goto Algorithm. Wi…

    Submitted 7 March, 2023; originally announced March 2023.

    Comments: 26 pages, 9 figures

    ACM Class: G.4

  4. arXiv:2302.08417  [pdf, other]

    cs.MS

    GEMMFIP: Unifying GEMM in BLIS

    Authors: RuQing G. Xu, Field G. Van Zee, Robert A. van de Geijn

    Abstract: Matrix libraries often focus on achieving high performance for problems considered to be either "small" or "large", as these two scenarios tend to respond best to different optimization strategies. We propose a unified technique for implementing matrix operations like general matrix multiplication (GEMM) that can achieve high performance for both small and large problem sizes. The key is to fuse p…

    Submitted 16 February, 2023; v1 submitted 16 February, 2023; originally announced February 2023.

    Comments: 16 pages, 7 figures, 2 algorithms

    ACM Class: G.4

  5. arXiv:1904.05717  [pdf, ps, other]

    cs.MS

    The MOMMS Family of Matrix Multiplication Algorithms

    Authors: Tyler M. Smith, Robert A. van de Geijn

    Abstract: As the ratio between the rate of computation and rate with which data can be retrieved from various layers of memory continues to deteriorate, a question arises: Will the current best algorithms for computing matrix-matrix multiplication on future CPUs continue to be (near) optimal? This paper provides compelling analytical and empirical evidence that the answer is "no". The analytical results gui…

    Submitted 11 April, 2019; originally announced April 2019.

  6. arXiv:1901.06015  [pdf, other]

    cs.MS

    Supporting mixed-datatype matrix multiplication within the BLIS framework

    Authors: Field G. Van Zee, Devangi N. Parikh, Robert A. van de Geijn

    Abstract: We approach the problem of implementing mixed-datatype support within the general matrix multiplication (GEMM) operation of the BLIS framework, whereby each matrix operand A, B, and C may be stored as single- or double-precision real or complex values. Another factor of complexity, whereby the computation is allowed to take place in a precision different from the storage precisions of either A or…

    Submitted 1 May, 2019; v1 submitted 17 January, 2019; originally announced January 2019.

    Report number: FLAME Working Note #89, The University of Texas at Austin, Department of Computer Science, Technical Report TR-19-01

  7. arXiv:1808.07984  [pdf, other]

    cs.MS

    Implementing Strassen's Algorithm with CUTLASS on NVIDIA Volta GPUs

    Authors: Jianyu Huang, Chenhan D. Yu, Robert A. van de Geijn

    Abstract: Conventional GPU implementations of Strassen's algorithm (Strassen) typically rely on the existing high-performance matrix multiplication (GEMM), trading space for time. As a result, such approaches can only achieve practical speedup for relatively large, "squarish" matrices due to the extra memory overhead, and their usages are limited due to the considerable workspace. We present novel Strassen…

    Submitted 23 August, 2018; originally announced August 2018.

    Report number: FLAME Working Note #88, The University of Texas at Austin, Department of Computer Science, Technical Report TR-18-08

  8. arXiv:1808.07832  [pdf, ps, other]

    cs.PL cs.LO cs.MS

    A Simple Methodology for Computing Families of Algorithms

    Authors: Devangi N. Parikh, Margaret E. Myers, Richard Vuduc, Robert A. van de Geijn

    Abstract: Discovering "good" algorithms for an operation is often considered an art best left to experts. What if there is a simple methodology, an algorithm, for systematically deriving a family of algorithms as well as their cost analyses, so that the best algorithm can be chosen? We discuss such an approach for deriving loop-based algorithms. The example used to illustrate this methodology, evaluation of…

    Submitted 20 August, 2018; originally announced August 2018.

    Report number: FLAME Working Note #87, The University of Texas at Austin, Department of Computer Science, Technical Report TR-18-06

  9. arXiv:1710.04286  [pdf, ps, other]

    cs.MS

    Deriving Correct High-Performance Algorithms

    Authors: Devangi N. Parikh, Maggie E. Myers, Robert A. van de Geijn

    Abstract: Dijkstra observed that verifying correctness of a program is difficult and conjectured that derivation of a program hand-in-hand with its proof of correctness was the answer. We illustrate this goal-oriented approach by applying it to the domain of dense linear algebra libraries for distributed memory parallel computers. We show that algorithms that underlie the implementation of most functionalit…

    Submitted 11 October, 2017; originally announced October 2017.

    Report number: FLAME Working Note #86, The University of Texas at Austin, Department of Computer Science, Technical Report TR-17-07

  10. arXiv:1704.03092  [pdf, other]

    cs.MS

    Strassen's Algorithm for Tensor Contraction

    Authors: Jianyu Huang, Devin A. Matthews, Robert A. van de Geijn

    Abstract: Tensor contraction (TC) is an important computational kernel widely used in numerous applications. It is a multi-dimensional generalization of matrix multiplication (GEMM). While Strassen's algorithm for GEMM is well studied in theory and practice, extending it to accelerate TC has not been previously pursued. Thus, we believe this to be the first paper to demonstrate how one can in practice speed…

    Submitted 10 April, 2017; originally announced April 2017.

    Report number: FLAME Working Note #84, The University of Texas at Austin, Department of Computer Science, Technical Report TR-17-02

  11. arXiv:1702.02017  [pdf, ps, other]

    cs.CC

    A Tight I/O Lower Bound for Matrix Multiplication

    Authors: Tyler Michael Smith, Bradley Lowery, Julien Langou, Robert A. van de Geijn

    Abstract: A tight lower bound for required I/O when computing an ordinary matrix-matrix multiplication on a processor with two layers of memory is established. Prior work obtained weaker lower bounds by reasoning about the number of segments needed to perform $C:=AB$, for distinct matrices $A$, $B$, and $C$, where each segment is a series of operations involving $M$ reads and writes to and from fast memory,…

    Submitted 6 February, 2019; v1 submitted 3 February, 2017; originally announced February 2017.

  12. arXiv:1611.08035  [pdf, other]

    cs.MS

    Automating the Last-Mile for High Performance Dense Linear Algebra

    Authors: Richard Michael Veras, Tze Meng Low, Tyler Michael Smith, Robert van de Geijn, Franz Franchetti

    Abstract: High performance dense linear algebra (DLA) libraries often rely on a general matrix multiply (Gemm) kernel that is implemented using assembly or with vector intrinsics. In particular, the real-valued Gemm kernels provide the overwhelming fraction of performance for the complex-valued Gemm kernels, along with the entire level-3 BLAS and many of the real and complex LAPACK routines. Thus, achieving…

    Submitted 28 April, 2017; v1 submitted 23 November, 2016; originally announced November 2016.

  13. arXiv:1611.06365  [pdf, other]

    cs.DC cs.MS cs.PF

    A Case for Malleable Thread-Level Linear Algebra Libraries: The LU Factorization with Partial Pivoting

    Authors: Sandra Catalán, José R. Herrero, Enrique S. Quintana-Ortí, Rafael Rodríguez-Sánchez, Robert van de Geijn

    Abstract: We propose two novel techniques for overcoming load-imbalance encountered when implementing so-called look-ahead mechanisms in relevant dense matrix factorizations for the solution of linear systems. Both techniques target the scenario where two thread teams are created/activated during the factorization, with each team in charge of performing an independent task/branch of execution. The first tec…

    Submitted 19 November, 2016; originally announced November 2016.

  14. arXiv:1611.01120  [pdf, other]

    cs.MS

    Generating Families of Practical Fast Matrix Multiplication Algorithms

    Authors: Jianyu Huang, Leslie Rice, Devin A. Matthews, Robert A. van de Geijn

    Abstract: Matrix multiplication (GEMM) is a core operation to numerous scientific applications. Traditional implementations of Strassen-like fast matrix multiplication (FMM) algorithms often do not perform well except for very large matrix sizes, due to the increased cost of memory movement, which is particularly noticeable for non-square matrices. Such implementations also require considerable workspace an…

    Submitted 3 November, 2016; originally announced November 2016.

    Report number: FLAME Working Note #82, The University of Texas at Austin, Department of Computer Science, Technical Report TR-16-18

  15. arXiv:1609.00076  [pdf, other]

    cs.MS

    BLISlab: A Sandbox for Optimizing GEMM

    Authors: Jianyu Huang, Robert A. van de Geijn

    Abstract: Matrix-matrix multiplication is a fundamental operation of great importance to scientific computing and, increasingly, machine learning. It is a simple enough concept to be introduced in a typical high school algebra course yet in practice important enough that its implementation on computers continues to be an active research topic. This note describes a set of exercises that use this operation t…

    Submitted 31 August, 2016; originally announced September 2016.

    Comments: FLAME Working Note #80

  16. arXiv:1605.01078  [pdf, other]

    cs.MS

    Implementing Strassen's Algorithm with BLIS

    Authors: Jianyu Huang, Tyler M. Smith, Greg M. Henry, Robert A. van de Geijn

    Abstract: We dispel with "street wisdom" regarding the practical implementation of Strassen's algorithm for matrix-matrix multiplication (DGEMM). Conventional wisdom: it is only practical for very large matrices. Our implementation is practical for small matrices. Conventional wisdom: the matrices being multiplied should be relatively square. Our implementation is practical for rank-k updates, where k is re…

    Submitted 3 May, 2016; originally announced May 2016.

    Report number: FLAME Working Note #79, The University of Texas at Austin, Department of Computer Sciences, Technical Report TR-16-03

  17. arXiv:1512.02671  [pdf, other]

    math.NA

    Householder QR Factorization with Randomization for Column Pivoting (HQRRP). FLAME Working Note #78

    Authors: Per-Gunnar Martinsson, Gregorio Quintana-Orti, Nathan Heavner, Robert van de Geijn

    Abstract: A fundamental problem when adding column pivoting to the Householder QR factorization is that only about half of the computation can be cast in terms of high performing matrix-matrix multiplications, which greatly limits the benefits that can be derived from so-called blocking of algorithms. This paper describes a technique for selecting groups of pivot vectors by means of randomized projections.…

    Submitted 6 December, 2016; v1 submitted 8 December, 2015; originally announced December 2015.

    Report number: FLAME Working Note #78

  18. arXiv:1301.7744  [pdf, ps, other]

    math.NA cs.MS

    Exploiting Symmetry in Tensors for High Performance: Multiplication with Symmetric Tensors

    Authors: Martin D. Schatz, Tze Meng Low, Robert A. van de Geijn, Tamara G. Kolda

    Abstract: Symmetric tensor operations arise in a wide variety of computations. However, the benefits of exploiting symmetry in order to reduce storage and computation is in conflict with a desire to simplify memory access patterns. In this paper, we propose a blocked data structure (Blocked Compact Symmetric Storage) wherein we consider the tensor by blocks and store only the unique blocks of a symmetric te…

    Submitted 9 April, 2014; v1 submitted 31 January, 2013; originally announced January 2013.

    MSC Class: 15-02 (Primary)

    Journal ref: SIAM Journal on Scientific Computing, Vol. 36, No. 5, pp. C453-C479, September 2014