Showing 1–18 of 18 results for author: Langou, J

Searching in archive cs.
  1. arXiv:2404.16443  [pdf, ps, other]

    cs.CC cs.DC

    Tightening I/O Lower Bounds through the Hourglass Dependency Pattern

    Authors: Lionel Eyraud-Dubois, Guillaume Iooss, Julien Langou, Fabrice Rastello

    Abstract: When designing an algorithm, one cares about arithmetic/computational complexity, but data movement (I/O) complexity plays an increasingly important role that highly impacts performance and energy consumption. For a given algorithm and a given I/O model, scheduling strategies such as loop tiling can reduce the required I/O down to a limit, called the I/O complexity, inherent to the algorithm itsel…

    Submitted 25 April, 2024; originally announced April 2024.

    Journal ref: 36th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '24), Jun 2024, Nantes, France

  2. arXiv:2207.09281  [pdf, other]

    cs.MS

    Proposed Consistent Exception Handling for the BLAS and LAPACK

    Authors: James Demmel, Jack Dongarra, Mark Gates, Greg Henry, Julien Langou, Xiaoye Li, Piotr Luszczek, Weslley Pereira, Jason Riedy, Cindy Rubio-González

    Abstract: Numerical exceptions, which may be caused by overflow, operations like division by 0 or sqrt(-1), or convergence failures, are unavoidable in many cases, in particular when software is used on unforeseen and difficult inputs. As more aspects of society become automated, e.g., self-driving cars, health monitors, and cyber-physical systems more generally, it is becoming increasingly important to des…

    Submitted 19 July, 2022; originally announced July 2022.

  3. arXiv:2202.10217  [pdf, ps, other]

    cs.DC

    I/O-Optimal Algorithms for Symmetric Linear Algebra Kernels

    Authors: Olivier Beaumont, Lionel Eyraud-Dubois, Mathieu Vérité, Julien Langou

    Abstract: In this paper, we consider two fundamental symmetric kernels in linear algebra: the Cholesky factorization and the symmetric rank-$k$ update (SYRK), with the classical three nested loops algorithms for these kernels. In addition, we consider a machine model with a fast memory of size $S$ and an unbounded slow memory. In this model, all computations must be performed on operands in fast memory, and…

    Submitted 21 February, 2022; originally announced February 2022.

  4. arXiv:1911.06664  [pdf, other]

    cs.CC

    Automated Derivation of Parametric Data Movement Lower Bounds for Affine Programs

    Authors: Auguste Olivry, Julien Langou, Louis-Noël Pouchet, P. Sadayappan, Fabrice Rastello

    Abstract: For most relevant computation, the energy and time needed for data movement dominates that for performing arithmetic operations on all computing systems today. Hence it is of critical importance to understand the minimal total data movement achievable during the execution of an algorithm. The achieved total data movement for different schedules of an algorithm can vary widely depending on how effi…

    Submitted 15 November, 2019; originally announced November 2019.

  5. arXiv:1702.02017  [pdf, ps, other]

    cs.CC

    A Tight I/O Lower Bound for Matrix Multiplication

    Authors: Tyler Michael Smith, Bradley Lowery, Julien Langou, Robert A. van de Geijn

    Abstract: A tight lower bound for required I/O when computing an ordinary matrix-matrix multiplication on a processor with two layers of memory is established. Prior work obtained weaker lower bounds by reasoning about the number of segments needed to perform $C:=AB$, for distinct matrices $A$, $B$, and $C$, where each segment is a series of operations involving $M$ reads and writes to and from fast memory,…

    Submitted 6 February, 2019; v1 submitted 3 February, 2017; originally announced February 2017.

  6. arXiv:1611.06892  [pdf, other]

    cs.MS math.NA math.RA

    Bidiagonalization with Parallel Tiled Algorithms

    Authors: Mathieu Faverge, Julien Langou, Yves Robert, Jack Dongarra

    Abstract: We consider algorithms for going from a "full" matrix to a condensed "band bidiagonal" form using orthogonal transformations. We use the framework of "algorithms by tiles". Within this framework, we study: (i) the tiled bidiagonalization algorithm BiDiag, which is a tiled version of the standard scalar bidiagonalization algorithm; and (ii) the R-bidiagonalization algorithm R-BiDiag, which is a til…

    Submitted 18 November, 2016; originally announced November 2016.

  7. arXiv:1511.04478  [pdf, ps, other]

    cs.DS

    A Backward/Forward Recovery Approach for the Preconditioned Conjugate Gradient Method

    Authors: Massimiliano Fasi, Julien Langou, Yves Robert, Bora Ucar

    Abstract: Several recent papers have introduced a periodic verification mechanism to detect silent errors in iterative solvers. Chen [PPoPP'13, pp. 167--176] has shown how to combine such a verification mechanism (a stability test checking the orthogonality of two vectors and recomputing the residual) with checkpointing: the idea is to verify every $d$ iterations, and to checkpoint every $c \times d$ iterat…

    Submitted 13 November, 2015; originally announced November 2015.

  8. arXiv:1510.05107  [pdf, other]

    cs.DC

    A Makespan Lower Bound for the Scheduling of the Tiled Cholesky Factorization based on ALAP scheduling

    Authors: Willy Quach, Julien Langou

    Abstract: Due to the advent of multicore architectures and massive parallelism, the tiled Cholesky factorization algorithm has recently received plenty of attention and is often referenced by practitioners as a case study. It is also implemented in mainstream dense linear algebra libraries. However, we note that theoretical study of the parallelism of this algorithm is currently lacking. In this paper, we p…

    Submitted 17 October, 2015; originally announced October 2015.

  9. arXiv:1310.4645  [pdf, other]

    cs.DC

    A Greedy Algorithm for Optimally Pipelining a Reduction

    Authors: Bradley R. Lowery, Julien Langou

    Abstract: Collective communications are ubiquitous in parallel applications. We present two new algorithms for performing a reduction. The operation associated with our reduction needs to be associative and commutative. The two algorithms are developed under two different communication models (unidirectional and bidirectional). Both algorithms use a greedy scheduling scheme. For a unidirectional, fully conn…

    Submitted 17 October, 2013; originally announced October 2013.

    Comments: 17 pages

  10. arXiv:1110.1553  [pdf, other]

    cs.DC

    Hierarchical QR factorization algorithms for multi-core cluster systems

    Authors: Jack Dongarra, Mathieu Faverge, Thomas Herault, Julien Langou, Yves Robert

    Abstract: This paper describes a new QR factorization algorithm which is especially designed for massively parallel platforms combining parallel distributed multi-core nodes. These platforms make the present and the foreseeable future of high-performance computing. Our new QR factorization algorithm falls in the category of the tile algorithms which naturally enables good data locality for the sequential ke…

    Submitted 7 October, 2011; originally announced October 2011.

  11. arXiv:1104.4475  [pdf, ps, other]

    cs.DC

    Tiled QR factorization algorithms

    Authors: Henricus Bouwmeester, Mathias Jacquelin, Julien Langou, Yves Robert

    Abstract: This work revisits existing algorithms for the QR factorization of rectangular matrices composed of p-by-q tiles, where p >= q. Within this framework, we study the critical paths and performance of algorithms such as Sameh and Kuck, Modi and Clarke, Greedy, and those found within PLASMA. Although neither Modi and Clarke nor Greedy is optimal, both are shown to be asymptotically optimal for all mat…

    Submitted 22 April, 2011; originally announced April 2011.

  12. arXiv:1010.2000  [pdf, other]

    cs.DC

    A Critical Path Approach to Analyzing Parallelism of Algorithmic Variants. Application to Cholesky Inversion

    Authors: Henricus Bouwmeester, Julien Langou

    Abstract: Algorithms come with multiple variants which are obtained by changing the mathematical approach from which the algorithm is derived. These variants offer a wide spectrum of performance when implemented on a multicore platform and we seek to understand these differences in performances from a theoretical point of view. To that aim, we derive and present the critical path lengths of each algorithmic…

    Submitted 11 October, 2010; originally announced October 2010.

  13. arXiv:1002.4057  [pdf, ps, other]

    cs.MS math.NA

    Towards an Efficient Tile Matrix Inversion of Symmetric Positive Definite Matrices on Multicore Architectures

    Authors: Emmanuel Agullo, Henricus Bouwmeester, Jack Dongarra, Jakub Kurzak, Julien Langou, Lee Rosenberg

    Abstract: The algorithms in the current sequential numerical linear algebra libraries (e.g. LAPACK) do not parallelize well on multicore architectures. A new family of algorithms, the tile algorithms, has recently been introduced. Previous research has shown that it is possible to write efficient and scalable tile algorithms for performing a Cholesky factorization, a (pseudo) LU factorization, and a QR fa…

    Submitted 22 February, 2010; originally announced February 2010.

    Comments: 8 pages, extended abstract submitted to VecPar10 on 12/11/09, notification of acceptance received on 02/05/10. See: http://vecpar.fe.up.pt/2010/

  14. QR Factorization of Tall and Skinny Matrices in a Grid Computing Environment

    Authors: Emmanuel Agullo, Camille Coti, Jack Dongarra, Thomas Herault, Julien Langou

    Abstract: Previous studies have reported that common dense linear algebra operations do not achieve speed up by using multiple geographical sites of a computational grid. Because such operations are the building blocks of most scientific applications, conventional supercomputers are still strongly predominant in high-performance computing and the use of grids for speeding up large-scale scientific problem…

    Submitted 13 December, 2009; originally announced December 2009.

    Comments: Accepted at IPDPS10. (IEEE International Parallel & Distributed Processing Symposium 2010 in Atlanta, GA, USA.)

  15. arXiv:0901.1696  [pdf, ps, other]

    cs.MS cs.DS

    Rectangular Full Packed Format for Cholesky's Algorithm: Factorization, Solution and Inversion

    Authors: Fred G. Gustavson, Jerzy Wasniewski, Jack J. Dongarra, Julien Langou

    Abstract: We describe a new data format for storing triangular, symmetric, and Hermitian matrices called RFPF (Rectangular Full Packed Format). The standard two dimensional arrays of Fortran and C (also known as full format) that are used to represent triangular and symmetric matrices waste nearly half of the storage space but provide high performance via the use of Level 3 BLAS. Standard packed format ar…

    Submitted 12 January, 2009; originally announced January 2009.

  16. Accelerating Scientific Computations with Mixed Precision Algorithms

    Authors: Marc Baboulin, Alfredo Buttari, Jack Dongarra, Jakub Kurzak, Julie Langou, Julien Langou, Piotr Luszczek, Stanimire Tomov

    Abstract: On modern architectures, the performance of 32-bit operations is often at least twice as fast as the performance of 64-bit operations. By using a combination of 32-bit and 64-bit floating point arithmetic, the performance of many dense and sparse linear algebra algorithms can be significantly enhanced while maintaining the 64-bit accuracy of the resulting solution. The approach presented here ca…

    Submitted 20 August, 2008; originally announced August 2008.

  17. arXiv:0806.3121  [pdf, other]

    cs.DC cs.MS

    Algorithmic Based Fault Tolerance Applied to High Performance Computing

    Authors: George Bosilca, Remi Delmas, Jack Dongarra, Julien Langou

    Abstract: We present a new approach to fault tolerance for High Performance Computing system. Our approach is based on a careful adaptation of the Algorithmic Based Fault Tolerance technique (Huang and Abraham, 1984) to the need of parallel distributed computation. We obtain a strongly scalable mechanism for fault tolerance. We can also detect and correct errors (bit-flip) on the fly of a computation. To…

    Submitted 18 June, 2008; originally announced June 2008.

  18. arXiv:0709.1272  [pdf, other]

    cs.MS cs.DC

    A Class of Parallel Tiled Linear Algebra Algorithms for Multicore Architectures

    Authors: Alfredo Buttari, Julien Langou, Jakub Kurzak, Jack Dongarra

    Abstract: As multicore systems continue to gain ground in the High Performance Computing world, linear algebra algorithms have to be reformulated or new algorithms have to be developed in order to take advantage of the architectural features on these new processors. Fine grain parallelism becomes a major requirement and introduces the necessity of loose synchronization in the parallel execution of an oper…

    Submitted 12 June, 2008; v1 submitted 9 September, 2007; originally announced September 2007.

    Report number: LAPACK Working Note 191