Search | arXiv e-print repository

doi 10.1145/3539781.3539784

Communication Bounds for Convolutional Neural Networks

Authors: Anthony Chen, James Demmel, Grace Dinh, Mason Haberle, Olga Holtz

Abstract: Convolutional neural networks (CNNs) are important in a wide variety of machine learning tasks and applications, so optimizing their performance is essential. Moving words of data between levels of a memory hierarchy or between processors on a network is much more expensive than the cost of arithmetic, so minimizing communication is critical to optimizing performance. In this paper, we present new… ▽ More Convolutional neural networks (CNNs) are important in a wide variety of machine learning tasks and applications, so optimizing their performance is essential. Moving words of data between levels of a memory hierarchy or between processors on a network is much more expensive than the cost of arithmetic, so minimizing communication is critical to optimizing performance. In this paper, we present new lower bounds on data movement for mixed precision convolutions in both single-processor and parallel distributed memory models, as well as algorithms that outperform current implementations such as Im2Col. We obtain performance figures using GEMMINI, a machine learning accelerator, where our tiling provides improvements between 13% and 150% over a vendor supplied algorithm. △ Less

Submitted 18 April, 2022; originally announced April 2022.

Journal ref: PASC '22: Proceedings of the Platform for Advanced Scientific Computing Conference June 2022 Article No. 1 Pages 1-10

arXiv:2008.03759 [pdf, other]

Sparsifying the Operators of Fast Matrix Multiplication Algorithms

Authors: Gal Beniamini, Nathan Cheng, Olga Holtz, Elaye Karstadt, Oded Schwartz

Abstract: Fast matrix multiplication algorithms may be useful, provided that their running time is good in practice. Particularly, the leading coefficient of their arithmetic complexity needs to be small. Many sub-cubic algorithms have large leading coefficients, rendering them impractical. Karstadt and Schwartz (SPAA'17, JACM'20) demonstrated how to reduce these coefficients by sparsifying an algorithm's b… ▽ More Fast matrix multiplication algorithms may be useful, provided that their running time is good in practice. Particularly, the leading coefficient of their arithmetic complexity needs to be small. Many sub-cubic algorithms have large leading coefficients, rendering them impractical. Karstadt and Schwartz (SPAA'17, JACM'20) demonstrated how to reduce these coefficients by sparsifying an algorithm's bilinear operator. Unfortunately, the problem of finding optimal sparsifications is NP-Hard. We obtain three new methods to this end, and apply them to existing fast matrix multiplication algorithms, thus improving their leading coefficients. These methods have an exponential worst case running time, but run fast in practice and improve the performance of many fast matrix multiplication algorithms. Two of the methods are guaranteed to produce leading coefficients that, under some assumptions, are optimal. △ Less

Submitted 9 August, 2020; originally announced August 2020.

ACM Class: F.2.1; F.2.1

arXiv:1209.2184 [pdf, other]

doi 10.1007/978-3-642-34862-4_2

Graph Expansion Analysis for Communication Costs of Fast Rectangular Matrix Multiplication

Authors: Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, Oded Schwartz

Abstract: Graph expansion analysis of computational DAGs is useful for obtaining communication cost lower bounds where previous methods, such as geometric embedding, are not applicable. This has recently been demonstrated for Strassen's and Strassen-like fast square matrix multiplication algorithms. Here we extend the expansion analysis approach to fast algorithms for rectangular matrix multiplication, obta… ▽ More Graph expansion analysis of computational DAGs is useful for obtaining communication cost lower bounds where previous methods, such as geometric embedding, are not applicable. This has recently been demonstrated for Strassen's and Strassen-like fast square matrix multiplication algorithms. Here we extend the expansion analysis approach to fast algorithms for rectangular matrix multiplication, obtaining a new class of communication cost lower bounds. These apply, for example to the algorithms of Bini et al. (1979) and the algorithms of Hopcroft and Kerr (1971). Some of our bounds are proved to be optimal. △ Less

Submitted 10 September, 2012; originally announced September 2012.

Journal ref: Design and Analysis of Algorithms Volume 7659, 2012, pp 13-36

arXiv:1202.3177 [pdf, other]

Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds

Authors: Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, Oded Schwartz

Abstract: A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's fast matrix multiplication (Ba… ▽ More A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz, 2012). Both algorithms scale perfectly, but only up to some number of processors where the inter-processor communication no longer scales. We obtain a memory-independent communication cost lower bound on classical and Strassen-based distributed-memory matrix multiplication algorithms. These bounds imply that no classical or Strassen-based parallel matrix multiplication algorithm can strongly scale perfectly beyond the ranges already attained by the two parallel algorithms mentioned above. The memory-independent bounds and the strong scaling bounds generalize to other algorithms. △ Less

Submitted 14 February, 2012; originally announced February 2012.

Comments: 4 pages, 1 figure

MSC Class: 68W10; 68W40 ACM Class: F.2.1

arXiv:1202.3173 [pdf, other]

Communication-Optimal Parallel Algorithm for Strassen's Matrix Multiplication

Authors: Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, Oded Schwartz

Abstract: Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A criti… ▽ More Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA'11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range. Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n=94080, where the number of nodes ranges from 49 to 7203. Our parallelization approach generalizes to other fast matrix multiplication algorithms. △ Less

Submitted 14 February, 2012; originally announced February 2012.

Comments: 13 pages, 3 figures

MSC Class: 68W40; 68W10 ACM Class: F.2.1

arXiv:1109.1693 [pdf, ps, other]

doi 10.1145/1989493.1989495

Graph Expansion and Communication Costs of Fast Matrix Multiplication

Authors: Grey Ballard, James Demmel, Olga Holtz, Oded Schwartz

Abstract: The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain first lower bounds on their communication costs. In the sequential case, where the processor has a fast memory of size $M$, too small to… ▽ More The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain first lower bounds on their communication costs. In the sequential case, where the processor has a fast memory of size $M$, too small to store three $n$-by-$n$ matrices, the lower bound on the number of words moved between fast and slow memory is, for many of the matrix multiplication algorithms, $Ω((\frac{n}{\sqrt M})^{ω_0}\cdot M)$, where $ω_0$ is the exponent in the arithmetic count (e.g., $ω_0 = \lg 7$ for Strassen, and $ω_0 = 3$ for conventional matrix multiplication). With $p$ parallel processors, each with fast memory of size $M$, the lower bound is $p$ times smaller. These bounds are attainable both for sequential and for parallel algorithms and hence optimal. These bounds can also be attained by many fast algorithms in linear algebra (e.g., algorithms for LU, QR, and solving the Sylvester equation). △ Less

Submitted 8 September, 2011; originally announced September 2011.

Report number: UCB/EECS-2011-40 ACM Class: F.2.1

Journal ref: Proceedings of the 23rd annual symposium on parallelism in algorithms and architectures. ACM, 1-12. 2011 (a shorter conference version)

arXiv:0906.0687 [pdf, ps, other]

doi 10.4171/077-1/16

Computational Complexity and Numerical Stability of Linear Problems

Authors: Olga Holtz, Noam Shomron

Abstract: We survey classical and recent developments in numerical linear algebra, focusing on two issues: computational complexity, or arithmetic costs, and numerical stability, or performance under roundoff error. We present a brief account of the algebraic complexity theory as well as the general error analysis for matrix multiplication and related problems. We emphasize the central role played by the… ▽ More We survey classical and recent developments in numerical linear algebra, focusing on two issues: computational complexity, or arithmetic costs, and numerical stability, or performance under roundoff error. We present a brief account of the algebraic complexity theory as well as the general error analysis for matrix multiplication and related problems. We emphasize the central role played by the matrix multiplication problem and discuss historical and modern approaches to its solution. △ Less

Submitted 11 September, 2009; v1 submitted 3 June, 2009; originally announced June 2009.

Comments: 16 pages; updated to reflect referees' remarks; to appear in Proceedings of the 5th European Congress of Mathematics

Journal ref: European Congress of Mathematics Amsterdam, 14-18 July, 2008, EMS Publishing House, pp. 381-400

arXiv:0905.2485 [pdf, ps, other]

doi 10.1137/090769156

Minimizing Communication in Linear Algebra

Authors: Grey Ballard, James Demmel, Olga Holtz, Oded Schwartz

Abstract: In 1981 Hong and Kung proved a lower bound on the amount of communication needed to perform dense, matrix-multiplication using the conventional $O(n^3)$ algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and extended it to the parallel case. In both cases the lower bound may be expressed as $Ω$(#ar… ▽ More In 1981 Hong and Kung proved a lower bound on the amount of communication needed to perform dense, matrix-multiplication using the conventional $O(n^3)$ algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and extended it to the parallel case. In both cases the lower bound may be expressed as $Ω$(#arithmetic operations / $\sqrt{M}$), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, $LDL^T$ factorization, QR factorization, algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and the SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain. △ Less

Submitted 15 May, 2009; originally announced May 2009.

Comments: 27 pages, 2 tables

Journal ref: SIAM. J. Matrix Anal. & Appl. 32 (2011), no. 3, 866-901

arXiv:0902.2537 [pdf, other]

doi 10.1137/090760969

Communication-optimal Parallel and Sequential Cholesky Decomposition

Authors: Grey Ballard, James Demmel, Olga Holtz, Oded Schwartz

Abstract: Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower… ▽ More Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional (O(n^3)) matrix multiplication to Cholesky factorization, which is used for solving dense symmetric positive definite linear systems. Second, we compare the costs of various Cholesky decomposition implementations to these lower bounds and identify the algorithms and data structures that attain them. In the sequential case, we consider both the two-level and hierarchical memory models. Combined with prior results in [13, 14, 15], this gives a set of communication-optimal algorithms for O(n^3) implementations of the three basic factorizations of dense linear algebra: LU with pivoting, QR and Cholesky. But it goes beyond this prior work on sequential LU by optimizing communication for any number of levels of memory hierarchy. △ Less

Submitted 12 April, 2010; v1 submitted 15 February, 2009; originally announced February 2009.

Comments: 29 pages, 2 tables, 6 figures

ACM Class: F.2.1

Journal ref: SIAM J. Sci. Comput. 32, (2010) pp. 3495-3523

arXiv:0812.3137 [pdf, ps, other]

Compressive sensing: a paradigm shift in signal processing

Authors: Olga Holtz

Abstract: We survey a new paradigm in signal processing known as "compressive sensing". Contrary to old practices of data acquisition and reconstruction based on the Shannon-Nyquist sampling principle, the new theory shows that it is possible to reconstruct images or signals of scientific interest accurately and even exactly from a number of samples which is far smaller than the desired resolution of the… ▽ More We survey a new paradigm in signal processing known as "compressive sensing". Contrary to old practices of data acquisition and reconstruction based on the Shannon-Nyquist sampling principle, the new theory shows that it is possible to reconstruct images or signals of scientific interest accurately and even exactly from a number of samples which is far smaller than the desired resolution of the image/signal, e.g., the number of pixels in the image. This new technique draws from results in several fields of mathematics, including algebra, optimization, probability theory, and harmonic analysis. We will discuss some of the key mathematical ideas behind compressive sensing, as well as its implications to other fields: numerical analysis, information theory, theoretical computer science, and engineering. △ Less

Submitted 16 December, 2008; originally announced December 2008.

Comments: A short survey of compressive sensing

MSC Class: 90C05; 90C25; 65F50; 94A08; 94A20; 68P30; 65Y20

arXiv:0712.4027 [pdf, ps, other]

doi 10.1017/S0962492906350015

Accurate and Efficient Expression Evaluation and Linear Algebra

Authors: James Demmel, Ioana Dumitriu, Olga Holtz, Plamen Koev

Abstract: We survey and unify recent results on the existence of accurate algorithms for evaluating multivariate polynomials, and more generally for accurate numerical linear algebra with structured matrices. By "accurate" we mean that the computed answer has relative error less than 1, i.e., has some correct leading digits. We also address efficiency, by which we mean algorithms that run in polynomial ti… ▽ More We survey and unify recent results on the existence of accurate algorithms for evaluating multivariate polynomials, and more generally for accurate numerical linear algebra with structured matrices. By "accurate" we mean that the computed answer has relative error less than 1, i.e., has some correct leading digits. We also address efficiency, by which we mean algorithms that run in polynomial time in the size of the input. Our results will depend strongly on the model of arithmetic: Most of our results will use the so-called Traditional Model (TM). We give a set of necessary and sufficient conditions to decide whether a high accuracy algorithm exists in the TM, and describe progress toward a decision procedure that will take any problem and provide either a high accuracy algorithm or a proof that none exists. When no accurate algorithm exists in the TM, it is natural to extend the set of available accurate operations by a library of additional operations, such as $x+y+z$, dot products, or indeed any enumerable set which could then be used to build further accurate algorithms. We show how our accurate algorithms and decision procedure for finding them extend to this case. Finally, we address other models of arithmetic, and the relationship between (im)possibility in the TM and (in)efficient algorithms operating on numbers represented as bit strings. △ Less

Submitted 24 December, 2007; originally announced December 2007.

Comments: 49 pages, 6 figures, 1 table

MSC Class: 65Y20; 68Q05; 68Q25; 65F30; 68W40; 68W25

Journal ref: Acta Numerica, Volume 17, May 2008, pp 87-145

arXiv:math/0612264 [pdf, ps, other]

doi 10.1007/s00211-007-0114-x

Fast linear algebra is stable

Authors: James Demmel, Ioana Dumitriu, Olga Holtz

Abstract: In an earlier paper, we showed that a large class of fast recursive matrix multiplication algorithms is stable in a normwise sense, and that in fact if multiplication of $n$-by-$n$ matrices can be done by any algorithm in $O(n^{ω+ η})$ operations for any $η> 0$, then it can be done stably in $O(n^{ω+ η})$ operations for any $η> 0$. Here we extend this result to show that essentially all standard… ▽ More In an earlier paper, we showed that a large class of fast recursive matrix multiplication algorithms is stable in a normwise sense, and that in fact if multiplication of $n$-by-$n$ matrices can be done by any algorithm in $O(n^{ω+ η})$ operations for any $η> 0$, then it can be done stably in $O(n^{ω+ η})$ operations for any $η> 0$. Here we extend this result to show that essentially all standard linear algebra operations, including LU decomposition, QR decomposition, linear equation solving, matrix inversion, solving least squares problems, (generalized) eigenvalue problems and the singular value decomposition can also be done stably (in a normwise sense) in $O(n^{ω+ η})$ operations. △ Less

Submitted 28 August, 2007; v1 submitted 10 December, 2006; originally announced December 2006.

Comments: 26 pages; final version; to appear in Numerische Mathematik

MSC Class: 65Y20; 65F30; 65G50; 68Q17; 68Q25

Journal ref: Numer. Math. 108 (2007), no. 1, 59-91

arXiv:math/0603207 [pdf, ps, other]

doi 10.1007/s00211-007-0061-6

Fast matrix multiplication is stable

Authors: James Demmel, Ioana Dumitriu, Olga Holtz, Robert Kleinberg

Abstract: We perform forward error analysis for a large class of recursive matrix multiplication algorithms in the spirit of [D. Bini and G. Lotti, Stability of fast algorithms for matrix multiplication, Numer. Math. 36 (1980), 63--72]. As a consequence of our analysis, we show that the exponent of matrix multiplication (the optimal running time) can be achieved by numerically stable algorithms. We also s… ▽ More We perform forward error analysis for a large class of recursive matrix multiplication algorithms in the spirit of [D. Bini and G. Lotti, Stability of fast algorithms for matrix multiplication, Numer. Math. 36 (1980), 63--72]. As a consequence of our analysis, we show that the exponent of matrix multiplication (the optimal running time) can be achieved by numerically stable algorithms. We also show that new group-theoretic algorithms proposed in [H. Cohn, and C. Umans, A group-theoretic approach to fast matrix multiplication, FOCS 2003, 438--449] and [H. Cohn, R. Kleinberg, B. Szegedy and C. Umans, Group-theoretic algorithms for matrix multiplication, FOCS 2005, 379--388] are all included in the class of algorithms to which our analysis applies, and are therefore numerically stable. We perform detailed error analysis for three specific fast group-theoretic algorithms. △ Less

Submitted 7 December, 2006; v1 submitted 8 March, 2006; originally announced March 2006.

Comments: 19 pages; final version, expanded and updated to reflect referees' remarks; to appear in Numerische Mathematik

MSC Class: 65Y20; 65F30; 65G50; 68Q17; 68W40; 20C05; 20K01; 16S34; 43A30; 65T50

Journal ref: Numer. Math. 106 (2007), no. 2, 199-224

arXiv:math/0508350 [pdf, ps, other]

Toward accurate polynomial evaluation in rounded arithmetic

Authors: James Demmel, Ioana Dumitriu, Olga Holtz

Abstract: Given a multivariate real (or complex) polynomial $p$ and a domain $\cal D$, we would like to decide whether an algorithm exists to evaluate $p(x)$ accurately for all $x \in {\cal D}$ using rounded real (or complex) arithmetic. Here ``accurately'' means with relative error less than 1, i.e., with some correct leading digits. The answer depends on the model of rounded arithmetic: We assume that f… ▽ More Given a multivariate real (or complex) polynomial $p$ and a domain $\cal D$, we would like to decide whether an algorithm exists to evaluate $p(x)$ accurately for all $x \in {\cal D}$ using rounded real (or complex) arithmetic. Here ``accurately'' means with relative error less than 1, i.e., with some correct leading digits. The answer depends on the model of rounded arithmetic: We assume that for any arithmetic operator $op(a,b)$, for example $a+b$ or $a \cdot b$, its computed value is $op(a,b) \cdot (1 + δ)$, where $| δ|$ is bounded by some constant $ε$ where $0 < ε\ll 1$, but $δ$ is otherwise arbitrary. This model is the traditional one used to analyze the accuracy of floating point algorithms.Our ultimate goal is to establish a decision procedure that, for any $p$ and $\cal D$, either exhibits an accurate algorithm or proves that none exists. In contrast to the case where numbers are stored and manipulated as finite bit strings (e.g., as floating point numbers or rational numbers) we show that some polynomials $p$ are impossible to evaluate accurately. The existence of an accurate algorithm will depend not just on $p$ and $\cal D$, but on which arithmetic operators and which constants are are available and whether branching is permitted. Toward this goal, we present necessary conditions on $p$ for it to be accurately evaluable on open real or complex domains ${\cal D}$. We also give sufficient conditions, and describe progress toward a complete decision procedure. We do present a complete decision procedure for homogeneous polynomials $p$ with integer coefficients, ${\cal D} = \C^n$, and using only the arithmetic operations $+$, $-$ and $\cdot$. △ Less

Submitted 18 January, 2006; v1 submitted 18 August, 2005; originally announced August 2005.

Comments: 54 pages, 6 figures; refereed version; to appear in Foundations of Computational Mathematics: Santander 2005, Cambridge University Press, March 2006

MSC Class: 65Y20; 68Q05; 68Q25; 65F30; 68W40; 68W25

Journal ref: in Foundations of Computational Mathematics: Santander 2005 (L. Pardo et al, eds.) Cambridge University Press, 2006, pp. 36-105

Showing 1–14 of 14 results for author: Holtz, O