-
Communication Bounds for Convolutional Neural Networks
Authors:
Anthony Chen,
James Demmel,
Grace Dinh,
Mason Haberle,
Olga Holtz
Abstract:
Convolutional neural networks (CNNs) are important in a wide variety of machine learning tasks and applications, so optimizing their performance is essential. Moving words of data between levels of a memory hierarchy or between processors on a network is much more expensive than the cost of arithmetic, so minimizing communication is critical to optimizing performance. In this paper, we present new…
▽ More
Convolutional neural networks (CNNs) are important in a wide variety of machine learning tasks and applications, so optimizing their performance is essential. Moving words of data between levels of a memory hierarchy or between processors on a network is much more expensive than the cost of arithmetic, so minimizing communication is critical to optimizing performance. In this paper, we present new lower bounds on data movement for mixed precision convolutions in both single-processor and parallel distributed memory models, as well as algorithms that outperform current implementations such as Im2Col. We obtain performance figures using GEMMINI, a machine learning accelerator, where our tiling provides improvements between 13% and 150% over a vendor supplied algorithm.
△ Less
Submitted 18 April, 2022;
originally announced April 2022.
-
Sparsifying the Operators of Fast Matrix Multiplication Algorithms
Authors:
Gal Beniamini,
Nathan Cheng,
Olga Holtz,
Elaye Karstadt,
Oded Schwartz
Abstract:
Fast matrix multiplication algorithms may be useful, provided that their running time is good in practice. Particularly, the leading coefficient of their arithmetic complexity needs to be small. Many sub-cubic algorithms have large leading coefficients, rendering them impractical. Karstadt and Schwartz (SPAA'17, JACM'20) demonstrated how to reduce these coefficients by sparsifying an algorithm's b…
▽ More
Fast matrix multiplication algorithms may be useful, provided that their running time is good in practice. Particularly, the leading coefficient of their arithmetic complexity needs to be small. Many sub-cubic algorithms have large leading coefficients, rendering them impractical. Karstadt and Schwartz (SPAA'17, JACM'20) demonstrated how to reduce these coefficients by sparsifying an algorithm's bilinear operator. Unfortunately, the problem of finding optimal sparsifications is NP-Hard.
We obtain three new methods to this end, and apply them to existing fast matrix multiplication algorithms, thus improving their leading coefficients. These methods have an exponential worst case running time, but run fast in practice and improve the performance of many fast matrix multiplication algorithms. Two of the methods are guaranteed to produce leading coefficients that, under some assumptions, are optimal.
△ Less
Submitted 9 August, 2020;
originally announced August 2020.
-
Graph Expansion Analysis for Communication Costs of Fast Rectangular Matrix Multiplication
Authors:
Grey Ballard,
James Demmel,
Olga Holtz,
Benjamin Lipshitz,
Oded Schwartz
Abstract:
Graph expansion analysis of computational DAGs is useful for obtaining communication cost lower bounds where previous methods, such as geometric embedding, are not applicable. This has recently been demonstrated for Strassen's and Strassen-like fast square matrix multiplication algorithms. Here we extend the expansion analysis approach to fast algorithms for rectangular matrix multiplication, obta…
▽ More
Graph expansion analysis of computational DAGs is useful for obtaining communication cost lower bounds where previous methods, such as geometric embedding, are not applicable. This has recently been demonstrated for Strassen's and Strassen-like fast square matrix multiplication algorithms. Here we extend the expansion analysis approach to fast algorithms for rectangular matrix multiplication, obtaining a new class of communication cost lower bounds. These apply, for example to the algorithms of Bini et al. (1979) and the algorithms of Hopcroft and Kerr (1971). Some of our bounds are proved to be optimal.
△ Less
Submitted 10 September, 2012;
originally announced September 2012.
-
Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds
Authors:
Grey Ballard,
James Demmel,
Olga Holtz,
Benjamin Lipshitz,
Oded Schwartz
Abstract:
A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's fast matrix multiplication (Ba…
▽ More
A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz, 2012). Both algorithms scale perfectly, but only up to some number of processors where the inter-processor communication no longer scales.
We obtain a memory-independent communication cost lower bound on classical and Strassen-based distributed-memory matrix multiplication algorithms. These bounds imply that no classical or Strassen-based parallel matrix multiplication algorithm can strongly scale perfectly beyond the ranges already attained by the two parallel algorithms mentioned above. The memory-independent bounds and the strong scaling bounds generalize to other algorithms.
△ Less
Submitted 14 February, 2012;
originally announced February 2012.
-
Communication-Optimal Parallel Algorithm for Strassen's Matrix Multiplication
Authors:
Grey Ballard,
James Demmel,
Olga Holtz,
Benjamin Lipshitz,
Oded Schwartz
Abstract:
Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice.
A criti…
▽ More
Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice.
A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA'11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range.
Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n=94080, where the number of nodes ranges from 49 to 7203.
Our parallelization approach generalizes to other fast matrix multiplication algorithms.
△ Less
Submitted 14 February, 2012;
originally announced February 2012.
-
Graph Expansion and Communication Costs of Fast Matrix Multiplication
Authors:
Grey Ballard,
James Demmel,
Olga Holtz,
Oded Schwartz
Abstract:
The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain first lower bounds on their communication costs.
In the sequential case, where the processor has a fast memory of size $M$, too small to…
▽ More
The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain first lower bounds on their communication costs.
In the sequential case, where the processor has a fast memory of size $M$, too small to store three $n$-by-$n$ matrices, the lower bound on the number of words moved between fast and slow memory is, for many of the matrix multiplication algorithms, $Ω((\frac{n}{\sqrt M})^{ω_0}\cdot M)$, where $ω_0$ is the exponent in the arithmetic count (e.g., $ω_0 = \lg 7$ for Strassen, and $ω_0 = 3$ for conventional matrix multiplication). With $p$ parallel processors, each with fast memory of size $M$, the lower bound is $p$ times smaller.
These bounds are attainable both for sequential and for parallel algorithms and hence optimal. These bounds can also be attained by many fast algorithms in linear algebra (e.g., algorithms for LU, QR, and solving the Sylvester equation).
△ Less
Submitted 8 September, 2011;
originally announced September 2011.
-
Computational Complexity and Numerical Stability of Linear Problems
Authors:
Olga Holtz,
Noam Shomron
Abstract:
We survey classical and recent developments in numerical linear algebra, focusing on two issues: computational complexity, or arithmetic costs, and numerical stability, or performance under roundoff error. We present a brief account of the algebraic complexity theory as well as the general error analysis for matrix multiplication and related problems. We emphasize the central role played by the…
▽ More
We survey classical and recent developments in numerical linear algebra, focusing on two issues: computational complexity, or arithmetic costs, and numerical stability, or performance under roundoff error. We present a brief account of the algebraic complexity theory as well as the general error analysis for matrix multiplication and related problems. We emphasize the central role played by the matrix multiplication problem and discuss historical and modern approaches to its solution.
△ Less
Submitted 11 September, 2009; v1 submitted 3 June, 2009;
originally announced June 2009.
-
Minimizing Communication in Linear Algebra
Authors:
Grey Ballard,
James Demmel,
Olga Holtz,
Oded Schwartz
Abstract:
In 1981 Hong and Kung proved a lower bound on the amount of communication needed to perform dense, matrix-multiplication using the conventional $O(n^3)$ algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and extended it to the parallel case. In both cases the lower bound may be expressed as $Ω$(#ar…
▽ More
In 1981 Hong and Kung proved a lower bound on the amount of communication needed to perform dense, matrix-multiplication using the conventional $O(n^3)$ algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and extended it to the parallel case. In both cases the lower bound may be expressed as $Ω$(#arithmetic operations / $\sqrt{M}$), where M is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, $LDL^T$ factorization, QR factorization, algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph theoretic problems.
We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and the SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain.
△ Less
Submitted 15 May, 2009;
originally announced May 2009.
-
Communication-optimal Parallel and Sequential Cholesky Decomposition
Authors:
Grey Ballard,
James Demmel,
Olga Holtz,
Oded Schwartz
Abstract:
Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower…
▽ More
Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional (O(n^3)) matrix multiplication to Cholesky factorization, which is used for solving dense symmetric positive definite linear systems. Second, we compare the costs of various Cholesky decomposition implementations to these lower bounds and identify the algorithms and data structures that attain them. In the sequential case, we consider both the two-level and hierarchical memory models. Combined with prior results in [13, 14, 15], this gives a set of communication-optimal algorithms for O(n^3) implementations of the three basic factorizations of dense linear algebra: LU with pivoting, QR and Cholesky. But it goes beyond this prior work on sequential LU by optimizing communication for any number of levels of memory hierarchy.
△ Less
Submitted 12 April, 2010; v1 submitted 15 February, 2009;
originally announced February 2009.
-
Compressive sensing: a paradigm shift in signal processing
Authors:
Olga Holtz
Abstract:
We survey a new paradigm in signal processing known as "compressive sensing". Contrary to old practices of data acquisition and reconstruction based on the Shannon-Nyquist sampling principle, the new theory shows that it is possible to reconstruct images or signals of scientific interest accurately and even exactly from a number of samples which is far smaller than the desired resolution of the…
▽ More
We survey a new paradigm in signal processing known as "compressive sensing". Contrary to old practices of data acquisition and reconstruction based on the Shannon-Nyquist sampling principle, the new theory shows that it is possible to reconstruct images or signals of scientific interest accurately and even exactly from a number of samples which is far smaller than the desired resolution of the image/signal, e.g., the number of pixels in the image. This new technique draws from results in several fields of mathematics, including algebra, optimization, probability theory, and harmonic analysis. We will discuss some of the key mathematical ideas behind compressive sensing, as well as its implications to other fields: numerical analysis, information theory, theoretical computer science, and engineering.
△ Less
Submitted 16 December, 2008;
originally announced December 2008.
-
Accurate and Efficient Expression Evaluation and Linear Algebra
Authors:
James Demmel,
Ioana Dumitriu,
Olga Holtz,
Plamen Koev
Abstract:
We survey and unify recent results on the existence of accurate algorithms for evaluating multivariate polynomials, and more generally for accurate numerical linear algebra with structured matrices. By "accurate" we mean that the computed answer has relative error less than 1, i.e., has some correct leading digits. We also address efficiency, by which we mean algorithms that run in polynomial ti…
▽ More
We survey and unify recent results on the existence of accurate algorithms for evaluating multivariate polynomials, and more generally for accurate numerical linear algebra with structured matrices. By "accurate" we mean that the computed answer has relative error less than 1, i.e., has some correct leading digits. We also address efficiency, by which we mean algorithms that run in polynomial time in the size of the input. Our results will depend strongly on the model of arithmetic: Most of our results will use the so-called Traditional Model (TM). We give a set of necessary and sufficient conditions to decide whether a high accuracy algorithm exists in the TM, and describe progress toward a decision procedure that will take any problem and provide either a high accuracy algorithm or a proof that none exists. When no accurate algorithm exists in the TM, it is natural to extend the set of available accurate operations by a library of additional operations, such as $x+y+z$, dot products, or indeed any enumerable set which could then be used to build further accurate algorithms. We show how our accurate algorithms and decision procedure for finding them extend to this case. Finally, we address other models of arithmetic, and the relationship between (im)possibility in the TM and (in)efficient algorithms operating on numbers represented as bit strings.
△ Less
Submitted 24 December, 2007;
originally announced December 2007.
-
Fast linear algebra is stable
Authors:
James Demmel,
Ioana Dumitriu,
Olga Holtz
Abstract:
In an earlier paper, we showed that a large class of fast recursive matrix multiplication algorithms is stable in a normwise sense, and that in fact if multiplication of $n$-by-$n$ matrices can be done by any algorithm in $O(n^{ω+ η})$ operations for any $η> 0$, then it can be done stably in $O(n^{ω+ η})$ operations for any $η> 0$. Here we extend this result to show that essentially all standard…
▽ More
In an earlier paper, we showed that a large class of fast recursive matrix multiplication algorithms is stable in a normwise sense, and that in fact if multiplication of $n$-by-$n$ matrices can be done by any algorithm in $O(n^{ω+ η})$ operations for any $η> 0$, then it can be done stably in $O(n^{ω+ η})$ operations for any $η> 0$. Here we extend this result to show that essentially all standard linear algebra operations, including LU decomposition, QR decomposition, linear equation solving, matrix inversion, solving least squares problems, (generalized) eigenvalue problems and the singular value decomposition can also be done stably (in a normwise sense) in $O(n^{ω+ η})$ operations.
△ Less
Submitted 28 August, 2007; v1 submitted 10 December, 2006;
originally announced December 2006.
-
Fast matrix multiplication is stable
Authors:
James Demmel,
Ioana Dumitriu,
Olga Holtz,
Robert Kleinberg
Abstract:
We perform forward error analysis for a large class of recursive matrix multiplication algorithms in the spirit of [D. Bini and G. Lotti, Stability of fast algorithms for matrix multiplication, Numer. Math. 36 (1980), 63--72]. As a consequence of our analysis, we show that the exponent of matrix multiplication (the optimal running time) can be achieved by numerically stable algorithms. We also s…
▽ More
We perform forward error analysis for a large class of recursive matrix multiplication algorithms in the spirit of [D. Bini and G. Lotti, Stability of fast algorithms for matrix multiplication, Numer. Math. 36 (1980), 63--72]. As a consequence of our analysis, we show that the exponent of matrix multiplication (the optimal running time) can be achieved by numerically stable algorithms. We also show that new group-theoretic algorithms proposed in [H. Cohn, and C. Umans, A group-theoretic approach to fast matrix multiplication, FOCS 2003, 438--449] and [H. Cohn, R. Kleinberg, B. Szegedy and C. Umans, Group-theoretic algorithms for matrix multiplication, FOCS 2005, 379--388] are all included in the class of algorithms to which our analysis applies, and are therefore numerically stable. We perform detailed error analysis for three specific fast group-theoretic algorithms.
△ Less
Submitted 7 December, 2006; v1 submitted 8 March, 2006;
originally announced March 2006.
-
Toward accurate polynomial evaluation in rounded arithmetic
Authors:
James Demmel,
Ioana Dumitriu,
Olga Holtz
Abstract:
Given a multivariate real (or complex) polynomial $p$ and a domain $\cal D$, we would like to decide whether an algorithm exists to evaluate $p(x)$ accurately for all $x \in {\cal D}$ using rounded real (or complex) arithmetic. Here ``accurately'' means with relative error less than 1, i.e., with some correct leading digits. The answer depends on the model of rounded arithmetic: We assume that f…
▽ More
Given a multivariate real (or complex) polynomial $p$ and a domain $\cal D$, we would like to decide whether an algorithm exists to evaluate $p(x)$ accurately for all $x \in {\cal D}$ using rounded real (or complex) arithmetic. Here ``accurately'' means with relative error less than 1, i.e., with some correct leading digits. The answer depends on the model of rounded arithmetic: We assume that for any arithmetic operator $op(a,b)$, for example $a+b$ or $a \cdot b$, its computed value is $op(a,b) \cdot (1 + δ)$, where $| δ|$ is bounded by some constant $ε$ where $0 < ε\ll 1$, but $δ$ is otherwise arbitrary. This model is the traditional one used to analyze the accuracy of floating point algorithms.Our ultimate goal is to establish a decision procedure that, for any $p$ and $\cal D$, either exhibits an accurate algorithm or proves that none exists. In contrast to the case where numbers are stored and manipulated as finite bit strings (e.g., as floating point numbers or rational numbers) we show that some polynomials $p$ are impossible to evaluate accurately. The existence of an accurate algorithm will depend not just on $p$ and $\cal D$, but on which arithmetic operators and which constants are are available and whether branching is permitted. Toward this goal, we present necessary conditions on $p$ for it to be accurately evaluable on open real or complex domains ${\cal D}$. We also give sufficient conditions, and describe progress toward a complete decision procedure. We do present a complete decision procedure for homogeneous polynomials $p$ with integer coefficients, ${\cal D} = \C^n$, and using only the arithmetic operations $+$, $-$ and $\cdot$.
△ Less
Submitted 18 January, 2006; v1 submitted 18 August, 2005;
originally announced August 2005.