Search | arXiv e-print repository

Dissecting Query-Key Interaction in Vision Transformers

Authors: Xu Pan, Aaron Philip, Ziqian Xie, Odelia Schwartz

Abstract: Self-attention in vision transformers is often thought to perform perceptual grouping where tokens attend to other tokens with similar embeddings, which could correspond to semantically similar features of an object. However, attending to dissimilar tokens can be beneficial by providing contextual information. We propose to use the Singular Value Decomposition to dissect the query-key interaction (i.e., ${\textbf{W}_q}^\top\textbf{W}_k$). We find that early layers attend more to similar tokens, while late layers show increased attention to dissimilar tokens, providing evidence corresponding to perceptual grouping and contextualization, respectively. Many of these interactions between features represented by singular vectors are interpretable and semantic, such as attention between relevant objects, between parts of an object, or between the foreground and background. This offers a novel perspective on interpreting the attention mechanism, which contributes to understanding how transformer models utilize context and salient features when processing images.

Submitted 26 May, 2024; v1 submitted 4 April, 2024; originally announced May 2024.
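
As a rough illustration of what dissecting the query-key interaction with the SVD means, the attention logit between two tokens can be expanded over singular vectors. This is a sketch under stated assumptions rather than the authors' exact formulation: the token embeddings $x_i$, $x_j$, the convention $q_i = \textbf{W}_q x_i$, $k_j = \textbf{W}_k x_j$, and the symbols $u_n$, $v_n$, $\sigma_n$ are introduced here for illustration, and biases and the softmax scaling factor are ignored.

$$
q_i^\top k_j = x_i^\top {\textbf{W}_q}^\top \textbf{W}_k \, x_j,
\qquad
{\textbf{W}_q}^\top \textbf{W}_k = \sum_n \sigma_n \, u_n v_n^\top
\;\;\Longrightarrow\;\;
q_i^\top k_j = \sum_n \sigma_n \, (x_i^\top u_n)(v_n^\top x_j).
$$

Under these assumptions, each singular mode $n$ contributes to the logit when the query token aligns with $u_n$ and the key token aligns with $v_n$. Modes where $u_n$ and $v_n$ point in similar directions favor attention between similar tokens, while modes where they differ favor attention between dissimilar tokens.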

arXiv:2108.01548 [pdf, other]

Inference via Sparse Coding in a Hierarchical Vision Model

Authors: Joshua Bowren, Luis Sanchez-Giraldo, Odelia Schwartz

Abstract: Sparse coding has been incorporated in models of the visual cortex for its computational advantages and connection to biology. But how the level of sparsity contributes to performance on visual tasks is not well understood. In this work, sparse coding is integrated into an existing hierarchical V2 model (Hosoya and Hyvärinen, 2015), replacing its independent component analysis (ICA) with explicit sparse coding in which the degree of sparsity can be controlled. After training, the sparse coding basis functions with a higher degree of sparsity resembled qualitatively different structures, such as curves and corners. The contributions of the models were assessed with image classification tasks, specifically tasks associated with mid-level vision including figure-ground classification, texture classification, and angle prediction between two line stimuli. In addition, the models were assessed in comparison to a texture sensitivity measure that has been reported in V2 (Freeman et al., 2013), and a deleted-region inference task. The results from the experiments show that while sparse coding performed worse than ICA at classifying images, only sparse coding was able to better match the texture sensitivity level of V2 and infer deleted image regions, both by increasing the degree of sparsity in sparse coding. Higher degrees of sparsity allowed for inference over larger deleted image regions. The mechanism that allows for this inference capability in sparse coding is described here.

Submitted 16 January, 2022; v1 submitted 3 August, 2021; originally announced August 2021.

Comments: To appear in Journal of Vision (JoV)

arXiv:2008.03759 [pdf, other]

Sparsifying the Operators of Fast Matrix Multiplication Algorithms

Authors: Gal Beniamini, Nathan Cheng, Olga Holtz, Elaye Karstadt, Oded Schwartz

Abstract: Fast matrix multiplication algorithms may be useful, provided that their running time is good in practice. Particularly, the leading coefficient of their arithmetic complexity needs to be small. Many sub-cubic algorithms have large leading coefficients, rendering them impractical. Karstadt and Schwartz (SPAA'17, JACM'20) demonstrated how to reduce these coefficients by sparsifying an algorithm's bilinear operator. Unfortunately, the problem of finding optimal sparsifications is NP-Hard. We obtain three new methods to this end, and apply them to existing fast matrix multiplication algorithms, thus improving their leading coefficients. These methods have an exponential worst case running time, but run fast in practice and improve the performance of many fast matrix multiplication algorithms. Two of the methods are guaranteed to produce leading coefficients that, under some assumptions, are optimal.

Submitted 9 August, 2020; originally announced August 2020.

ACM Class: F.2.1; F.2.1

arXiv:2005.14150 [pdf, other]

Network Partitioning and Avoidable Contention

Authors: Yishai Oltchik, Oded Schwartz

Abstract: Network contention frequently dominates the run time of parallel algorithms and limits scaling performance. Most previous studies mitigate or eliminate contention by utilizing one of several approaches: communication-minimizing algorithms; hotspot-avoiding routing schemes; topology-aware task mapping; or improving global network properties, such as bisection bandwidth, edge-expansion, partitioning, and network diameter. In practice, parallel jobs often use only a fraction of a host system. How do processor allocation policies affect contention within a partition? We utilize edge-isoperimetric analysis of network graphs to determine whether a network partition has optimal internal bisection. Increasing the bisection allows a more efficient use of the network resources, decreasing or completely eliminating the link contention. We first study torus networks and characterize partition geometries that maximize internal bisection bandwidth. We examine the allocation policies of Mira and JUQUEEN, the two largest publicly-accessible Blue Gene/Q torus-based supercomputers. Our analysis demonstrates that the bisection bandwidth of their current partitions can often be improved by changing the partitions' geometries. These changes can yield up to a 2x speedup for contention-bound workloads. Benchmarking experiments validate the predictions. Our analysis applies to allocation policies of other networks.

Submitted 28 May, 2020; originally announced May 2020.

Comments: 10 pages, 7 figures

ACM Class: C.2.1

arXiv:1806.02888 [pdf, other]

Correspondence of Deep Neural Networks and the Brain for Visual Textures

Authors: Md Nasir Uddin Laskar, Luis G Sanchez Giraldo, Odelia Schwartz

Abstract: Deep convolutional neural networks (CNNs) trained on objects and scenes have shown an intriguing ability to predict some response properties of visual cortical neurons. However, the factors and computations that give rise to such ability, and the role of intermediate processing stages in explaining changes that develop across areas of the cortical hierarchy, are poorly understood. We focused on the sensitivity to textures as a paradigmatic example, since recent neurophysiology experiments provide rich data pointing to texture sensitivity in secondary but not primary visual cortex. We developed a quantitative approach for selecting a subset of the neural unit population from the CNN that best describes the brain neural recordings. We found that the first two layers of the CNN showed qualitative and quantitative correspondence to the cortical data across a number of metrics. This compatibility was reduced for the architecture alone rather than the learned weights, for some other related hierarchical models, and only mildly in the absence of a nonlinear computation akin to local divisive normalization. Our results show that the CNN class of model is effective for capturing changes that develop across early areas of cortex, and has the potential to facilitate understanding of the computations that give rise to hierarchical processing in the brain.

Submitted 7 June, 2018; originally announced June 2018.

arXiv:1806.01823 [pdf, ps, other]

Integrating Flexible Normalization into Mid-Level Representations of Deep Convolutional Neural Networks

Authors: Luis Gonzalo Sanchez Giraldo, Odelia Schwartz

Abstract: Deep convolutional neural networks (CNNs) are becoming increasingly popular models to predict neural responses in visual cortex. However, contextual effects, which are prevalent in neural processing and in perception, are not explicitly handled by current CNNs, including those used for neural prediction. In primary visual cortex, neural responses are modulated by stimuli spatially surrounding the classical receptive field in rich ways. These effects have been modeled with divisive normalization approaches, including flexible models, where spatial normalization is recruited only to the degree responses from center and surround locations are deemed statistically dependent. We propose a flexible normalization model applied to mid-level representations of deep CNNs as a tractable way to study contextual normalization mechanisms in mid-level cortical areas. This approach captures non-trivial spatial dependencies among mid-level features in CNNs, such as those present in textures and other visual stimuli, that arise from geometrically tiling high-order features. We expect that the proposed approach can make predictions about when spatial normalization might be recruited in mid-level cortical areas. We also expect this approach to be useful as part of the CNN toolkit, therefore going beyond more restrictive fixed forms of normalization.

Submitted 24 December, 2018; v1 submitted 5 June, 2018; originally announced June 2018.

arXiv:1712.01978 [pdf, other]

doi 10.1016/j.jcp.2018.07.008

High-order Discretization of a Gyrokinetic Vlasov Model in Edge Plasma Geometry

Authors: Milo R. Dorr, Phillip Colella, Mikhail A. Dorf, Debojyoti Ghosh, Jeffrey A. F. Hittinger, Peter O. Schwartz

Abstract: We present a high-order spatial discretization of a continuum gyrokinetic Vlasov model in axisymmetric tokamak edge plasma geometries. Such models describe the phase space advection of plasma species distribution functions in the absence of collisions. The gyrokinetic model is posed in a four-dimensional phase space, upon which a grid is imposed when discretized. To mitigate the computational cost associated with high-dimensional grids, we employ a high-order discretization to reduce the grid size needed to achieve a given level of accuracy relative to lower-order methods. Strong anisotropy induced by the magnetic field motivates the use of mapped coordinate grids aligned with magnetic flux surfaces. The natural partitioning of the edge geometry by the separatrix between the closed and open field line regions leads to the consideration of multiple mapped blocks, in what is known as a mapped multiblock (MMB) approach. We describe the specialization of a more general formalism that we have developed for the construction of high-order, finite-volume discretizations on MMB grids, yielding the accurate evaluation of the gyrokinetic Vlasov operator, the metric factors resulting from the MMB coordinate mappings, and the interaction of blocks at adjacent boundaries. Our conservative formulation of the gyrokinetic Vlasov model incorporates the fact that the phase space velocity has zero divergence, which must be preserved discretely to avoid truncation error accumulation. We describe an approach for the discrete evaluation of the gyrokinetic phase space velocity that preserves the divergence-free property to machine precision.

Submitted 5 December, 2017; originally announced December 2017.

MSC Class: 65M06; 86A10; 76N15

arXiv:1607.06303 [pdf, ps, other]

High-Performance Algorithms for Computing the Sign Function of Triangular Matrices

Authors: Vadim Stotland, Oded Schwartz, Sivan Toledo

Abstract: Algorithms and implementations for computing the sign function of a triangular matrix are fundamental building blocks in algorithms for computing the sign of arbitrary square real or complex matrices. We present novel recursive and cache efficient algorithms that are based on Higham's stabilized specialization of Parlett's substitution algorithm for computing the sign of a triangular matrix. We show that the new recursive algorithms are asymptotically optimal in terms of the number of cache misses that they generate. One of the novel algorithms that we present performs more arithmetic than the non-recursive version, but this allows it to benefit from calling highly-optimized matrix-multiplication routines; the other performs the same number of operations as the non-recursive version, but it uses custom computational kernels instead. We present implementations of both, as well as a cache-efficient implementation of a block version of Parlett's algorithm. Our experiments show that the blocked and recursive versions are much faster than the previous algorithms, and that the inertia strongly influences their relative performance, as predicted by our analysis.

Submitted 21 July, 2016; originally announced July 2016.

Comments: 18 pages, 4 figures

arXiv:1603.05627 [pdf, ps, other]

Hypergraph Partitioning for Sparse Matrix-Matrix Multiplication

Authors: Grey Ballard, Alex Druinsky, Nicholas Knight, Oded Schwartz

Abstract: We propose a fine-grained hypergraph model for sparse matrix-matrix multiplication (SpGEMM), a key computational kernel in scientific computing and data analysis whose performance is often communication bound. This model correctly describes both the interprocessor communication volume along a critical path in a parallel computation and also the volume of data moving through the memory hierarchy in a sequential computation. We show that identifying a communication-optimal algorithm for particular input matrices is equivalent to solving a hypergraph partitioning problem. Our approach is sparsity dependent, meaning that we seek the best algorithm for the given input matrices. In addition to our (3D) fine-grained model, we also propose coarse-grained 1D and 2D models that correspond to simpler SpGEMM algorithms. We explore the relations between our models theoretically, and we study their performance experimentally in the context of three applications that use SpGEMM as a key computation. For each application, we find that at least one coarse-grained model is as communication efficient as the fine-grained model. We also observe that different applications have affinities for different algorithms. Our results demonstrate that hypergraphs are an accurate model for reasoning about the communication costs of SpGEMM as well as a practical tool for exploring the SpGEMM algorithm design space.

Submitted 17 March, 2016; originally announced March 2016.

arXiv:1510.00844 [pdf, other]

doi 10.1137/15M104253X

Exploiting Multiple Levels of Parallelism in Sparse Matrix-Matrix Multiplication

Authors: Ariful Azad, Grey Ballard, Aydin Buluc, James Demmel, Laura Grigori, Oded Schwartz, Sivan Toledo, Samuel Williams

Abstract: Sparse matrix-matrix multiplication (or SpGEMM) is a key primitive for many high-performance graph algorithms as well as for some linear solvers, such as algebraic multigrid. The scaling of existing parallel implementations of SpGEMM is heavily bound by communication. Even though 3D (or 2.5D) algorithms have been proposed and theoretically analyzed in the flat MPI model on Erdos-Renyi matrices, those algorithms had not been implemented in practice and their complexities had not been analyzed for the general case. In this work, we present the first ever implementation of the 3D SpGEMM formulation that also exploits multiple (intra-node and inter-node) levels of parallelism, achieving significant speedups over the state-of-the-art publicly available codes at all levels of concurrency. We extensively evaluate our implementation and identify bottlenecks that should be subject to further research.

Submitted 16 November, 2016; v1 submitted 3 October, 2015; originally announced October 2015.

Journal ref: SIAM Journal of Scientific Computing, Volume 38, Number 6, pp. C624-C651, 2016

arXiv:1209.2184 [pdf, other]

doi 10.1007/978-3-642-34862-4_2

Graph Expansion Analysis for Communication Costs of Fast Rectangular Matrix Multiplication

Authors: Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, Oded Schwartz

Abstract: Graph expansion analysis of computational DAGs is useful for obtaining communication cost lower bounds where previous methods, such as geometric embedding, are not applicable. This has recently been demonstrated for Strassen's and Strassen-like fast square matrix multiplication algorithms. Here we extend the expansion analysis approach to fast algorithms for rectangular matrix multiplication, obtaining a new class of communication cost lower bounds. These apply, for example, to the algorithms of Bini et al. (1979) and the algorithms of Hopcroft and Kerr (1971). Some of our bounds are proved to be optimal.

Submitted 10 September, 2012; originally announced September 2012.

Journal ref: Design and Analysis of Algorithms Volume 7659, 2012, pp 13-36

arXiv:1208.4405 [pdf, other]

Delay-Doppler Channel Estimation with Almost Linear Complexity

Authors: Alexander Fish, Shamgar Gurevich, Ronny Hadani, Akbar Sayeed, Oded Schwartz

Abstract: A fundamental task in wireless communication is Channel Estimation: Compute the channel parameters a signal undergoes while traveling from a transmitter to a receiver. In the case of a delay-Doppler channel, a widely used method is the Matched Filter algorithm. It uses a pseudo-random sequence of length $N$, and, in the case of non-trivial relative velocity between transmitter and receiver, its computational complexity is $O(N^2 \log N)$. In this paper we introduce a novel approach of designing sequences that allow faster channel estimation. Using group representation techniques we construct sequences, which enable us to introduce a new algorithm, called the flag method, that significantly improves the matched filter algorithm. The flag method finds the channel parameters in $O(mN \log N)$ operations, for a channel of sparsity $m$. We discuss applications of the flag method to GPS, radar systems, and mobile communication as well.

Submitted 23 August, 2012; v1 submitted 21 August, 2012; originally announced August 2012.

Comments: 11 pages

arXiv:1202.3177 [pdf, other]

Strong Scaling of Matrix Multiplication Algorithms and Memory-Independent Communication Lower Bounds

Authors: Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, Oded Schwartz

Abstract: A parallel algorithm has perfect strong scaling if its running time on P processors is linear in 1/P, including all communication costs. Distributed-memory parallel algorithms for matrix multiplication with perfect strong scaling have only recently been found. One is based on classical matrix multiplication (Solomonik and Demmel, 2011), and one is based on Strassen's fast matrix multiplication (Ballard, Demmel, Holtz, Lipshitz, and Schwartz, 2012). Both algorithms scale perfectly, but only up to some number of processors where the inter-processor communication no longer scales. We obtain a memory-independent communication cost lower bound on classical and Strassen-based distributed-memory matrix multiplication algorithms. These bounds imply that no classical or Strassen-based parallel matrix multiplication algorithm can strongly scale perfectly beyond the ranges already attained by the two parallel algorithms mentioned above. The memory-independent bounds and the strong scaling bounds generalize to other algorithms.

Submitted 14 February, 2012; originally announced February 2012.

Comments: 4 pages, 1 figure

MSC Class: 68W10; 68W40 ACM Class: F.2.1

arXiv:1202.3173 [pdf, other]

Communication-Optimal Parallel Algorithm for Strassen's Matrix Multiplication

Authors: Grey Ballard, James Demmel, Olga Holtz, Benjamin Lipshitz, Oded Schwartz

Abstract: Parallel matrix multiplication is one of the most studied fundamental problems in distributed and high performance computing. We obtain a new parallel algorithm that is based on Strassen's fast matrix multiplication and minimizes communication. The algorithm outperforms all known parallel matrix multiplication algorithms, classical and Strassen-based, both asymptotically and in practice. A critical bottleneck in parallelizing Strassen's algorithm is the communication between the processors. Ballard, Demmel, Holtz, and Schwartz (SPAA'11) prove lower bounds on these communication costs, using expansion properties of the underlying computation graph. Our algorithm matches these lower bounds, and so is communication-optimal. It exhibits perfect strong scaling within the maximum possible range. Benchmarking our implementation on a Cray XT4, we obtain speedups over classical and Strassen-based algorithms ranging from 24% to 184% for a fixed matrix dimension n=94080, where the number of nodes ranges from 49 to 7203. Our parallelization approach generalizes to other fast matrix multiplication algorithms.

Submitted 14 February, 2012; originally announced February 2012.

Comments: 13 pages, 3 figures

MSC Class: 68W40; 68W10 ACM Class: F.2.1
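
For readers unfamiliar with the sequential building block that the abstract above parallelizes, the following is a minimal sketch of textbook Strassen multiplication for square matrices whose dimension is a power of two. It is not the communication-optimal parallel algorithm from the paper; the function names (`strassen`, `naive_multiply`) and the `counter` argument used to tally scalar multiplications are assumptions made for illustration.

```python
def add(A, B):
    """Elementwise sum of two equally sized matrices (lists of lists)."""
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def sub(A, B):
    """Elementwise difference of two equally sized matrices."""
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]


def naive_multiply(A, B):
    """Conventional O(n^3) product, used here only as a reference."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def split(A):
    """Split a 2h x 2h matrix into four h x h quadrants."""
    h = len(A) // 2
    return ([r[:h] for r in A[:h]], [r[h:] for r in A[:h]],
            [r[:h] for r in A[h:]], [r[h:] for r in A[h:]])


def strassen(A, B, counter):
    """Sequential Strassen product; n must be a power of two.

    counter is a one-element list (hypothetical helper) that accumulates
    the number of scalar multiplications performed.
    """
    n = len(A)
    if n == 1:
        counter[0] += 1
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    # Seven recursive products instead of the conventional eight.
    M1 = strassen(add(A11, A22), add(B11, B22), counter)
    M2 = strassen(add(A21, A22), B11, counter)
    M3 = strassen(A11, sub(B12, B22), counter)
    M4 = strassen(A22, sub(B21, B11), counter)
    M5 = strassen(add(A11, A12), B22, counter)
    M6 = strassen(sub(A21, A11), add(B11, B12), counter)
    M7 = strassen(sub(A12, A22), add(B21, B22), counter)
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [left + right for left, right in zip(C11, C12)]
    bottom = [left + right for left, right in zip(C21, C22)]
    return top + bottom
```

Recursing down to 1x1 blocks, an n-by-n product uses 7^(log2 n) scalar multiplications, so a 4x4 product uses 49 rather than the conventional 64. Practical implementations typically switch to a conventional kernel below some cutoff size.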

arXiv:1112.4883 [pdf, other]

Computing the Matched Filter in Linear Time

Authors: Alexander Fish, Shamgar Gurevich, Ronny Hadani, Akbar Sayeed, Oded Schwartz

Abstract: A fundamental problem in wireless communication is the time-frequency shift (TFS) problem: Find the time-frequency shift of a signal in a noisy environment. The shift is the result of time asynchronization of a sender with a receiver, and of non-zero speed of a sender with respect to a receiver. A classical solution of a discrete analog of the TFS problem is called the matched filter algorithm. It uses a pseudo-random waveform $S(t)$ of length $p$, and its arithmetic complexity is $O(p^2 \cdot \log p)$, using the fast Fourier transform. In these notes we introduce a novel approach of designing new waveforms that allow a faster matched filter algorithm. We use techniques from group representation theory to design waveforms $S(t)$, which enable us to introduce two fast matched filter (FMF) algorithms, called the flag algorithm and the cross algorithm. These methods solve the TFS problem in $O(p \cdot \log p)$ operations. We discuss applications of the algorithms to mobile communication, GPS, and radar.

Submitted 20 December, 2011; originally announced December 2011.

Comments: 6 pages
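
To make the discrete TFS problem in the abstract above concrete, here is a minimal sketch of a brute-force matched filter. It evaluates every (time shift, frequency shift) pair directly, which costs O(p^3) operations; it is neither the FFT-based O(p^2 log p) version nor the flag or cross algorithms from the paper. The noise-free setting, the Legendre-style test waveform, and all function names are assumptions chosen for illustration.

```python
import cmath


def legendre_sequence(p):
    """A +/-1 test waveform of prime length p (illustrative choice only)."""
    residues = {(x * x) % p for x in range(1, p)}
    return [1.0 if (t == 0 or t in residues) else -1.0 for t in range(p)]


def shift_signal(S, tau, omega):
    """Apply a cyclic time shift tau and a frequency shift omega to S."""
    p = len(S)
    return [cmath.exp(2j * cmath.pi * omega * t / p) * S[(t - tau) % p]
            for t in range(p)]


def matched_filter(R, S):
    """Return the (tau, omega) whose shifted copy of S best correlates with R.

    Direct evaluation: p^2 candidate shifts, each needing an O(p) inner product.
    """
    p = len(S)
    best_value, best_shift = -1.0, (0, 0)
    for tau in range(p):
        for omega in range(p):
            template = shift_signal(S, tau, omega)
            value = abs(sum(r * s.conjugate() for r, s in zip(R, template)))
            if value > best_value:
                best_value, best_shift = value, (tau, omega)
    return best_shift


# Usage: recover a known shift from a noise-free received signal.
S = legendre_sequence(31)
R = shift_signal(S, tau=5, omega=11)
recovered = matched_filter(R, S)
```

In this noise-free sketch the correlation at the true shift equals p, and by the Cauchy-Schwarz inequality no other shift of a non-constant waveform can match it, so the search recovers the shift that was applied.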

arXiv:1109.1693 [pdf, ps, other]

doi 10.1145/1989493.1989495

Graph Expansion and Communication Costs of Fast Matrix Multiplication

Authors: Grey Ballard, James Demmel, Olga Holtz, Oded Schwartz

Abstract: The communication cost of algorithms (also known as I/O-complexity) is shown to be closely related to the expansion properties of the corresponding computation graphs. We demonstrate this on Strassen's and other fast matrix multiplication algorithms, and obtain the first lower bounds on their communication costs. In the sequential case, where the processor has a fast memory of size $M$, too small to store three $n$-by-$n$ matrices, the lower bound on the number of words moved between fast and slow memory is, for many of the matrix multiplication algorithms, $\Omega((\frac{n}{\sqrt M})^{\omega_0}\cdot M)$, where $\omega_0$ is the exponent in the arithmetic count (e.g., $\omega_0 = \lg 7$ for Strassen, and $\omega_0 = 3$ for conventional matrix multiplication). With $p$ parallel processors, each with fast memory of size $M$, the lower bound is $p$ times smaller. These bounds are attainable both for sequential and for parallel algorithms and hence optimal. These bounds can also be attained by many fast algorithms in linear algebra (e.g., algorithms for LU, QR, and solving the Sylvester equation).

Submitted 8 September, 2011; originally announced September 2011.

Report number: UCB/EECS-2011-40 ACM Class: F.2.1

Journal ref: Proceedings of the 23rd annual symposium on parallelism in algorithms and architectures. ACM, 1-12. 2011 (a shorter conference version)
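
The lower bound quoted in the abstract above can be evaluated numerically to compare Strassen-like and conventional multiplication. The following is a small sketch that drops the constant hidden by the Omega notation, so the numbers indicate growth only, not exact word counts; the function names are assumptions made for illustration.

```python
import math

OMEGA_STRASSEN = math.log2(7)   # exponent for Strassen (lg 7)
OMEGA_CLASSICAL = 3.0           # exponent for conventional multiplication


def sequential_words_bound(n, M, omega0):
    """(n / sqrt(M))**omega0 * M, with the hidden constant dropped."""
    return (n / math.sqrt(M)) ** omega0 * M


def parallel_words_bound(n, M, omega0, p):
    """Per the abstract, the bound is p times smaller with p processors."""
    return sequential_words_bound(n, M, omega0) / p


# Usage: n = 1024 and fast memory M = 4096 words, so n / sqrt(M) = 16.
classical = sequential_words_bound(1024, 4096, OMEGA_CLASSICAL)  # 16**3 * 4096
strassen_like = sequential_words_bound(1024, 4096, OMEGA_STRASSEN)  # 7**4 * 4096
```

With these inputs the conventional expression evaluates to 16^3 * 4096 and the Strassen expression to 7^4 * 4096, which shows how the smaller exponent lowers the amount of data movement that any such algorithm must perform.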

arXiv:0905.2485 [pdf, ps, other]

doi 10.1137/090769156

Minimizing Communication in Linear Algebra

Authors: Grey Ballard, James Demmel, Olga Holtz, Oded Schwartz

Abstract: In 1981 Hong and Kung proved a lower bound on the amount of communication needed to perform dense matrix multiplication using the conventional $O(n^3)$ algorithm, where the input matrices were too large to fit in the small, fast memory. In 2004 Irony, Toledo and Tiskin gave a new proof of this result and extended it to the parallel case. In both cases the lower bound may be expressed as $\Omega(\#\text{arithmetic operations} / \sqrt{M})$, where $M$ is the size of the fast memory (or local memory in the parallel case). Here we generalize these results to a much wider variety of algorithms, including LU factorization, Cholesky factorization, $LDL^T$ factorization, QR factorization, algorithms for eigenvalues and singular values, i.e., essentially all direct methods of linear algebra. The proof works for dense or sparse matrices, and for sequential or parallel algorithms. In addition to lower bounds on the amount of data moved (bandwidth) we get lower bounds on the number of messages required to move it (latency). We illustrate how to extend our lower bound technique to compositions of linear algebra operations (like computing powers of a matrix), to decide whether it is enough to call a sequence of simpler optimal algorithms (like matrix multiplication) to minimize communication, or if we can do better. We give examples of both. We also show how to extend our lower bounds to certain graph theoretic problems. We point out recently designed algorithms for dense LU, Cholesky, QR, eigenvalue and the SVD problems that attain these lower bounds; implementations of LU and QR show large speedups over conventional linear algebra algorithms in standard libraries like LAPACK and ScaLAPACK. Many open problems remain.

Submitted 15 May, 2009; originally announced May 2009.

Comments: 27 pages, 2 tables

Journal ref: SIAM J. Matrix Anal. & Appl. 32 (2011), no. 3, 866-901

arXiv:0904.2115 [pdf, other]

Colorful Strips

Authors: G. Aloupis, J. Cardinal, S. Collette, S. Imahori, M. Korman, S. Langerman, O. Schwartz, S. Smorodinsky, P. Taslakian

Abstract: Given a planar point set and an integer $k$, we wish to color the points with $k$ colors so that any axis-aligned strip containing enough points contains all colors. The goal is to bound the necessary size of such a strip, as a function of $k$. We show that if the strip size is at least $2k{-}1$, such a coloring can always be found. We prove that the size of the strip is also bounded in any fixed number of dimensions. In contrast to the planar case, we show that deciding whether a 3D point set can be 2-colored so that any strip containing at least three points contains both colors is NP-complete. We also consider the problem of coloring a given set of axis-aligned strips, so that any sufficiently covered point in the plane is covered by $k$ colors. We show that in $d$ dimensions the required coverage is at most $d(k{-}1)+1$. Lower bounds are given for the two problems. This complements recent impossibility results on decomposition of strip coverings with arbitrary orientations. Finally, we study a variant where strips are replaced by wedges.

Submitted 7 April, 2011; v1 submitted 14 April, 2009; originally announced April 2009.

arXiv:0902.2537 [pdf, other]

doi 10.1137/090760969

Communication-optimal Parallel and Sequential Cholesky Decomposition

Authors: Grey Ballard, James Demmel, Olga Holtz, Oded Schwartz

Abstract: Numerical algorithms have two kinds of costs: arithmetic and communication, by which we mean either moving data between levels of a memory hierarchy (in the sequential case) or over a network connecting processors (in the parallel case). Communication costs often dominate arithmetic costs, so it is of interest to design algorithms minimizing communication. In this paper we first extend known lower bounds on the communication cost (both for bandwidth and for latency) of conventional ($O(n^3)$) matrix multiplication to Cholesky factorization, which is used for solving dense symmetric positive definite linear systems. Second, we compare the costs of various Cholesky decomposition implementations to these lower bounds and identify the algorithms and data structures that attain them. In the sequential case, we consider both the two-level and hierarchical memory models. Combined with prior results in [13, 14, 15], this gives a set of communication-optimal algorithms for $O(n^3)$ implementations of the three basic factorizations of dense linear algebra: LU with pivoting, QR and Cholesky. But it goes beyond this prior work on sequential LU by optimizing communication for any number of levels of memory hierarchy.

Submitted 12 April, 2010; v1 submitted 15 February, 2009; originally announced February 2009.

Comments: 29 pages, 2 tables, 6 figures

ACM Class: F.2.1

Journal ref: SIAM J. Sci. Comput. 32, (2010) pp. 3495-3523

Showing 1–19 of 19 results for author: Schwartz, O