Search | arXiv e-print repository

SMaLL: A Software Framework for portable Machine Learning Libraries

Authors: Upasana Sridhar, Nicholai Tukanov, Elliott Binder, Tze Meng Low, Scott McMillan, Martin D. Schatz

Abstract: Interest in deploying Deep Neural Network (DNN) inference on edge devices has resulted in an explosion of the number and types of hardware platforms to use. While the high-level programming interface, such as TensorFlow, can be readily ported across different devices, high-performance inference implementations rely on a good map** of the high-level interface to the target hardware platform. Comm… ▽ More Interest in deploying Deep Neural Network (DNN) inference on edge devices has resulted in an explosion of the number and types of hardware platforms to use. While the high-level programming interface, such as TensorFlow, can be readily ported across different devices, high-performance inference implementations rely on a good map** of the high-level interface to the target hardware platform. Commonly, this map** may use optimizing compilers to generate code at compile time or high-performance vendor libraries that have been specialized to the target platform. Both approaches rely on expert knowledge to produce the map**, which may be time-consuming and difficult to extend to new architectures. In this work, we present a DNN library framework, SMaLL, that is easily extensible to new architectures. The framework uses a unified loop structure and shared, cache-friendly data format across all intermediate layers, eliminating the time and memory overheads incurred by data transformation between layers. Layers are implemented by simply specifying the layer's dimensions and a kernel -- the key computing operations of each layer. The unified loop structure and kernel abstraction allows us to reuse code across layers and computing platforms. New architectures only require the 100s of lines in the kernel to be redesigned. To show the benefits of our approach, we have developed software that supports a range of layer types and computing platforms, which is easily extensible for rapidly instantiating high performance DNN libraries. We evaluate our software by instantiating networks from the TinyMLPerf benchmark suite on 5 ARM platforms and 1 x86 platform ( an AMD Zen 2). Our framework shows end-to-end performance that is comparable to or better than ML Frameworks such as TensorFlow, TVM and LibTorch. △ Less

Submitted 8 March, 2023; originally announced March 2023.

Comments: 14 pages, 12 figures

arXiv:2110.01409 [pdf, other]

Delayed Asynchronous Iterative Graph Algorithms

Authors: Mark P. Blanco, Scott McMillan, Tze Meng Low

Abstract: Iterative graph algorithms often compute intermediate values and update them as computation progresses. Updated output values are used as inputs for computations in current or subsequent iterations; hence the number of iterations required for values to converge can potentially reduce if the newest values are asynchronously made available to other updates computed in the same iteration. In a multi-… ▽ More Iterative graph algorithms often compute intermediate values and update them as computation progresses. Updated output values are used as inputs for computations in current or subsequent iterations; hence the number of iterations required for values to converge can potentially reduce if the newest values are asynchronously made available to other updates computed in the same iteration. In a multi-threaded shared memory system, the immediate propagation of updated values can cause memory contention that may offset the benefit of propagating updates sooner. In some cases, the benefit of a smaller number of iterations may be diminished by each iteration taking longer. Our key idea is to combine the low memory contention that synchronous approaches have with the faster information sharing of asynchronous approaches. Our hybrid approach buffers updates from threads locally before committing them to the global store to control how often threads may cause conflicts for others while still sharing data within one iteration and hence speeding convergence. On a 112-thread CPU system, our hybrid approach attains up to 4.5% - 19.4% speedup over an asynchronous approach for Pagerank and up to 1.9% - 17% speedup over asynchronous Bellman Ford SSSP. Further, our hybrid approach attains 2.56x better performance than the synchronous approach. Finally, we provide insights as to why delaying updates is not helpful on certain graphs where connectivity is clustered on the main diagonal of the adjacency matrix. △ Less

Submitted 29 September, 2021; originally announced October 2021.

Comments: 6 pages, 6 figures, 2 tables, IEEE High Performance Extreme Computing (HPEC) Conference 2021

arXiv:2009.07935 [pdf, other]

doi 10.1109/HPEC43674.2020.9286188

Towards an Objective Metric for the Performance of Exact Triangle Count

Authors: Mark P. Blanco, Scott McMillan, Tze Meng Low

Abstract: The performance of graph algorithms is often measured in terms of the number of traversed edges per second (TEPS). However, this performance metric is inadequate for a graph operation such as exact triangle counting. In triangle counting, execution times on graphs with a similar number of edges can be distinctly different as demonstrated by results from the past Graph Challenge entries. We discuss… ▽ More The performance of graph algorithms is often measured in terms of the number of traversed edges per second (TEPS). However, this performance metric is inadequate for a graph operation such as exact triangle counting. In triangle counting, execution times on graphs with a similar number of edges can be distinctly different as demonstrated by results from the past Graph Challenge entries. We discuss the need for an objective performance metric for graph operations and the desired characteristics of such a metric such that it more accurately captures the interactions between the amount of work performed and the capabilities of the hardware on which the code is executed. Using exact triangle counting as an example, we derive a metric that captures how certain techniques employed in many implementations improve performance. We demonstrate that our proposed metric can be used to evaluate and compare multiple approaches for triangle counting, using a SIMD approach as a case study against a scalar baseline. △ Less

Submitted 29 September, 2021; v1 submitted 16 September, 2020; originally announced September 2020.

Comments: 6 Pages, 2020 IEEE High Performance Extreme Computing Conference(HPEC)

arXiv:2009.07929 [pdf, other]

doi 10.1109/HPEC.2019.8916473

Exploration of Fine-Grained Parallelism for Load Balancing Eager K-truss on GPU and CPU

Authors: Mark Blanco, Tze Meng Low, Kyungjoo Kim

Abstract: In this work we present a performance exploration on Eager K-truss, a linear-algebraic formulation of the K-truss graph algorithm. We address performance issues related to load imbalance of parallel tasks in symmetric, triangular graphs by presenting a fine-grained parallel approach to executing the support computation. This approach also increases available parallelism, making it amenable to GPU… ▽ More In this work we present a performance exploration on Eager K-truss, a linear-algebraic formulation of the K-truss graph algorithm. We address performance issues related to load imbalance of parallel tasks in symmetric, triangular graphs by presenting a fine-grained parallel approach to executing the support computation. This approach also increases available parallelism, making it amenable to GPU execution. We demonstrate our fine-grained parallel approach using implementations in Kokkos and evaluate them on an Intel Skylake CPU and an Nvidia Tesla V100 GPU. Overall, we observe between a 1.261. 48x improvement on the CPU and a 9.97-16.92x improvement on the GPU due to our fine-grained parallel formulation. △ Less

Submitted 16 September, 2020; originally announced September 2020.

Comments: 2019 IEEE High Performance Extreme Computing Conference (HPEC)

arXiv:1911.06895 [pdf, other]

doi 10.1109/IPDPSW.2019.00047

Delta-step** SSSP: from Vertices and Edges to GraphBLAS Implementations

Authors: Upasana Sridhar, Mark Blanco, Rahul Mayuranath, Daniele G. Spampinato, Tze Meng Low, Scott McMillan

Abstract: GraphBLAS is an interface for implementing graph algorithms. Algorithms implemented using the GraphBLAS interface are cast in terms of linear algebra-like operations. However, many graph algorithms are canonically described in terms of operations on vertices and/or edges. Despite the known duality between these two representations, the differences in the way algorithms are described using the two… ▽ More GraphBLAS is an interface for implementing graph algorithms. Algorithms implemented using the GraphBLAS interface are cast in terms of linear algebra-like operations. However, many graph algorithms are canonically described in terms of operations on vertices and/or edges. Despite the known duality between these two representations, the differences in the way algorithms are described using the two approaches can pose considerable difficulties in the adoption of the GraphBLAS as standard interface for development. This paper investigates a systematic approach for translating a graph algorithm described in the canonical vertex and edge representation into an implementation that leverages the GraphBLAS interface. We present a two-step approach to this problem. First, we express common vertex- and edge-centric design patterns using a linear algebraic language. Second, we map this intermediate representation to the GraphBLAS interface. We illustrate our approach by translating the delta-step** single source shortest path algorithm from its canonical description to a GraphBLAS implementation, and highlight lessons learned when implementing using GraphBLAS. △ Less

Submitted 16 September, 2020; v1 submitted 15 November, 2019; originally announced November 2019.

Comments: 10 pages, 4 figures, IPDPSW GRAPL 2019 Workshop

Journal ref: IEEE International Parallel and Distributed Processing Symposium Workshops, 2019, pp 241 to 250

arXiv:1904.10119 [pdf, other]

A Flexible Framework for Parallel Multi-Dimensional DFTs

Authors: Doru Thom Popovici, Martin D. Schatz, Franz Franchetti, Tze Meng Low

Abstract: Multi-dimensional discrete Fourier transforms (DFT) are typically decomposed into multiple 1D transforms. Hence, parallel implementations of any multi-dimensional DFT focus on parallelizing within or across the 1D DFT. Existing DFT packages exploit the inherent parallelism across the 1D DFTs and offer rigid frameworks, that cannot be extended to incorporate both forms of parallelism and various da… ▽ More Multi-dimensional discrete Fourier transforms (DFT) are typically decomposed into multiple 1D transforms. Hence, parallel implementations of any multi-dimensional DFT focus on parallelizing within or across the 1D DFT. Existing DFT packages exploit the inherent parallelism across the 1D DFTs and offer rigid frameworks, that cannot be extended to incorporate both forms of parallelism and various data layouts to enable some of the parallelism. However, in the era of exascale, where systems have thousand of nodes and intricate network topologies, flexibility and parallel efficiency are key aspects all multi-dimensional DFT frameworks need to have in order to map and scale the computation appropriately. In this work, we present a flexible framework, built on the Redistribution Operations and Tensor Expressions (ROTE) framework, that facilitates the development of a family of parallel multi-dimensional DFT algorithms by 1) unifying the two parallelization schemes within a single framework, 2) exploiting the two different parallelization schemes to different degrees and 3) using different data layouts to distribute the data across the compute nodes. We demonstrate the need of a versatile framework and thus a need for a family of parallel multi-dimensional DFT algorithms on the K-Computer, where we show almost linear strong scaling results for problem sizes of 1024^3 on 32k compute nodes. △ Less

Submitted 22 December, 2019; v1 submitted 22 April, 2019; originally announced April 2019.

arXiv:1903.01042 [pdf, other]

CodeNet: Training Large Scale Neural Networks in Presence of Soft-Errors

Authors: Sanghamitra Dutta, Ziqian Bai, Tze Meng Low, Pulkit Grover

Abstract: This work proposes the first strategy to make distributed training of neural networks resilient to computing errors, a problem that has remained unsolved despite being first posed in 1956 by von Neumann. He also speculated that the efficiency and reliability of the human brain is obtained by allowing for low power but error-prone components with redundancy for error-resilience. It is surprising th… ▽ More This work proposes the first strategy to make distributed training of neural networks resilient to computing errors, a problem that has remained unsolved despite being first posed in 1956 by von Neumann. He also speculated that the efficiency and reliability of the human brain is obtained by allowing for low power but error-prone components with redundancy for error-resilience. It is surprising that this problem remains open, even as massive artificial neural networks are being trained on increasingly low-cost and unreliable processing units. Our coding-theory-inspired strategy, "CodeNet," solves this problem by addressing three challenges in the science of reliable computing: (i) Providing the first strategy for error-resilient neural network training by encoding each layer separately; (ii) Kee** the overheads of coding (encoding/error-detection/decoding) low by obviating the need to re-encode the updated parameter matrices after each iteration from scratch. (iii) Providing a completely decentralized implementation with no central node (which is a single point of failure), allowing all primary computational steps to be error-prone. We theoretically demonstrate that CodeNet has higher error tolerance than replication, which we leverage to speed up computation time. Simultaneously, CodeNet requires lower redundancy than replication, and equal computational and communication costs in scaling sense. We first demonstrate the benefits of CodeNet in reducing expected computation time over replication when accounting for checkpointing. Our experiments show that CodeNet achieves the best accuracy-runtime tradeoff compared to both replication and uncoded strategies. CodeNet is a significant step towards biologically plausible neural network training, that could hold the key to orders of magnitude efficiency improvements. △ Less

Submitted 3 March, 2019; originally announced March 2019.

Comments: Currently under review

arXiv:1811.10751 [pdf, other]

doi 10.1109/ISIT.2018.8437852

A Unified Coded Deep Neural Network Training Strategy Based on Generalized PolyDot Codes for Matrix Multiplication

Authors: Sanghamitra Dutta, Ziqian Bai, Haewon Jeong, Tze Meng Low, Pulkit Grover

Abstract: This paper has two contributions. First, we propose a novel coded matrix multiplication technique called Generalized PolyDot codes that advances on existing methods for coded matrix multiplication under storage and communication constraints. This technique uses "garbage alignment," i.e., aligning computations in coded computing that are not a part of the desired output. Generalized PolyDot codes b… ▽ More This paper has two contributions. First, we propose a novel coded matrix multiplication technique called Generalized PolyDot codes that advances on existing methods for coded matrix multiplication under storage and communication constraints. This technique uses "garbage alignment," i.e., aligning computations in coded computing that are not a part of the desired output. Generalized PolyDot codes bridge between Polynomial codes and MatDot codes, trading off between recovery threshold and communication costs. Second, we demonstrate that Generalized PolyDot can be used for training large Deep Neural Networks (DNNs) on unreliable nodes prone to soft-errors. This requires us to address three additional challenges: (i) prohibitively large overhead of coding the weight matrices in each layer of the DNN at each iteration; (ii) nonlinear operations during training, which are incompatible with linear coding; and (iii) not assuming presence of an error-free master node, requiring us to architect a fully decentralized implementation without any "single point of failure." We allow all primary DNN training steps, namely, matrix multiplication, nonlinear activation, Hadamard product, and update steps as well as the encoding/decoding to be error-prone. We consider the case of mini-batch size $B=1$, as well as $B>1$, leveraging coded matrix-vector products, and matrix-matrix products respectively. The problem of DNN training under soft-errors also motivates an interesting, probabilistic error model under which a real number $(P,Q)$ MDS code is shown to correct $P-Q-1$ errors with probability $1$ as compared to $\lfloor \frac{P-Q}{2} \rfloor$ for the more conventional, adversarial error model. We also demonstrate that our proposed strategy can provide unbounded gains in error tolerance over a competing replication strategy and a preliminary MDS-code-based strategy for both these error models. △ Less

Submitted 26 November, 2018; originally announced November 2018.

Comments: Presented in part at the IEEE International Symposium on Information Theory 2018 (Submission Date: Jan 12 2018); Currently under review at the IEEE Transactions on Information Theory

arXiv:1809.10170 [pdf, other]

High Performance Zero-Memory Overhead Direct Convolutions

Authors: Jiyuan Zhang, Franz Franchetti, Tze Meng Low

Abstract: The computation of convolution layers in deep neural networks typically rely on high performance routines that trade space for time by using additional memory (either for packing purposes or required as part of the algorithm) to improve performance. The problems with such an approach are two-fold. First, these routines incur additional memory overhead which reduces the overall size of the network… ▽ More The computation of convolution layers in deep neural networks typically rely on high performance routines that trade space for time by using additional memory (either for packing purposes or required as part of the algorithm) to improve performance. The problems with such an approach are two-fold. First, these routines incur additional memory overhead which reduces the overall size of the network that can fit on embedded devices with limited memory capacity. Second, these high performance routines were not optimized for performing convolution, which means that the performance obtained is usually less than conventionally expected. In this paper, we demonstrate that direct convolution, when implemented correctly, eliminates all memory overhead, and yields performance that is between 10% to 400% times better than existing high performance implementations of convolution layers on conventional and embedded CPU architectures. We also show that a high performance direct convolution exhibits better scaling performance, i.e. suffers less performance drop, when increasing the number of threads. △ Less

Submitted 19 September, 2018; originally announced September 2018.

Comments: the 35th International Conference on Machine Learning(ICML 2018), camera ready

arXiv:1805.09891 [pdf, other]

Coded FFT and Its Communication Overhead

Authors: Haewon Jeong, Tze Meng Low, Pulkit Grover

Abstract: We propose a coded computing strategy and examine communication costs of coded computing algorithms to make distributed Fast Fourier Transform (FFT) resilient to errors during the computation. We apply maximum distance separable (MDS) codes to a widely used "Transpose" algorithm for parallel FFT. In the uncoded distributed FFT algorithm, the most expensive step is a single "all-to-all" communicati… ▽ More We propose a coded computing strategy and examine communication costs of coded computing algorithms to make distributed Fast Fourier Transform (FFT) resilient to errors during the computation. We apply maximum distance separable (MDS) codes to a widely used "Transpose" algorithm for parallel FFT. In the uncoded distributed FFT algorithm, the most expensive step is a single "all-to-all" communication. We compare this with communication overhead of coding in our coded FFT algorithm. We show that by using a systematic MDS code, the communication overhead of coding is negligible in comparison with the communication costs inherent in an uncoded FFT implementation if the number of parity nodes is at most $o(\log K)$, where $K$ is the number of systematic nodes. △ Less

Submitted 24 May, 2018; originally announced May 2018.

arXiv:1611.08035 [pdf, other]

Automating the Last-Mile for High Performance Dense Linear Algebra

Authors: Richard Michael Veras, Tze Meng Low, Tyler Michael Smith, Robert van de Geijn, Franz Franchetti

Abstract: High performance dense linear algebra (DLA) libraries often rely on a general matrix multiply (Gemm) kernel that is implemented using assembly or with vector intrinsics. In particular, the real-valued Gemm kernels provide the overwhelming fraction of performance for the complex-valued Gemm kernels, along with the entire level-3 BLAS and many of the real and complex LAPACK routines. Thus,achieving… ▽ More High performance dense linear algebra (DLA) libraries often rely on a general matrix multiply (Gemm) kernel that is implemented using assembly or with vector intrinsics. In particular, the real-valued Gemm kernels provide the overwhelming fraction of performance for the complex-valued Gemm kernels, along with the entire level-3 BLAS and many of the real and complex LAPACK routines. Thus,achieving high performance for the Gemm kernel translates into a high performance linear algebra stack above this kernel. However, it is a monumental task for a domain expert to manually implement the kernel for every library-supported architecture. This leads to the belief that the craft of a Gemm kernel is more dark art than science. It is this premise that drives the popularity of autotuning with code generation in the domain of DLA. This paper, instead, focuses on an analytical approach to code generation of the Gemm kernel for different architecture, in order to shed light on the details or voo-doo required for implementing a high performance Gemm kernel. We distill the implementation of the kernel into an even smaller kernel, an outer-product, and analytically determine how available SIMD instructions can be used to compute the outer-product efficiently. We codify this approach into a system to automatically generate a high performance SIMD implementation of the Gemm kernel. Experimental results demonstrate that our approach yields generated kernels with performance that is competitive with kernels implemented manually or using empirical search. △ Less

Submitted 28 April, 2017; v1 submitted 23 November, 2016; originally announced November 2016.

arXiv:1301.7744 [pdf, ps, other]

doi 10.1137/130907215

Exploiting Symmetry in Tensors for High Performance: Multiplication with Symmetric Tensors

Authors: Martin D. Schatz, Tze Meng Low, Robert A. van de Geijn, Tamara G. Kolda

Abstract: Symmetric tensor operations arise in a wide variety of computations. However, the benefits of exploiting symmetry in order to reduce storage and computation is in conflict with a desire to simplify memory access patterns. In this paper, we propose a blocked data structure (Blocked Compact Symmetric Storage) wherein we consider the tensor by blocks and store only the unique blocks of a symmetric te… ▽ More Symmetric tensor operations arise in a wide variety of computations. However, the benefits of exploiting symmetry in order to reduce storage and computation is in conflict with a desire to simplify memory access patterns. In this paper, we propose a blocked data structure (Blocked Compact Symmetric Storage) wherein we consider the tensor by blocks and store only the unique blocks of a symmetric tensor. We propose an algorithm-by-blocks, already shown of benefit for matrix computations, that exploits this storage format by utilizing a series of temporary tensors to avoid redundant computation. Further, partial symmetry within temporaries is exploited to further avoid redundant storage and redundant computation. A detailed analysis shows that, relative to storing and computing with tensors without taking advantage of symmetry and partial symmetry, storage requirements are reduced by a factor of $ O\left( m! \right)$ and computational requirements by a factor of $O\left( (m+1)!/2^m \right)$, where $ m $ is the order of the tensor. However, as the analysis shows, care must be taken in choosing the correct block size to ensure these storage and computational benefits are achieved (particularly for low-order tensors). An implementation demonstrates that storage is greatly reduced and the complexity introduced by storing and computing with tensors by blocks is manageable. Preliminary results demonstrate that computational time is also reduced. The paper concludes with a discussion of how insights in this paper point to opportunities for generalizing recent advances in the domain of linear algebra libraries to the field of multi-linear computation. △ Less

Submitted 9 April, 2014; v1 submitted 31 January, 2013; originally announced January 2013.

MSC Class: 15-02 (Primary)

Journal ref: SIAM Journal on Scientific Computing, Vol. 36, No. 5, pp. C453-C479, September 2014

Showing 1–12 of 12 results for author: Low, T M