Search | arXiv e-print repository

An Empirical Investigation of Matrix Factorization Methods for Pre-trained Transformers

Authors: Ashim Gupta, Sina Mahdipour Saravani, P. Sadayappan, Vivek Srikumar

Abstract: The increasing size of transformer-based models in NLP makes the question of compressing them important. In this work, we present a comprehensive analysis of factorization based model compression techniques. Specifically, we focus on comparing straightforward low-rank factorization against the recently introduced Monarch factorization, which exhibits impressive performance preservation on the GLUE benchmark. To mitigate stability issues associated with low-rank factorization of the matrices in pre-trained transformers, we introduce a staged factorization approach wherein layers are factorized one by one instead of being factorized simultaneously. Through this strategy we significantly enhance the stability and reliability of the compression process. Further, we introduce a simple block-wise low-rank factorization method, which has a close relationship to Monarch factorization. Our experiments lead to the surprising conclusion that straightforward low-rank factorization consistently outperforms Monarch factorization across both different compression ratios and six different text classification tasks.

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.11209 [pdf, other]

doi 10.1145/3624062.3625122

What Operations can be Performed Directly on Compressed Arrays, and with What Error?

Authors: Tripti Agarwal, Harvey Dam, Dorra Ben Khalifa, Matthieu Martel, P. Sadayappan, Ganesh Gopalakrishnan

Abstract: In response to the rapidly escalating costs of computing with large matrices and tensors caused by data movement, several lossy compression methods have been developed to significantly reduce data volumes. Unfortunately, all these methods require the data to be decompressed before further computations are done. In this work, we develop a lossy compressor that allows a dozen fairly fundamental operations directly on compressed data while offering good compression ratios and modest errors. We implement a new compressor PyBlaz based on the familiar GPU-powered PyTorch framework, and evaluate it on three non-trivial applications, choosing different number systems for internal representation. Our results demonstrate that the compressed-domain operations achieve good scalability with problem sizes while incurring errors well within acceptable limits. To our best knowledge, this is the first such lossy compressor that supports compressed-domain operations while achieving acceptable performance as well as error.

Submitted 17 June, 2024; originally announced June 2024.

Comments: An extended but earlier version of paper in https://dl.acm.org/doi/10.1145/3624062.3625122 published at the DRBSD Workshop in 2023

arXiv:2305.19400 [pdf, other]

Automating GPU Scalability for Complex Scientific Models: Phonon Boltzman Transport Equation

Authors: Eric Heisler, Siddharth Saurav, Aadesh Deshmukh, Sandip Mazumder, Ponnuswamy Sadayappan, Hari Sundar

Abstract: Heterogeneous computing environments combining CPU and GPU resources provide a great boost to large-scale scientific computing applications. Code generation utilities that partition the work into CPU and GPU tasks while considering data movement costs allow researchers to more quickly and easily develop high-performance solutions, and make these resources accessible to a larger user base. We present developments for a domain-specific language (DSL) and code generation framework for solving partial differential equations (PDEs). These enhancements facilitate GPU-accelerated solution of the Boltzmann transport equation (BTE) for phonons, which is the governing equation for simulating thermal transport in semiconductor materials at sub-micron scales. The solution of the BTE involves thousands of coupled PDEs as well as complicated boundary conditions and nonlinear processing at each time step. These developments enable the DSL to generate configurable hybrid GPU/CPU code that couples accelerated kernels with user-defined code. We observed performance improvements of around 18X compared to a CPU-only version produced by this same DSL with minimal additional programming effort.

Submitted 30 May, 2023; originally announced May 2023.

arXiv:2211.03715 [pdf, other]

doi 10.1145/3572848.3577478

TDC: Towards Extremely Efficient CNNs on GPUs via Hardware-Aware Tucker Decomposition

Authors: Lizhi Xiang, Miao Yin, Chengming Zhang, Aravind Sukumaran-Rajam, P. Sadayappan, Bo Yuan, Dingwen Tao

Abstract: Tucker decomposition is one of the SOTA CNN model compression techniques. However, unlike the FLOPs reduction, we observe very limited inference time reduction with Tucker-compressed models using existing GPU software such as cuDNN. To this end, we propose an efficient end-to-end framework that can generate highly accurate and compact CNN models via Tucker decomposition and optimized inference code on GPUs. Specifically, we propose an ADMM-based training algorithm that can achieve highly accurate Tucker-format models. We also develop a high-performance kernel for Tucker-format convolutions and analytical performance models to guide the selection of execution parameters. We further propose a co-design framework to determine the proper Tucker ranks driven by practical inference time (rather than FLOPs). Our evaluation on five modern CNNs with A100 demonstrates that our compressed models with our optimized code achieve up to 2.21X speedup over cuDNN, 1.12X speedup over TVM, and 3.27X over the original models using cuDNN with at most 0.05% accuracy loss.

Submitted 4 January, 2023; v1 submitted 7 November, 2022; originally announced November 2022.

Comments: 14 pages, 9 figures, 3 tables, accepted by PPoPP '23

arXiv:2105.13480 [pdf]

doi 10.1145/3409964.3461828

Efficient distributed algorithms for Convolutional Neural Networks

Authors: Rui Li, Yufan Xu, Aravind Sukumaran-Rajam, Atanas Rountev, P Sadayappan

Abstract: Several efficient distributed algorithms have been developed for matrix-matrix multiplication: the 3D algorithm, the 2D SUMMA algorithm, and the 2.5D algorithm. Each of these algorithms was independently conceived and they trade-off memory needed per node and the inter-node data communication volume. The convolutional neural network (CNN) computation may be viewed as a generalization of matrix-multiplication combined with neighborhood stencil computations. We develop communication-efficient distributed-memory algorithms for CNNs that are analogous to the 2D/2.5D/3D algorithms for matrix-matrix multiplication.

Submitted 30 May, 2021; v1 submitted 27 May, 2021; originally announced May 2021.

Comments: Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '21), July 6--8, 2021, Virtual Event, USA

Journal ref: Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '21), July 6--8, 2021, Virtual Event, USA

arXiv:2101.09808 [pdf]

doi 10.1145/3445814.3446759

Analytical Characterization and Design Space Exploration for Optimization of CNNs

Authors: Rui Li, Yufan Xu, Aravind Sukumaran-Rajam, Atanas Rountev, P. Sadayappan

Abstract: Moving data through the memory hierarchy is a fundamental bottleneck that can limit the performance of core algorithms of machine learning, such as convolutional neural networks (CNNs). Loop-level optimization, including loop tiling and loop permutation, are fundamental transformations to reduce data movement. However, the search space for finding the best loop-level optimization configuration is explosively large. This paper develops an analytical modeling approach for finding the best loop-level optimization configuration for CNNs on multi-core CPUs. Experimental evaluation shows that this approach achieves comparable or better performance than state-of-the-art libraries and auto-tuning based optimizers for CNNs.

Submitted 5 March, 2021; v1 submitted 24 January, 2021; originally announced January 2021.

Comments: In proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '21), April 19-23, 2021, Virtual, USA

Journal ref: Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021

arXiv:1911.06664 [pdf, other]

Automated Derivation of Parametric Data Movement Lower Bounds for Affine Programs

Authors: Auguste Olivry, Julien Langou, Louis-Noël Pouchet, P. Sadayappan, Fabrice Rastello

Abstract: For most relevant computation, the energy and time needed for data movement dominates that for performing arithmetic operations on all computing systems today. Hence it is of critical importance to understand the minimal total data movement achievable during the execution of an algorithm. The achieved total data movement for different schedules of an algorithm can vary widely depending on how efficiently the cache is used, e.g., untiled versus effectively tiled matrix-matrix multiplication. A significant current challenge is that no existing tool is able to meaningfully quantify the potential reduction to the data movement of a computation that can be achieved by more effective use of the cache through operation rescheduling. Asymptotic parametric expressions of data movement lower bounds have previously been manually derived for a limited number of algorithms, often without scaling constants. In this paper, we present the first compile-time approach for deriving non-asymptotic parametric expressions of data movement lower bounds for arbitrary affine computations. The approach has been implemented in a fully automatic tool (IOLB) that can generate these lower bounds for input affine programs. IOLB's use is demonstrated by exercising it on all the benchmarks of the PolyBench suite. The advantages of IOLB are many: (1) IOLB enables us to derive bounds for few dozens of algorithms for which these lower bounds have never been derived. This reflects an increase of productivity by automation. (2) Anyone is able to obtain these lower bounds through IOLB, no expertise is required. (3) For some of the most well-studied algorithms, the lower bounds obtained by IOLB are higher than any previously reported manually derived lower bounds.

Submitted 15 November, 2019; originally announced November 2019.

arXiv:1904.07935 [pdf, other]

PL-NMF: Parallel Locality-Optimized Non-negative Matrix Factorization

Authors: Gordon E. Moon, Aravind Sukumaran-Rajam, Srinivasan Parthasarathy, P. Sadayappan

Abstract: Non-negative Matrix Factorization (NMF) is a key kernel for unsupervised dimension reduction used in a wide range of applications, including topic modeling, recommender systems and bioinformatics. Due to the compute-intensive nature of applications that must perform repeated NMF, several parallel implementations have been developed in the past. However, existing parallel NMF algorithms have not addressed data locality optimizations, which are critical for high performance since data movement costs greatly exceed the cost of arithmetic/logic operations on current computer systems. In this paper, we devise a parallel NMF algorithm based on the HALS (Hierarchical Alternating Least Squares) scheme that incorporates algorithmic transformations to enhance data locality. Efficient realizations of the algorithm on multi-core CPUs and GPUs are developed, demonstrating significant performance improvement over existing state-of-the-art parallel NMF algorithms.

Submitted 16 April, 2019; originally announced April 2019.

Comments: 11 pages, 5 tables, 9 figures

arXiv:1904.03329 [pdf, other]

Load-Balanced Sparse MTTKRP on GPUs

Authors: Israt Nisa, Jiajia Li, Aravind Sukumaran-Rajam, Richard Vuduc, P. Sadayappan

Abstract: Sparse matricized tensor times Khatri-Rao product (MTTKRP) is one of the most computationally expensive kernels in sparse tensor computations. This work focuses on optimizing the MTTKRP operation on GPUs, addressing both performance and storage requirements. We begin by identifying the performance bottlenecks in directly extending the state-of-the-art CSF (compressed sparse fiber) format from CPUs to GPUs. A significant challenge with GPUs compared to multicore CPUs is that of utilizing the much greater degree of parallelism in a load-balanced fashion for irregular computations like sparse MTTKRP. To address this issue, we develop a new storage-efficient representation for tensors that enables high-performance, load-balanced execution of MTTKRP on GPUs. A GPU implementation of sparse MTTKRP using the new sparse tensor representation is shown to outperform all currently known parallel sparse CPU and GPU MTTKRP implementations.

Submitted 5 April, 2019; originally announced April 2019.

arXiv:1811.00839 [pdf, other]

ATP: Directed Graph Embedding with Asymmetric Transitivity Preservation

Authors: Jiankai Sun, Bortik Bandyopadhyay, Armin Bashizade, Jiongqian Liang, P. Sadayappan, Srinivasan Parthasarathy

Abstract: Directed graphs have been widely used in Community Question Answering services (CQAs) to model asymmetric relationships among different types of nodes in CQA graphs, e.g., question, answer, user. Asymmetric transitivity is an essential property of directed graphs, since it can play an important role in downstream graph inference and analysis. Question difficulty and user expertise follow the characteristic of asymmetric transitivity. Maintaining such properties, while reducing the graph to a lower dimensional vector embedding space, has been the focus of much recent research. In this paper, we tackle the challenge of directed graph embedding with asymmetric transitivity preservation and then leverage the proposed embedding method to solve a fundamental task in CQAs: how to appropriately route and assign newly posted questions to users with the suitable expertise and interest in CQAs. The technique incorporates graph hierarchy and reachability information naturally by relying on a non-linear transformation that operates on the core reachability and implicit hierarchy within such graphs. Subsequently, the methodology levers a factorization-based approach to generate two embedding vectors for each node within the graph, to capture the asymmetric transitivity. Extensive experiments show that our framework consistently and significantly outperforms the state-of-the-art baselines on two diverse real-world tasks: link prediction, and question difficulty estimation and expert finding in online forums like Stack Exchange. Particularly, our framework can support inductive embedding learning for newly posted questions (unseen nodes during training), and therefore can properly route and assign these kinds of questions to experts in CQAs.

Submitted 6 November, 2018; v1 submitted 2 November, 2018; originally announced November 2018.

Comments: has been accepted to the Thirty-Third AAAI Conference on Artificial Intelligence (AAAI 2019), acceptance rate: 1150/7095 = 16.2%

arXiv:1411.2286 [pdf, ps, other]

doi 10.1145/2676726.2677010

On Characterizing the Data Access Complexity of Programs

Authors: Venmugil Elango, Fabrice Rastello, Louis-Noel Pouchet, J. Ramanujam, P. Sadayappan

Abstract: Technology trends will cause data movement to account for the majority of energy expenditure and execution time on emerging computers. Therefore, computational complexity will no longer be a sufficient metric for comparing algorithms, and a fundamental characterization of data access complexity will be increasingly important. The problem of developing lower bounds for data access complexity has been modeled using the formalism of Hong & Kung's red/blue pebble game for computational directed acyclic graphs (CDAGs). However, previously developed approaches to lower bounds analysis for the red/blue pebble game are very limited in effectiveness when applied to CDAGs of real programs, with computations comprised of multiple sub-computations with differing DAG structure. We address this problem by developing an approach for effectively composing lower bounds based on graph decomposition. We also develop a static analysis algorithm to derive the asymptotic data-access lower bounds of programs, as a function of the problem size and cache size.

Submitted 9 November, 2014; originally announced November 2014.

ACM Class: F.2; D.2.8

arXiv:1404.4767 [pdf, other]

On Characterizing the Data Movement Complexity of Computational DAGs for Parallel Execution

Authors: Venmugil Elango, Fabrice Rastello, Louis-Noël Pouchet, J. Ramanujam, P. Sadayappan

Abstract: Technology trends are making the cost of data movement increasingly dominant, both in terms of energy and time, over the cost of performing arithmetic operations in computer systems. The fundamental ratio of aggregate data movement bandwidth to the total computational power (also referred to as the machine balance parameter) in parallel computer systems is decreasing. It is therefore of considerable importance to characterize the inherent data movement requirements of parallel algorithms, so that the minimal architectural balance parameters required to support them on future systems can be well understood. In this paper, we develop an extension of the well-known red-blue pebble game to develop lower bounds on the data movement complexity for the parallel execution of computational directed acyclic graphs (CDAGs) on parallel systems. We model multi-node multi-core parallel systems, with the total physical memory distributed across the nodes (that are connected through some interconnection network) and in a multi-level shared cache hierarchy for processors within a node. We also develop new techniques for lower bound characterization of non-homogeneous CDAGs. We demonstrate the use of the methodology by analyzing the CDAGs of several numerical algorithms, to develop lower bounds on data movement for their parallel execution.

Submitted 18 April, 2014; originally announced April 2014.

Report number: RR-8522

arXiv:1401.5024 [pdf, other]

Beyond Reuse Distance Analysis: Dynamic Analysis for Characterization of Data Locality Potential

Authors: Naznin Fauzia, Venmugil Elango, Mahesh Ravishankar, J. Ramanujam, Fabrice Rastello, Atanas Rountev, Louis-Noël Pouchet, P. Sadayappan

Abstract: Emerging computer architectures will feature drastically decreased flops/byte (ratio of peak processing rate to memory bandwidth) as highlighted by recent studies on Exascale architectural trends. Further, flops are getting cheaper while the energy cost of data movement is increasingly dominant. The understanding and characterization of data locality properties of computations is critical in order to guide efforts to enhance data locality. Reuse distance analysis of memory address traces is a valuable tool to perform data locality characterization of programs. A single reuse distance analysis can be used to estimate the number of cache misses in a fully associative LRU cache of any size, thereby providing estimates on the minimum bandwidth requirements at different levels of the memory hierarchy to avoid being bandwidth bound. However, such an analysis only holds for the particular execution order that produced the trace. It cannot estimate potential improvement in data locality through dependence preserving transformations that change the execution schedule of the operations in the computation. In this article, we develop a novel dynamic analysis approach to characterize the inherent locality properties of a computation and thereby assess the potential for data locality enhancement via dependence preserving transformations. The execution trace of a code is analyzed to extract a computational directed acyclic graph (CDAG) of the data dependences. The CDAG is then partitioned into convex subsets, and the convex partitioning is used to reorder the operations in the execution trace to enhance data locality. The approach enables us to go beyond reuse distance analysis of a single specific order of execution of the operations of a computation in characterization of its data locality properties. It can serve a valuable role in identifying promising code regions for manual transformation, as well as assessing the effectiveness of compiler transformations for data locality enhancement. We demonstrate the effectiveness of the approach using a number of benchmarks, including case studies where the potential shown by the analysis is exploited to achieve lower data movement costs and better performance.

Submitted 21 December, 2013; originally announced January 2014.

Comments: Transaction on Architecture and Code Optimization (2014)

arXiv:1103.2405 [pdf]

Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining

Authors: Xintian Yang, Srinivasan Parthasarathy, Ponnuswamy Sadayappan

Abstract: Scaling up the sparse matrix-vector multiplication kernel on modern Graphics Processing Units (GPU) has been at the heart of numerous studies in both academia and industry. In this article we present a novel non-parametric, self-tunable, approach to data representation for computing this kernel, particularly targeting sparse matrices representing power-law graphs. Using real data, we show how our representation scheme, coupled with a novel tiling algorithm, can yield significant benefits over the current state of the art GPU efforts on a number of core data mining algorithms such as PageRank, HITS and Random Walk with Restart.

Submitted 11 March, 2011; originally announced March 2011.

Comments: VLDB2011

Journal ref: Proceedings of the VLDB Endowment (PVLDB), Vol. 4, No. 4, pp. 231-242 (2011)

Showing 1–14 of 14 results for author: Sadayappan, P