-
SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs
Authors:
Mohammad Mozaffari,
Amir Yazdanbakhsh,
Zhao Zhang,
Maryam Mehri Dehnavi
Abstract:
We propose SLoPe, a Double-Pruned Sparse Plus Lazy Low-rank Adapter Pretraining method for LLMs that improves the accuracy of sparse LLMs while accelerating their pretraining and inference and reducing their memory footprint. Sparse pretraining of LLMs reduces the accuracy of the model, to overcome this, prior work uses dense models during fine-tuning. SLoPe improves the accuracy of sparsely pretr…
▽ More
We propose SLoPe, a Double-Pruned Sparse Plus Lazy Low-rank Adapter Pretraining method for LLMs that improves the accuracy of sparse LLMs while accelerating their pretraining and inference and reducing their memory footprint. Sparse pretraining of LLMs reduces the accuracy of the model, to overcome this, prior work uses dense models during fine-tuning. SLoPe improves the accuracy of sparsely pretrained models by adding low-rank adapters in the final 1% iterations of pretraining without adding significant overheads to the model pretraining and inference. In addition, SLoPe uses a double-pruned backward pass formulation that prunes the transposed weight matrix using N:M sparsity structures to enable an accelerated sparse backward pass. SLoPe accelerates the training and inference of models with billions of parameters up to $1.14\times$ and $1.34\times$ respectively (OPT-33B and OPT-66B) while reducing their memory usage by up to $0.77\times$ and $0.51\times$ for training and inference respectively.
△ Less
Submitted 14 June, 2024; v1 submitted 25 May, 2024;
originally announced May 2024.
-
MKOR: Momentum-Enabled Kronecker-Factor-Based Optimizer Using Rank-1 Updates
Authors:
Mohammad Mozaffari,
Sikan Li,
Zhao Zhang,
Maryam Mehri Dehnavi
Abstract:
This work proposes a Momentum-Enabled Kronecker-Factor-Based Optimizer Using Rank-1 updates, called MKOR, that improves the training time and convergence properties of deep neural networks (DNNs). Second-order techniques, while enjoying higher convergence rates vs first-order counterparts, have cubic complexity with respect to either the model size and/or the training batch size. Hence they exhibi…
▽ More
This work proposes a Momentum-Enabled Kronecker-Factor-Based Optimizer Using Rank-1 updates, called MKOR, that improves the training time and convergence properties of deep neural networks (DNNs). Second-order techniques, while enjoying higher convergence rates vs first-order counterparts, have cubic complexity with respect to either the model size and/or the training batch size. Hence they exhibit poor scalability and performance in transformer models, e.g. large language models (LLMs), because the batch sizes in these models scale by the attention mechanism sequence length, leading to large model size and batch sizes. MKOR's complexity is quadratic with respect to the model size, alleviating the computation bottlenecks in second-order methods. Because of their high computation complexity, state-of-the-art implementations of second-order methods can only afford to update the second order information infrequently, and thus do not fully exploit the promise of better convergence from these updates. By reducing the communication complexity of the second-order updates as well as achieving a linear communication complexity, MKOR increases the frequency of second order updates. We also propose a hybrid version of MKOR (called MKOR-H) that mid-training falls backs to a first order optimizer if the second order updates no longer accelerate convergence. Our experiments show that MKOR outperforms state -of-the-art first order methods, e.g. the LAMB optimizer, and best implementations of second-order methods, i.e. KAISA/KFAC, up to 2.57x and 1.85x respectively on BERT-Large-Uncased on 64 GPUs.
△ Less
Submitted 30 January, 2024; v1 submitted 2 June, 2023;
originally announced June 2023.
-
A Framework for Fine-Grained Synchronization of Dependent GPU Kernels
Authors:
Abhinav Jangda,
Saeed Maleki,
Maryam Mehri Dehnavi,
Madan Musuvathi,
Olli Saarikivi
Abstract:
Machine Learning (ML) models execute several parallel computations including Generalized Matrix Multiplication, Convolution, Dropout, etc. These computations are commonly executed on Graphics Processing Units (GPUs), by dividing the computation into independent processing blocks, known as tiles. Since the number of tiles are usually higher than the execution units of a GPU, tiles are executed on a…
▽ More
Machine Learning (ML) models execute several parallel computations including Generalized Matrix Multiplication, Convolution, Dropout, etc. These computations are commonly executed on Graphics Processing Units (GPUs), by dividing the computation into independent processing blocks, known as tiles. Since the number of tiles are usually higher than the execution units of a GPU, tiles are executed on all execution units in one or more waves. However, the number of tiles is not always a multiple of the number of execution units. Thus, tiles executed in the final wave can under-utilize the GPU.
To address this issue, we present cuSync, a framework for synchronizing dependent kernels using a user-defined fine-grained synchronization policy to improve the GPU utilization. cuSync synchronizes tiles instead of kernels, which allows executing independent tiles of dependent kernels concurrently. We also present a compiler to generate diverse fine-grained synchronization policies based on dependencies between kernels. Our experiments found that synchronizing CUDA kernels using cuSync reduces the inference times of four popular ML models: MegatronLM GPT-3 by up to 15%, LLaMA by up to 14%, ResNet-38 by up to 22%, and VGG-19 by up to 16% over several batch sizes.
△ Less
Submitted 14 February, 2024; v1 submitted 22 May, 2023;
originally announced May 2023.
-
Vectorizing Sparse Matrix Codes with Dependency Driven Trace Analysis
Authors:
Zachary Cetinic,
Kazem Cheshmi,
Maryam Mehri Dehnavi
Abstract:
Sparse computations frequently appear in scientific simulations and the performance of these simulations rely heavily on the optimization of the sparse codes. The compact data structures and irregular computation patterns in sparse matrix computations introduce challenges to vectorizing these codes. Available approaches primarily vectorize regular regions of computations in the sparse code. They a…
▽ More
Sparse computations frequently appear in scientific simulations and the performance of these simulations rely heavily on the optimization of the sparse codes. The compact data structures and irregular computation patterns in sparse matrix computations introduce challenges to vectorizing these codes. Available approaches primarily vectorize regular regions of computations in the sparse code. They also reorganize data and computations, at a cost, to increase the number of regular regions. In this work, we propose a novel polyhedral model, called the partially strided codelets (PSC), that enables the vectorization of computation regions with irregular data access patterns. PSCs also improve data locality in sparse computation. Our DDF inspector-executor framework efficiently mines the memory accesses in the sparse computation, using an access function differentiation approach, to find PSC codelets. It generates vectorized code for the sparse matrix multiplication kernel (SpMV), a kernel with parallel outer loops, and for kernels with carried dependence, specifically the sparse triangular solver (SpTRSV). We demonstrate the performance of the DDF-generated code on a set of 60 large and small matrices (0.05-130M nonzeros). DDF outperforms the highly specialized library MKL with an average speedup of 1.93 and 4.5X for SpMV and SpTRSV, respectively. For the same matrices, DDF outperforms the state-of-the-art inspector-executor framework Sympiler [1] for the SpTRSV kernel by up to 11X and the work by Augustine et. al [2] for the SpMV kernel by up to 12X.
△ Less
Submitted 8 December, 2021; v1 submitted 23 November, 2021;
originally announced November 2021.
-
Composing Loop-carried Dependence with Other Loops
Authors:
Kazem Cheshmi,
Michelle Mills Strout,
Maryam Mehri Dehnavi
Abstract:
Sparse fusion is a compile-time loop transformation and runtime scheduling implemented as a domain-specific code generator. Sparse fusion generates efficient parallel code for the combination of two sparse matrix kernels where at least one of the kernels has loop-carried dependencies. Available implementations optimize individual sparse kernels. When optimized separately, the irregular dependence…
▽ More
Sparse fusion is a compile-time loop transformation and runtime scheduling implemented as a domain-specific code generator. Sparse fusion generates efficient parallel code for the combination of two sparse matrix kernels where at least one of the kernels has loop-carried dependencies. Available implementations optimize individual sparse kernels. When optimized separately, the irregular dependence patterns of sparse kernels create synchronization overheads and load imbalance, and their irregular memory access patterns result in inefficient cache usage, which reduces parallel efficiency. Sparse fusion uses a novel inspection strategy with code transformations to generate parallel fused code for sparse kernel combinations that is optimized for data locality and load balance. Code generated by Sparse fusion outperforms the existing implementations ParSy and MKL on average 1.6X and 5.1X respectively and outperforms the LBC and DAGP coarsening strategies applied to a fused data dependence graph on average 5.1X and 7.2X respectively for various kernel combinations.
△ Less
Submitted 23 November, 2021;
originally announced November 2021.
-
L-DQN: An Asynchronous Limited-Memory Distributed Quasi-Newton Method
Authors:
Bugra Can,
Saeed Soori,
Maryam Mehri Dehnavi,
Mert Gürbüzbalaban
Abstract:
This work proposes a distributed algorithm for solving empirical risk minimization problems, called L-DQN, under the master/worker communication model. L-DQN is a distributed limited-memory quasi-Newton method that supports asynchronous computations among the worker nodes. Our method is efficient both in terms of storage and communication costs, i.e., in every iteration the master node and workers…
▽ More
This work proposes a distributed algorithm for solving empirical risk minimization problems, called L-DQN, under the master/worker communication model. L-DQN is a distributed limited-memory quasi-Newton method that supports asynchronous computations among the worker nodes. Our method is efficient both in terms of storage and communication costs, i.e., in every iteration the master node and workers communicate vectors of size $O(d)$, where $d$ is the dimension of the decision variable, and the amount of memory required on each node is $O(md)$, where $m$ is an adjustable parameter. To our knowledge, this is the first distributed quasi-Newton method with provable global linear convergence guarantees in the asynchronous setting where delays between nodes are present. Numerical experiments are provided to illustrate the theory and the practical performance of our method.
△ Less
Submitted 4 September, 2021; v1 submitted 20 August, 2021;
originally announced August 2021.
-
TENGraD: Time-Efficient Natural Gradient Descent with Exact Fisher-Block Inversion
Authors:
Saeed Soori,
Bugra Can,
Baourun Mu,
Mert Gürbüzbalaban,
Maryam Mehri Dehnavi
Abstract:
This work proposes a time-efficient Natural Gradient Descent method, called TENGraD, with linear convergence guarantees. Computing the inverse of the neural network's Fisher information matrix is expensive in NGD because the Fisher matrix is large. Approximate NGD methods such as KFAC attempt to improve NGD's running time and practical application by reducing the Fisher matrix inversion cost with…
▽ More
This work proposes a time-efficient Natural Gradient Descent method, called TENGraD, with linear convergence guarantees. Computing the inverse of the neural network's Fisher information matrix is expensive in NGD because the Fisher matrix is large. Approximate NGD methods such as KFAC attempt to improve NGD's running time and practical application by reducing the Fisher matrix inversion cost with approximation. However, the approximations do not reduce the overall time significantly and lead to less accurate parameter updates and loss of curvature information. TENGraD improves the time efficiency of NGD by computing Fisher block inverses with a computationally efficient covariance factorization and reuse method. It computes the inverse of each block exactly using the Woodbury matrix identity to preserve curvature information while admitting (linear) fast convergence rates. Our experiments on image classification tasks for state-of-the-art deep neural architecture on CIFAR-10, CIFAR-100, and Fashion-MNIST show that TENGraD significantly outperforms state-of-the-art NGD methods and often stochastic gradient descent in wall-clock time.
△ Less
Submitted 3 March, 2022; v1 submitted 7 June, 2021;
originally announced June 2021.
-
ASYNC: A Cloud Engine with Asynchrony and History for Distributed Machine Learning
Authors:
Saeed Soori,
Bugra Can,
Mert Gurbuzbalaba,
Maryam Mehri Dehnavi
Abstract:
ASYNC is a framework that supports the implementation of asynchrony and history for optimization methods on distributed computing platforms. The popularity of asynchronous optimization methods has increased in distributed machine learning. However, their applicability and practical experimentation on distributed systems are limited because current bulk-processing cloud engines do not provide a rob…
▽ More
ASYNC is a framework that supports the implementation of asynchrony and history for optimization methods on distributed computing platforms. The popularity of asynchronous optimization methods has increased in distributed machine learning. However, their applicability and practical experimentation on distributed systems are limited because current bulk-processing cloud engines do not provide a robust support for asynchrony and history. With introducing three main modules and bookkee** system-specific and application parameters, ASYNC provides practitioners with a framework to implement asynchronous machine learning methods. To demonstrate ease-of-implementation in ASYNC, the synchronous and asynchronous variants of two well-known optimization methods, stochastic gradient descent and SAGA, are demonstrated in ASYNC.
△ Less
Submitted 20 February, 2020; v1 submitted 19 July, 2019;
originally announced July 2019.
-
MatRox: Modular approach for improving data locality in Hierarchical (Mat)rix App(Rox)imation
Authors:
Bangtian Liu,
Kazem Cheshmi,
Saeed Soori,
Michelle Mills Strout,
Maryam Mehri Dehnavi
Abstract:
Hierarchical matrix approximations have gained significant traction in the machine learning and scientific community as they exploit available low-rank structures in kernel methods to compress the kernel matrix. The resulting compressed matrix, HMatrix, is used to reduce the computational complexity of operations such as HMatrix-matrix multiplications with tuneable accuracy in an evaluation phase.…
▽ More
Hierarchical matrix approximations have gained significant traction in the machine learning and scientific community as they exploit available low-rank structures in kernel methods to compress the kernel matrix. The resulting compressed matrix, HMatrix, is used to reduce the computational complexity of operations such as HMatrix-matrix multiplications with tuneable accuracy in an evaluation phase. Existing implementations of HMatrix evaluations do not preserve locality and often lead to unbalanced parallel execution with high synchronization. Also, current solutions require the compression phase to re-execute if the kernel method or the required accuracy change. In this work, we describe MatRox, a framework that uses novel structure analysis strategies, blocking and coarsen, with code specialization and a storage format to improve locality and create load-balanced parallel tasks for HMatrix-matrix multiplications. Modularization of the matrix compression phase enables the reuse of computations when there are changes to the input accuracy and the kernel function. The MatRox-generated code for matrix-matrix multiplication is 2.98x, 1.60x, and 5.98x faster than library implementations available in GOFMM, SMASH, and STRUMPACK respectively. Additionally, the ability to reuse portions of the compression computation for changes to the accuracy leads to up to 2.64x improvement with MatRox over five changes to accuracy using GOFMM.
△ Less
Submitted 30 November, 2019; v1 submitted 17 December, 2018;
originally announced December 2018.
-
Sparse Matrix Code Dependence Analysis Simplification at Compile Time
Authors:
Mahdi Soltan Mohammadi,
Kazem Cheshmi,
Ganesh Gopalakrishnan,
Mary Hall,
Maryam Mehri Dehnavi,
Anand Venkat,
Tomofumi Yuki,
Michelle Mills Strout
Abstract:
Analyzing array-based computations to determine data dependences is useful for many applications including automatic parallelization, race detection, computation and communication overlap, verification, and shape analysis. For sparse matrix codes, array data dependence analysis is made more difficult by the use of index arrays that make it possible to store only the nonzero entries of the matrix (…
▽ More
Analyzing array-based computations to determine data dependences is useful for many applications including automatic parallelization, race detection, computation and communication overlap, verification, and shape analysis. For sparse matrix codes, array data dependence analysis is made more difficult by the use of index arrays that make it possible to store only the nonzero entries of the matrix (e.g., in A[B[i]], B is an index array). Here, dependence analysis is often stymied by such indirect array accesses due to the values of the index array not being available at compile time. Consequently, many dependences cannot be proven unsatisfiable or determined until runtime. Nonetheless, index arrays in sparse matrix codes often have properties such as monotonicity of index array elements that can be exploited to reduce the amount of runtime analysis needed. In this paper, we contribute a formulation of array data dependence analysis that includes encoding index array properties as universally quantified constraints. This makes it possible to leverage existing SMT solvers to determine whether such dependences are unsatisfiable and significantly reduces the number of dependences that require runtime analysis in a set of eight sparse matrix kernels. Another contribution is an algorithm for simplifying the remaining satisfiable data dependences by discovering equalities and/or subset relationships. These simplifications are essential to make a runtime-inspection-based approach feasible.
△ Less
Submitted 27 July, 2018;
originally announced July 2018.
-
Avoiding Communication in Proximal Methods for Convex Optimization Problems
Authors:
Saeed Soori,
Aditya Devarakonda,
James Demmel,
Mert Gurbuzbalaban,
Maryam Mehri Dehnavi
Abstract:
The fast iterative soft thresholding algorithm (FISTA) is used to solve convex regularized optimization problems in machine learning. Distributed implementations of the algorithm have become popular since they enable the analysis of large datasets. However, existing formulations of FISTA communicate data at every iteration which reduces its performance on modern distributed architectures. The comm…
▽ More
The fast iterative soft thresholding algorithm (FISTA) is used to solve convex regularized optimization problems in machine learning. Distributed implementations of the algorithm have become popular since they enable the analysis of large datasets. However, existing formulations of FISTA communicate data at every iteration which reduces its performance on modern distributed architectures. The communication costs of FISTA, including bandwidth and latency costs, is closely tied to the mathematical formulation of the algorithm. This work reformulates FISTA to communicate data at every k iterations and reduce data communication when operating on large data sets. We formulate the algorithm for two different optimization methods on the Lasso problem and show that the latency cost is reduced by a factor of k while bandwidth and floating-point operation costs remain the same. The convergence rates and stability properties of the reformulated algorithms are similar to the standard formulations. The performance of communication-avoiding FISTA and Proximal Newton methods is evaluated on 1 to 1024 nodes for multiple benchmarks and demonstrate average speedups of 3-10x with scaling properties that outperform the classical algorithms.
△ Less
Submitted 24 October, 2017;
originally announced October 2017.
-
A Unified Optimization Approach for Sparse Tensor Operations on GPUs
Authors:
Bangtian Liu,
Chengyao Wen,
Anand D. Sarwate,
Maryam Mehri Dehnavi
Abstract:
Sparse tensors appear in many large-scale applications with multidimensional and sparse data. While multidimensional sparse data often need to be processed on manycore processors, attempts to develop highly-optimized GPU-based implementations of sparse tensor operations are rare. The irregular computation patterns and sparsity structures as well as the large memory footprints of sparse tensor oper…
▽ More
Sparse tensors appear in many large-scale applications with multidimensional and sparse data. While multidimensional sparse data often need to be processed on manycore processors, attempts to develop highly-optimized GPU-based implementations of sparse tensor operations are rare. The irregular computation patterns and sparsity structures as well as the large memory footprints of sparse tensor operations make such implementations challenging. We leverage the fact that sparse tensor operations share similar computation patterns to propose a unified tensor representation called F-COO. Combined with GPU-specific optimizations, F-COO provides highly-optimized implementations of sparse tensor computations on GPUs. The performance of the proposed unified approach is demonstrated for tensor-based kernels such as the Sparse Matricized Tensor- Times-Khatri-Rao Product (SpMTTKRP) and the Sparse Tensor- Times-Matrix Multiply (SpTTM) and is used in tensor decomposition algorithms. Compared to state-of-the-art work we improve the performance of SpTTM and SpMTTKRP up to 3.7 and 30.6 times respectively on NVIDIA Titan-X GPUs. We implement a CANDECOMP/PARAFAC (CP) decomposition and achieve up to 14.9 times speedup using the unified method over state-of-the-art libraries on NVIDIA Titan-X GPUs.
△ Less
Submitted 28 May, 2017;
originally announced May 2017.
-
Sympiler: Transforming Sparse Matrix Codes by Decoupling Symbolic Analysis
Authors:
Kazem Cheshmi,
Shoaib Kamil,
Michelle Mills Strout,
Maryam Mehri Dehnavi
Abstract:
Sympiler is a domain-specific code generator that optimizes sparse matrix computations by decoupling the symbolic analysis phase from the numerical manipulation stage in sparse codes. The computation patterns in sparse numerical methods are guided by the input sparsity structure and the sparse algorithm itself. In many real-world simulations, the sparsity pattern changes little or not at all. Symp…
▽ More
Sympiler is a domain-specific code generator that optimizes sparse matrix computations by decoupling the symbolic analysis phase from the numerical manipulation stage in sparse codes. The computation patterns in sparse numerical methods are guided by the input sparsity structure and the sparse algorithm itself. In many real-world simulations, the sparsity pattern changes little or not at all. Sympiler takes advantage of these properties to symbolically analyze sparse codes at compile-time and to apply inspector-guided transformations that enable applying low-level transformations to sparse codes. As a result, the Sympiler-generated code outperforms highly-optimized matrix factorization codes from commonly-used specialized libraries, obtaining average speedups over Eigen and CHOLMOD of 3.8X and 1.5X respectively.
△ Less
Submitted 18 May, 2017;
originally announced May 2017.