Showing 1–13 of 13 results for author: Dehnavi, M M

  1. arXiv:2405.16325  [pdf, other]

    cs.LG cs.AI

    SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs

    Authors: Mohammad Mozaffari, Amir Yazdanbakhsh, Zhao Zhang, Maryam Mehri Dehnavi

    Abstract: We propose SLoPe, a Double-Pruned Sparse Plus Lazy Low-rank Adapter Pretraining method for LLMs that improves the accuracy of sparse LLMs while accelerating their pretraining and inference and reducing their memory footprint. Sparse pretraining of LLMs reduces the accuracy of the model; to overcome this, prior work uses dense models during fine-tuning. SLoPe improves the accuracy of sparsely pretr…

    Submitted 14 June, 2024; v1 submitted 25 May, 2024; originally announced May 2024.

  2. arXiv:2306.01685  [pdf, other]

    cs.LG cs.AI cs.CV math.OC

    MKOR: Momentum-Enabled Kronecker-Factor-Based Optimizer Using Rank-1 Updates

    Authors: Mohammad Mozaffari, Sikan Li, Zhao Zhang, Maryam Mehri Dehnavi

    Abstract: This work proposes a Momentum-Enabled Kronecker-Factor-Based Optimizer Using Rank-1 updates, called MKOR, that improves the training time and convergence properties of deep neural networks (DNNs). Second-order techniques, while enjoying higher convergence rates vs first-order counterparts, have cubic complexity with respect to either the model size and/or the training batch size. Hence they exhibi…

    Submitted 30 January, 2024; v1 submitted 2 June, 2023; originally announced June 2023.

    Comments: Published at 37th Conference on Neural Information Processing Systems (NeurIPS 2023)

  3. arXiv:2305.13450  [pdf, other]

    cs.DC

    A Framework for Fine-Grained Synchronization of Dependent GPU Kernels

    Authors: Abhinav Jangda, Saeed Maleki, Maryam Mehri Dehnavi, Madan Musuvathi, Olli Saarikivi

    Abstract: Machine Learning (ML) models execute several parallel computations including Generalized Matrix Multiplication, Convolution, Dropout, etc. These computations are commonly executed on Graphics Processing Units (GPUs), by dividing the computation into independent processing blocks, known as tiles. Since the number of tiles is usually higher than the execution units of a GPU, tiles are executed on a…

    Submitted 14 February, 2024; v1 submitted 22 May, 2023; originally announced May 2023.

    Comments: Accepted at CGO 2024

  4. arXiv:2111.12243  [pdf, other]

    cs.PL

    Vectorizing Sparse Matrix Codes with Dependency Driven Trace Analysis

    Authors: Zachary Cetinic, Kazem Cheshmi, Maryam Mehri Dehnavi

    Abstract: Sparse computations frequently appear in scientific simulations and the performance of these simulations relies heavily on the optimization of the sparse codes. The compact data structures and irregular computation patterns in sparse matrix computations introduce challenges to vectorizing these codes. Available approaches primarily vectorize regular regions of computations in the sparse code. They a…

    Submitted 8 December, 2021; v1 submitted 23 November, 2021; originally announced November 2021.

  5. arXiv:2111.12238  [pdf, other]

    cs.PL

    Composing Loop-carried Dependence with Other Loops

    Authors: Kazem Cheshmi, Michelle Mills Strout, Maryam Mehri Dehnavi

    Abstract: Sparse fusion is a compile-time loop transformation and runtime scheduling implemented as a domain-specific code generator. Sparse fusion generates efficient parallel code for the combination of two sparse matrix kernels where at least one of the kernels has loop-carried dependencies. Available implementations optimize individual sparse kernels. When optimized separately, the irregular dependence…

    Submitted 23 November, 2021; originally announced November 2021.

  6. arXiv:2108.09365  [pdf, other]

    math.OC cs.DC

    L-DQN: An Asynchronous Limited-Memory Distributed Quasi-Newton Method

    Authors: Bugra Can, Saeed Soori, Maryam Mehri Dehnavi, Mert Gürbüzbalaban

    Abstract: This work proposes a distributed algorithm for solving empirical risk minimization problems, called L-DQN, under the master/worker communication model. L-DQN is a distributed limited-memory quasi-Newton method that supports asynchronous computations among the worker nodes. Our method is efficient both in terms of storage and communication costs, i.e., in every iteration the master node and workers…

    Submitted 4 September, 2021; v1 submitted 20 August, 2021; originally announced August 2021.

    MSC Class: 68W15 (Primary)

  7. arXiv:2106.03947  [pdf, other]

    cs.LG

    TENGraD: Time-Efficient Natural Gradient Descent with Exact Fisher-Block Inversion

    Authors: Saeed Soori, Bugra Can, Baourun Mu, Mert Gürbüzbalaban, Maryam Mehri Dehnavi

    Abstract: This work proposes a time-efficient Natural Gradient Descent method, called TENGraD, with linear convergence guarantees. Computing the inverse of the neural network's Fisher information matrix is expensive in NGD because the Fisher matrix is large. Approximate NGD methods such as KFAC attempt to improve NGD's running time and practical application by reducing the Fisher matrix inversion cost with…

    Submitted 3 March, 2022; v1 submitted 7 June, 2021; originally announced June 2021.

  8. arXiv:1907.08526  [pdf, ps, other]

    cs.DC cs.LG

    ASYNC: A Cloud Engine with Asynchrony and History for Distributed Machine Learning

    Authors: Saeed Soori, Bugra Can, Mert Gurbuzbalaban, Maryam Mehri Dehnavi

    Abstract: ASYNC is a framework that supports the implementation of asynchrony and history for optimization methods on distributed computing platforms. The popularity of asynchronous optimization methods has increased in distributed machine learning. However, their applicability and practical experimentation on distributed systems are limited because current bulk-processing cloud engines do not provide a rob…

    Submitted 20 February, 2020; v1 submitted 19 July, 2019; originally announced July 2019.

  9. arXiv:1812.07152  [pdf, other]

    cs.DC

    MatRox: Modular approach for improving data locality in Hierarchical (Mat)rix App(Rox)imation

    Authors: Bangtian Liu, Kazem Cheshmi, Saeed Soori, Michelle Mills Strout, Maryam Mehri Dehnavi

    Abstract: Hierarchical matrix approximations have gained significant traction in the machine learning and scientific community as they exploit available low-rank structures in kernel methods to compress the kernel matrix. The resulting compressed matrix, HMatrix, is used to reduce the computational complexity of operations such as HMatrix-matrix multiplications with tuneable accuracy in an evaluation phase.…

    Submitted 30 November, 2019; v1 submitted 17 December, 2018; originally announced December 2018.

  10. arXiv:1807.10852  [pdf, other]

    cs.PL

    Sparse Matrix Code Dependence Analysis Simplification at Compile Time

    Authors: Mahdi Soltan Mohammadi, Kazem Cheshmi, Ganesh Gopalakrishnan, Mary Hall, Maryam Mehri Dehnavi, Anand Venkat, Tomofumi Yuki, Michelle Mills Strout

    Abstract: Analyzing array-based computations to determine data dependences is useful for many applications including automatic parallelization, race detection, computation and communication overlap, verification, and shape analysis. For sparse matrix codes, array data dependence analysis is made more difficult by the use of index arrays that make it possible to store only the nonzero entries of the matrix (…

    Submitted 27 July, 2018; originally announced July 2018.

  11. arXiv:1710.08883  [pdf, other]

    cs.DC cs.LG math.NA math.OC

    Avoiding Communication in Proximal Methods for Convex Optimization Problems

    Authors: Saeed Soori, Aditya Devarakonda, James Demmel, Mert Gurbuzbalaban, Maryam Mehri Dehnavi

    Abstract: The fast iterative soft thresholding algorithm (FISTA) is used to solve convex regularized optimization problems in machine learning. Distributed implementations of the algorithm have become popular since they enable the analysis of large datasets. However, existing formulations of FISTA communicate data at every iteration which reduces its performance on modern distributed architectures. The comm…

    Submitted 24 October, 2017; originally announced October 2017.

  12. A Unified Optimization Approach for Sparse Tensor Operations on GPUs

    Authors: Bangtian Liu, Chengyao Wen, Anand D. Sarwate, Maryam Mehri Dehnavi

    Abstract: Sparse tensors appear in many large-scale applications with multidimensional and sparse data. While multidimensional sparse data often need to be processed on manycore processors, attempts to develop highly-optimized GPU-based implementations of sparse tensor operations are rare. The irregular computation patterns and sparsity structures as well as the large memory footprints of sparse tensor oper…

    Submitted 28 May, 2017; originally announced May 2017.

  13. Sympiler: Transforming Sparse Matrix Codes by Decoupling Symbolic Analysis

    Authors: Kazem Cheshmi, Shoaib Kamil, Michelle Mills Strout, Maryam Mehri Dehnavi

    Abstract: Sympiler is a domain-specific code generator that optimizes sparse matrix computations by decoupling the symbolic analysis phase from the numerical manipulation stage in sparse codes. The computation patterns in sparse numerical methods are guided by the input sparsity structure and the sparse algorithm itself. In many real-world simulations, the sparsity pattern changes little or not at all. Symp…

    Submitted 18 May, 2017; originally announced May 2017.

    Comments: 12 pages

    Journal ref: in SC 2017, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis