Showing 1–5 of 5 results for author: Sukumaran-Rajam, A

  1. TDC: Towards Extremely Efficient CNNs on GPUs via Hardware-Aware Tucker Decomposition

    Authors: Lizhi Xiang, Miao Yin, Chengming Zhang, Aravind Sukumaran-Rajam, P. Sadayappan, Bo Yuan, Dingwen Tao

    Abstract: Tucker decomposition is one of the SOTA CNN model compression techniques. However, unlike the FLOPs reduction, we observe very limited inference time reduction with Tucker-compressed models using existing GPU software such as cuDNN. To this end, we propose an efficient end-to-end framework that can generate highly accurate and compact CNN models via Tucker decomposition and optimized inference cod…

    Submitted 4 January, 2023; v1 submitted 7 November, 2022; originally announced November 2022.

    Comments: 14 pages, 9 figures, 3 tables, accepted by PPoPP '23

  2. Efficient distributed algorithms for Convolutional Neural Networks

    Authors: Rui Li, Yufan Xu, Aravind Sukumaran-Rajam, Atanas Rountev, P. Sadayappan

    Abstract: Several efficient distributed algorithms have been developed for matrix-matrix multiplication: the 3D algorithm, the 2D SUMMA algorithm, and the 2.5D algorithm. Each of these algorithms was independently conceived, and they trade off memory needed per node against inter-node data communication volume. The convolutional neural network (CNN) computation may be viewed as a generalization of matrix-m…

    Submitted 30 May, 2021; v1 submitted 27 May, 2021; originally announced May 2021.

    Comments: Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '21), July 6–8, 2021, Virtual Event, USA

    Journal ref: Proceedings of the 33rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '21), July 6–8, 2021, Virtual Event, USA

  3. Analytical Characterization and Design Space Exploration for Optimization of CNNs

    Authors: Rui Li, Yufan Xu, Aravind Sukumaran-Rajam, Atanas Rountev, P. Sadayappan

    Abstract: Moving data through the memory hierarchy is a fundamental bottleneck that can limit the performance of core algorithms of machine learning, such as convolutional neural networks (CNNs). Loop-level optimizations, including loop tiling and loop permutation, are fundamental transformations to reduce data movement. However, the search space for finding the best loop-level optimization configuration is…

    Submitted 5 March, 2021; v1 submitted 24 January, 2021; originally announced January 2021.

    Comments: In proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '21), April 19-23, 2021, Virtual, USA

    Journal ref: Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021

  4. arXiv:1904.07935

    cs.LG cs.DC stat.ML

    PL-NMF: Parallel Locality-Optimized Non-negative Matrix Factorization

    Authors: Gordon E. Moon, Aravind Sukumaran-Rajam, Srinivasan Parthasarathy, P. Sadayappan

    Abstract: Non-negative Matrix Factorization (NMF) is a key kernel for unsupervised dimension reduction used in a wide range of applications, including topic modeling, recommender systems and bioinformatics. Due to the compute-intensive nature of applications that must perform repeated NMF, several parallel implementations have been developed in the past. However, existing parallel NMF algorithms have not ad…

    Submitted 16 April, 2019; originally announced April 2019.

    Comments: 11 pages, 5 tables, 9 figures

  5. arXiv:1904.03329

    cs.DC

    Load-Balanced Sparse MTTKRP on GPUs

    Authors: Israt Nisa, Jiajia Li, Aravind Sukumaran-Rajam, Richard Vuduc, P. Sadayappan

    Abstract: Sparse matricized tensor times Khatri-Rao product (MTTKRP) is one of the most computationally expensive kernels in sparse tensor computations. This work focuses on optimizing the MTTKRP operation on GPUs, addressing both performance and storage requirements. We begin by identifying the performance bottlenecks in directly extending the state-of-the-art CSF (compressed sparse fiber) format from CPUs…

    Submitted 5 April, 2019; originally announced April 2019.