Showing 1–9 of 9 results for author: Aga, S

  1. arXiv:2403.20297  [pdf, other]

    cs.AR cs.DC

    Balanced Data Placement for GEMV Acceleration with Processing-In-Memory

    Authors: Mohamed Assem Ibrahim, Mahzabeen Islam, Shaizeen Aga

    Abstract: With unprecedented demand for generative AI (GenAI) inference, acceleration of primitives that dominate GenAI such as general matrix-vector multiplication (GEMV) is receiving considerable attention. A challenge with GEMVs is the high memory bandwidth this primitive demands. Multiple memory vendors have proposed commercially viable processing-in-memory (PIM) prototypes that attain bandwidth boost o…

    Submitted 1 April, 2024; v1 submitted 29 March, 2024; originally announced March 2024.

  2. arXiv:2401.16677  [pdf, other]

    cs.AR cs.DC cs.LG

    T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives

    Authors: Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair

    Abstract: Large Language Models increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus, hide this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serializ…

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: To appear at the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2024

    ACM Class: C.2.4; C.1.2

  3. arXiv:2311.05034  [pdf, other]

    cs.AR cs.DC

    Just-in-time Quantization with Processing-In-Memory for Efficient ML Training

    Authors: Mohamed Assem Ibrahim, Shaizeen Aga, Ada Li, Suchita Pati, Mahzabeen Islam

    Abstract: Data format innovations have been critical for machine learning (ML) scaling, which in turn fuels ground-breaking ML capabilities. However, even in the presence of low-precision formats, model weights are often stored in both high-precision and low-precision during training. Furthermore, with emerging directional data formats (e.g., MX9, MX6, etc.) multiple low-precision weight copies can be requi…

    Submitted 8 November, 2023; originally announced November 2023.

  4. arXiv:2309.07984  [pdf, other]

    cs.AR

    Inclusive-PIM: Hardware-Software Co-design for Broad Acceleration on Commercial PIM Architectures

    Authors: Johnathan Alsop, Shaizeen Aga, Mohamed Ibrahim, Mahzabeen Islam, Andrew Mccrabb, Nuwan Jayasena

    Abstract: Continual demand for memory bandwidth has made it worthwhile for memory vendors to reassess processing in memory (PIM), which enables higher bandwidth by placing compute units in/near-memory. As such, memory vendors have recently proposed commercially viable PIM designs. However, these proposals are largely driven by the needs of (a narrow set of) machine learning (ML) primitives. While such propo…

    Submitted 17 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

  5. arXiv:2308.03973  [pdf, other]

    cs.AR cs.DC

    Collaborative Acceleration for FFT on Commercial Processing-In-Memory Architectures

    Authors: Mohamed Assem Ibrahim, Shaizeen Aga

    Abstract: This paper evaluates the efficacy of recent commercial processing-in-memory (PIM) solutions to accelerate fast Fourier transform (FFT), an important primitive across several domains. Specifically, we observe that efficient implementations of FFT on modern GPUs are memory bandwidth bound. As such, the memory bandwidth boost availed by commercial PIM solutions makes a case for PIM to accelerate FFT.…

    Submitted 7 August, 2023; originally announced August 2023.

  6. arXiv:2304.09411  [pdf, other]

    cs.AR

    Egalitarian ORAM: Wear-Leveling for ORAM

    Authors: Yi Zheng, Aasheesh Kolli, Shaizeen Aga

    Abstract: While non-volatile memories (NVMs) provide several desirable characteristics like better density and comparable energy efficiency than DRAM, DRAM-like performance, and disk-like durability, the limited endurance NVMs manifest remains a challenge with these memories. Indeed, the endurance constraints of NVMs can prevent solutions that are commonly employed for other mainstream memories like DRAM fr…

    Submitted 18 April, 2023; originally announced April 2023.

  7. arXiv:2302.02825  [pdf]

    cs.AR cs.DC

    Computation vs. Communication Scaling for Future Transformers on Future Hardware

    Authors: Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair

    Abstract: Scaling neural network models has delivered dramatic quality gains across ML problems. However, this scaling has increased the reliance on efficient distributed training techniques. Accordingly, as with other distributed computing scenarios, it is important to understand how compute and communication will scale relative to one another as models scale and hardware evolves. A careful study which ans…

    Submitted 2 May, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

    ACM Class: C.4; C.2.4

  8. arXiv:2104.08335  [pdf]

    cs.AR cs.DC cs.LG

    Demystifying BERT: Implications for Accelerator Design

    Authors: Suchita Pati, Shaizeen Aga, Nuwan Jayasena, Matthew D. Sinclair

    Abstract: Transfer learning in natural language processing (NLP), as realized using models like BERT (Bi-directional Encoder Representation from Transformer), has significantly improved language representation with models that can tackle challenging language problems. Consequently, these applications are driving the requirements of future systems. Thus, we focus on BERT, one of the most popular NLP transfer…

    Submitted 13 April, 2021; originally announced April 2021.

    ACM Class: C.3; C.4

  9. arXiv:2007.10459  [pdf]

    cs.DC

    SeqPoint: Identifying Representative Iterations of Sequence-based Neural Networks

    Authors: Suchita Pati, Shaizeen Aga, Matthew D. Sinclair, Nuwan Jayasena

    Abstract: The ubiquity of deep neural networks (DNNs) continues to rise, making them a crucial application class for hardware optimizations. However, detailed profiling and characterization of DNN training remains difficult as these applications often run for hours to days on real hardware. Prior works exploit the iterative nature of DNNs to profile a few training iterations. While such a strategy is sound…

    Submitted 20 July, 2020; originally announced July 2020.

    Comments: To appear in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2020)

    ACM Class: C.4