Showing 1–6 of 6 results for author: Jayasena, N

  1. arXiv:2401.16677  [pdf, other]

    cs.AR cs.DC cs.LG

    T3: Transparent Tracking & Triggering for Fine-grained Overlap of Compute & Collectives

    Authors: Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair

    Abstract: Large Language Models increasingly rely on distributed techniques for their training and inference. These techniques require communication across devices which can reduce scaling efficiency as the number of devices increases. While some distributed techniques can overlap, and thus, hide this communication with independent computations, techniques such as Tensor Parallelism (TP) inherently serializ…

    Submitted 29 January, 2024; originally announced January 2024.

    Comments: To appear at the International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) 2024

    ACM Class: C.2.4; C.1.2

  2. arXiv:2309.07984  [pdf, other]

    cs.AR

    Inclusive-PIM: Hardware-Software Co-design for Broad Acceleration on Commercial PIM Architectures

    Authors: Johnathan Alsop, Shaizeen Aga, Mohamed Ibrahim, Mahzabeen Islam, Andrew Mccrabb, Nuwan Jayasena

    Abstract: Continual demand for memory bandwidth has made it worthwhile for memory vendors to reassess processing in memory (PIM), which enables higher bandwidth by placing compute units in/near-memory. As such, memory vendors have recently proposed commercially viable PIM designs. However, these proposals are largely driven by the needs of (a narrow set of) machine learning (ML) primitives. While such propo…

    Submitted 17 January, 2024; v1 submitted 14 September, 2023; originally announced September 2023.

  3. arXiv:2302.02825  [pdf]

    cs.AR cs.DC

    Computation vs. Communication Scaling for Future Transformers on Future Hardware

    Authors: Suchita Pati, Shaizeen Aga, Mahzabeen Islam, Nuwan Jayasena, Matthew D. Sinclair

    Abstract: Scaling neural network models has delivered dramatic quality gains across ML problems. However, this scaling has increased the reliance on efficient distributed training techniques. Accordingly, as with other distributed computing scenarios, it is important to understand how will compute and communication scale relative to one another as models scale and hardware evolves? A careful study which ans…

    Submitted 2 May, 2023; v1 submitted 6 February, 2023; originally announced February 2023.

    ACM Class: C.4; C.2.4

  4. arXiv:2104.08335  [pdf]

    cs.AR cs.DC cs.LG

    Demystifying BERT: Implications for Accelerator Design

    Authors: Suchita Pati, Shaizeen Aga, Nuwan Jayasena, Matthew D. Sinclair

    Abstract: Transfer learning in natural language processing (NLP), as realized using models like BERT (Bi-directional Encoder Representation from Transformer), has significantly improved language representation with models that can tackle challenging language problems. Consequently, these applications are driving the requirements of future systems. Thus, we focus on BERT, one of the most popular NLP transfer…

    Submitted 13 April, 2021; originally announced April 2021.

    ACM Class: C.3; C.4

  5. arXiv:2007.10459  [pdf]

    cs.DC

    SeqPoint: Identifying Representative Iterations of Sequence-based Neural Networks

    Authors: Suchita Pati, Shaizeen Aga, Matthew D. Sinclair, Nuwan Jayasena

    Abstract: The ubiquity of deep neural networks (DNNs) continues to rise, making them a crucial application class for hardware optimizations. However, detailed profiling and characterization of DNN training remains difficult as these applications often run for hours to days on real hardware. Prior works exploit the iterative nature of DNNs to profile a few training iterations. While such a strategy is sound…

    Submitted 20 July, 2020; originally announced July 2020.

    Comments: To appear in IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS 2020)

    ACM Class: C.4

  6. CODA: Enabling Co-location of Computation and Data for Near-Data Processing

    Authors: Hyojong Kim, Ramyad Hadidi, Lifeng Nai, Hyesoon Kim, Nuwan Jayasena, Yasuko Eckert, Onur Kayiran, Gabriel H. Loh

    Abstract: Recent studies have demonstrated that near-data processing (NDP) is an effective technique for improving performance and energy efficiency of data-intensive workloads. However, leveraging NDP in realistic systems with multiple memory modules introduces a new challenge. In today's systems, where no computation occurs in memory modules, the physical address space is interleaved at a fine granularity…

    Submitted 25 October, 2017; originally announced October 2017.

    Comments: 14 pages, 16 figures

    Journal ref: ACM Transactions on Architecture and Code Optimization (TACO) Volume 15 Issue 3, October 2018 Article No. 32