Showing 1–3 of 3 results for author: Rafique, M M

Searching in archive cs.
  1. Breaking the Memory Wall: A Study of I/O Patterns and GPU Memory Utilization for Hybrid CPU-GPU Offloaded Optimizers

    Authors: Avinash Maurya, Jie Ye, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae

    Abstract: Transformers and LLMs have seen rapid adoption in all domains. Their sizes have exploded to hundreds of billions of parameters and keep increasing. Under these circumstances, the training of transformers is slow and often takes in the order of weeks or months. Thanks to 3D model parallelism (data, pipeline, and tensor-level parallelism), the training can scale to a large number of GPUs, which redu…

    Submitted 15 June, 2024; originally announced June 2024.

    Comments: Accepted at FlexScience'24 Workshop on AI and Scientific Computing at Scale using Flexible Computing Infrastructures (co-located with HPDC'24)

  2. DataStates-LLM: Lazy Asynchronous Checkpointing for Large Language Models

    Authors: Avinash Maurya, Robert Underwood, M. Mustafa Rafique, Franck Cappello, Bogdan Nicolae

    Abstract: LLMs have seen rapid adoption in all domains. They need to be trained on high-end high-performance computing (HPC) infrastructures and ingest massive amounts of input data. Unsurprisingly, at such a large scale, unexpected events (e.g., failures of components, instability of the software, undesirable learning patterns, etc.), are frequent and typically impact the training in a negative fashion. Th…

    Submitted 15 June, 2024; originally announced June 2024.

    Comments: Published at HPDC '24: The 33rd International Symposium on High-Performance Parallel and Distributed Computing. Source code at https://github.com/DataStates/datastates-llm

  3. arXiv:2309.01662   

    cs.DC cs.PF

    Towards Persistent Memory based Stateful Serverless Computing for Big Data Applications

    Authors: Yuze Li, Kevin Assogba, Abhijit Tripathy, Moiz Arif, M. Mustafa Rafique, Ali R. Butt, Dimitrios Nikolopoulos

    Abstract: The Function-as-a-service (FaaS) computing model has recently seen significant growth especially for highly scalable, event-driven applications. The easy-to-deploy and cost-efficient fine-grained billing of FaaS is highly attractive to big data applications. However, the stateless nature of serverless platforms poses major challenges when supporting stateful I/O intensive workloads such as a lack…

    Submitted 8 September, 2023; v1 submitted 4 September, 2023; originally announced September 2023.

    Comments: Not yet ready to be publicly available