Skip to main content

Showing 1–4 of 4 results for author: Gulavani, B

.
  1. arXiv:2405.05465  [pdf, other

    cs.LG cs.AI cs.CL

    Vidur: A Large-Scale Simulation Framework For LLM Inference

    Authors: Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav Gulavani, Ramachandran Ramjee, Alexey Tumanov

    Abstract: Optimizing the deployment of Large language models (LLMs) is expensive today since it requires experimentally running an application workload against an LLM implementation while exploring large configuration space formed by system knobs such as parallelization strategies, batching techniques, and scheduling policies. To address this challenge, we present Vidur - a large-scale, high-fidelity, easil… ▽ More

    Submitted 21 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

  2. arXiv:2403.02310  [pdf, other

    cs.LG cs.DC

    Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

    Authors: Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee

    Abstract: Each LLM serving request goes through two phases. The first is prefill which processes the entire input prompt and produces the first output token and the second is decode which generates the rest of output tokens, one-at-a-time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt. In contrast, decode iterations have low latency but also low… ▽ More

    Submitted 17 June, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

  3. arXiv:2308.16369  [pdf, other

    cs.LG cs.DC

    SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

    Authors: Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee

    Abstract: Large Language Model (LLM) inference consists of two distinct phases - prefill phase which processes the input prompt and decode phase which generates output tokens autoregressively. While the prefill phase effectively saturates GPU compute at small batch sizes, the decode phase results in low compute utilization as it generates one token at a time per request. The varying prefill and decode times… ▽ More

    Submitted 30 August, 2023; originally announced August 2023.

  4. arXiv:2202.07848  [pdf, other

    cs.DC cs.AI

    Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads

    Authors: Dharma Shukla, Muthian Sivathanu, Srinidhi Viswanatha, Bhargav Gulavani, Rimma Nehme, Amey Agrawal, Chen Chen, Nipun Kwatra, Ramachandran Ramjee, Pankaj Sharma, Atul Katiyar, Vipul Modi, Vaibhav Sharma, Abhishek Singh, Shreshth Singhal, Kaustubh Welankar, Lu Xun, Ravi Anupindi, Karthik Elangovan, Hasibur Rahman, Zhou Lin, Rahul Seetharaman, Cheng Xu, Eddie Ailijiang, Suresh Krishnappa , et al. (1 additional authors not shown)

    Abstract: Lowering costs by driving high utilization across deep learning workloads is a crucial lever for cloud providers. We present Singularity, Microsoft's globally distributed scheduling service for highly-efficient and reliable execution of deep learning training and inference workloads. At the heart of Singularity is a novel, workload-aware scheduler that can transparently preempt and elastically sca… ▽ More

    Submitted 21 February, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

    Comments: Revision: Fixed some typos