Showing 1–12 of 12 results for author: Ramjee, R

  1. arXiv:2405.05465  [pdf, other]

    cs.LG cs.AI cs.CL

    Vidur: A Large-Scale Simulation Framework For LLM Inference

    Authors: Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav Gulavani, Ramachandran Ramjee, Alexey Tumanov

    Abstract: Optimizing the deployment of Large language models (LLMs) is expensive today since it requires experimentally running an application workload against an LLM implementation while exploring large configuration space formed by system knobs such as parallelization strategies, batching techniques, and scheduling policies. To address this challenge, we present Vidur - a large-scale, high-fidelity, easil…

    Submitted 21 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

  2. arXiv:2405.04437  [pdf, other]

    cs.LG cs.OS

    vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

    Authors: Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar

    Abstract: Efficient use of GPU memory is essential for high throughput LLM inference. Prior systems reserved memory for the KV-cache ahead-of-time, resulting in wasted capacity due to internal fragmentation. Inspired by OS-based virtual memory systems, vLLM proposed PagedAttention to enable dynamic memory allocation for KV-cache. This approach eliminates fragmentation, enabling high-throughput LLM serving w…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: 15 pages, 12 figures, 8 tables

  3. arXiv:2403.02310  [pdf, other]

    cs.LG cs.DC

    Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

    Authors: Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee

    Abstract: Each LLM serving request goes through two phases. The first is prefill which processes the entire input prompt and produces the first output token and the second is decode which generates the rest of output tokens, one-at-a-time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt. In contrast, decode iterations have low latency but also low…

    Submitted 17 June, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

  4. arXiv:2308.16369  [pdf, other]

    cs.LG cs.DC

    SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

    Authors: Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee

    Abstract: Large Language Model (LLM) inference consists of two distinct phases - prefill phase which processes the input prompt and decode phase which generates output tokens autoregressively. While the prefill phase effectively saturates GPU compute at small batch sizes, the decode phase results in low compute utilization as it generates one token at a time per request. The varying prefill and decode times…

    Submitted 30 August, 2023; originally announced August 2023.

  5. arXiv:2207.04452  [pdf, other]

    cs.LG cs.IR

    NGAME: Negative Mining-aware Mini-batching for Extreme Classification

    Authors: Kunal Dahiya, Nilesh Gupta, Deepak Saini, Akshay Soni, Yajun Wang, Kushal Dave, Jian Jiao, Gururaj K, Prasenjit Dey, Amit Singh, Deepesh Hada, Vidit Jain, Bhawna Paliwal, Anshul Mittal, Sonu Mehta, Ramachandran Ramjee, Sumeet Agarwal, Purushottam Kar, Manik Varma

    Abstract: Extreme Classification (XC) seeks to tag data points with the most relevant subset of labels from an extremely large label set. Performing deep XC with dense, learnt representations for data points and labels has attracted much attention due to its superiority over earlier XC methods that used sparse, hand-crafted features. Negative mining techniques have emerged as a critical component of all dee…

    Submitted 10 July, 2022; originally announced July 2022.

  6. arXiv:2202.07848  [pdf, other]

    cs.DC cs.AI

    Singularity: Planet-Scale, Preemptive and Elastic Scheduling of AI Workloads

    Authors: Dharma Shukla, Muthian Sivathanu, Srinidhi Viswanatha, Bhargav Gulavani, Rimma Nehme, Amey Agrawal, Chen Chen, Nipun Kwatra, Ramachandran Ramjee, Pankaj Sharma, Atul Katiyar, Vipul Modi, Vaibhav Sharma, Abhishek Singh, Shreshth Singhal, Kaustubh Welankar, Lu Xun, Ravi Anupindi, Karthik Elangovan, Hasibur Rahman, Zhou Lin, Rahul Seetharaman, Cheng Xu, Eddie Ailijiang, Suresh Krishnappa, et al. (1 additional author not shown)

    Abstract: Lowering costs by driving high utilization across deep learning workloads is a crucial lever for cloud providers. We present Singularity, Microsoft's globally distributed scheduling service for highly-efficient and reliable execution of deep learning training and inference workloads. At the heart of Singularity is a novel, workload-aware scheduler that can transparently preempt and elastically sca…

    Submitted 21 February, 2022; v1 submitted 15 February, 2022; originally announced February 2022.

    Comments: Revision: Fixed some typos

  7. arXiv:2111.04007  [pdf, other]

    cs.DC

    Varuna: Scalable, Low-cost Training of Massive Deep Learning Models

    Authors: Sanjith Athlur, Nitika Saran, Muthian Sivathanu, Ramachandran Ramjee, Nipun Kwatra

    Abstract: Systems for training massive deep learning models (billions of parameters) today assume and require specialized "hyper-clusters": hundreds or thousands of GPUs wired with specialized high-bandwidth interconnects such as NV-Link and Infiniband. Besides being expensive, such dependence on hyper-clusters and custom high-speed inter-connects limits the size of such clusters, creating (a) scalability l…

    Submitted 15 November, 2021; v1 submitted 7 November, 2021; originally announced November 2021.

    Comments: 14 pages, 10 figures

  8. arXiv:2105.14526  [pdf, other]

    cs.LG

    LRTuner: A Learning Rate Tuner for Deep Neural Networks

    Authors: Nikhil Iyer, V Thejas, Nipun Kwatra, Ramachandran Ramjee, Muthian Sivathanu

    Abstract: One very important hyperparameter for training deep neural networks is the learning rate schedule of the optimizer. The choice of learning rate schedule determines the computational cost of getting close to a minima, how close you actually get to the minima, and most importantly the kind of local minima (wide/narrow) attained. The kind of minima attained has a significant impact on the generalizat…

    Submitted 30 May, 2021; originally announced May 2021.

    Comments: 17 pages

  9. arXiv:2003.03977  [pdf, other]

    cs.LG stat.ML

    Wide-minima Density Hypothesis and the Explore-Exploit Learning Rate Schedule

    Authors: Nikhil Iyer, V Thejas, Nipun Kwatra, Ramachandran Ramjee, Muthian Sivathanu

    Abstract: Several papers argue that wide minima generalize better than narrow minima. In this paper, through detailed experiments that not only corroborate the generalization properties of wide minima, we also provide empirical evidence for a new hypothesis that the density of wide minima is likely lower than the density of narrow minima. Further, motivated by this hypothesis, we design a novel explore-expl…

    Submitted 1 June, 2021; v1 submitted 9 March, 2020; originally announced March 2020.

    Comments: 34 pages

  10. arXiv:1810.00602  [pdf, other]

    cs.CR cs.AI cs.CV

    Privado: Practical and Secure DNN Inference with Enclaves

    Authors: Karan Grover, Shruti Tople, Shweta Shinde, Ranjita Bhagwan, Ramachandran Ramjee

    Abstract: Cloud providers are extending support for trusted hardware primitives such as Intel SGX. Simultaneously, the field of deep learning is seeing enormous innovation as well as an increase in adoption. In this paper, we ask a timely question: "Can third-party cloud services use Intel SGX enclaves to provide practical, yet secure DNN Inference-as-a-service?" We first demonstrate that DNN models executi…

    Submitted 5 September, 2019; v1 submitted 1 October, 2018; originally announced October 2018.

    Comments: 13 pages, 5 figures

  11. arXiv:1603.08461  [pdf]

    cs.NI

    Perspectives on Software-Defined Networks: interviews with five leading scientists from the networking community

    Authors: Daniel M Batista, Gordon Blair, Fabio Kon, Raouf Boutaba, David Hutchison, Raj Jain, Ramachandran Ramjee, Christian E Rothenberg

    Abstract: Software defined Networks (SDNs) have drawn much attention both from academia and industry over the last few years. Despite the fact that underlying ideas already exist through areas such as P2P applications and active networks (e.g. virtual topologies and dynamic changes of the network via software), only now has the technology evolved to a point where it is possible to scale the implementations,…

    Submitted 28 March, 2016; originally announced March 2016.

    Journal ref: Journal of Internet Services and Applications, Springer London, October 2015, Print ISSN: 1867-4828, Online ISSN: 1869-0238, Vol. 6, No. 1, pp. 1867-4828, DOI: 10.1186/s13174-015-0035-3

  12. arXiv:0909.3717  [pdf, ps, other]

    cs.NI cs.PF

    Analytical Models for Energy Consumption in Infrastructure WLAN STAs Carrying TCP Traffic

    Authors: Pranav Agrawal, Anurag Kumar, Joy Kuri, Manoj Panda, Vishnu Navda, Ramachandran Ramjee, Venkata N. Padmanabhan

    Abstract: We develop analytical models for estimating the energy spent by stations (STAs) in infrastructure WLANs when performing TCP controlled file downloads. We focus on the energy spent in radio communication when the STAs are in the Continuously Active Mode (CAM), or in the static Power Save Mode (PSM). Our approach is to develop accurate models for obtaining the fraction of times the STA radios spen…

    Submitted 6 October, 2009; v1 submitted 21 September, 2009; originally announced September 2009.