Showing 1–9 of 9 results for author: Panwar, A

  1. arXiv:2405.05465

    cs.LG cs.AI cs.CL

    Vidur: A Large-Scale Simulation Framework For LLM Inference

    Authors: Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav Gulavani, Ramachandran Ramjee, Alexey Tumanov

    Abstract: Optimizing the deployment of large language models (LLMs) is expensive today since it requires experimentally running an application workload against an LLM implementation while exploring a large configuration space formed by system knobs such as parallelization strategies, batching techniques, and scheduling policies. To address this challenge, we present Vidur - a large-scale, high-fidelity, easil…

    Submitted 21 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

  2. arXiv:2405.04437

    cs.LG cs.OS

    vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

    Authors: Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar

    Abstract: Efficient use of GPU memory is essential for high throughput LLM inference. Prior systems reserved memory for the KV-cache ahead-of-time, resulting in wasted capacity due to internal fragmentation. Inspired by OS-based virtual memory systems, vLLM proposed PagedAttention to enable dynamic memory allocation for KV-cache. This approach eliminates fragmentation, enabling high-throughput LLM serving w…

    Submitted 7 May, 2024; originally announced May 2024.

    Comments: 15 pages, 12 figures, 8 tables

  3. arXiv:2403.02310

    cs.LG cs.DC

    Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

    Authors: Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee

    Abstract: Each LLM serving request goes through two phases. The first is prefill, which processes the entire input prompt and produces the first output token; the second is decode, which generates the remaining output tokens one at a time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt. In contrast, decode iterations have low latency but also low…

    Submitted 17 June, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

  4. arXiv:2312.16362

    cs.LG cs.CY cs.HC

    Keeping Teams in the Game: Predicting Dropouts in Online Problem-Based Learning Competition

    Authors: Aditya Panwar, Ashwin T S, Ramkumar Rajendran, Kavi Arya

    Abstract: Online learning and MOOCs have become increasingly popular in recent years, and the trend will continue, given the technology boom. There is a dire need to observe learners' behavior in these online courses, similar to what instructors do in a face-to-face classroom. Learners' strategies and activities become crucial to understanding their behavior. One major challenge in online courses is predict…

    Submitted 26 December, 2023; originally announced December 2023.

    ACM Class: K.3

    Journal ref: 31st International Conference on Computers in Education, Volume 1, 2023

  5. arXiv:2308.16369

    cs.LG cs.DC

    SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

    Authors: Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee

    Abstract: Large Language Model (LLM) inference consists of two distinct phases - prefill phase which processes the input prompt and decode phase which generates output tokens autoregressively. While the prefill phase effectively saturates GPU compute at small batch sizes, the decode phase results in low compute utilization as it generates one token at a time per request. The varying prefill and decode times…

    Submitted 30 August, 2023; originally announced August 2023.

  6. arXiv:2107.11845

    cs.CV cs.AI

    On-Device Content Moderation

    Authors: Anchal Pandey, Sukumar Moharana, Debi Prasanna Mohanty, Archit Panwar, Dewang Agarwal, Siva Prasad Thota

    Abstract: With the advent of the internet, not safe for work (NSFW) content moderation is a major problem today. Since smartphones are now part of the daily life of billions of people, it becomes even more important to have a solution which could detect and alert the user to potential NSFW content present on their phone. In this paper we present a novel on-device solution for detecting NSFW images. In addition to conve…

    Submitted 25 July, 2021; originally announced July 2021.

  7. arXiv:2105.08463

    cs.CV cs.LG

    Unsupervised Compound Domain Adaptation for Face Anti-Spoofing

    Authors: Ankush Panwar, Pratyush Singh, Suman Saha, Danda Pani Paudel, Luc Van Gool

    Abstract: We address the problem of face anti-spoofing, which aims to make face verification systems robust in real-world settings. The context of detecting live vs. spoofed face images may differ significantly in the target domain, when compared to that of the labeled source domain where the model is trained. Such differences may be caused by new and unknown spoof types, illumination conditions, scen…

    Submitted 18 May, 2021; originally announced May 2021.

    Comments: 9 pages, 6 figures

  8. arXiv:2011.12092

    cs.OS cs.AR cs.PF

    Leveraging Architectural Support of Three Page Sizes with Trident

    Authors: Venkat Sri Sai Ram, Ashish Panwar, Arkaprava Basu

    Abstract: Large pages are commonly deployed to reduce address translation overheads for big-memory workloads. Modern x86-64 processors from Intel and AMD support two large page sizes -- 1GB and 2MB. However, previous works on large pages have primarily focused on 2MB pages, partly due to lack of substantial evidence on the profitability of 1GB pages to real-world applications. We argue that in fact, inadequ…

    Submitted 24 November, 2020; originally announced November 2020.

    Comments: 13 pages, 16 figures, 5 tables

    ACM Class: D.4

  9. arXiv:1910.05398

    cs.OS cs.AR cs.PF

    Mitosis: Transparently Self-Replicating Page-Tables for Large-Memory Machines

    Authors: Reto Achermann, Ashish Panwar, Abhishek Bhattacharjee, Timothy Roscoe, Jayneel Gandhi

    Abstract: Multi-socket machines with 1-100 TBs of physical memory are becoming prevalent. Applications running on multi-socket machines suffer non-uniform bandwidth and latency when accessing physical memory. Decades of research have focused on data allocation and placement policies in NUMA settings, but there have been no studies on the question of how to place page-tables amongst sockets. We make the case…

    Submitted 8 November, 2019; v1 submitted 11 October, 2019; originally announced October 2019.