Search | arXiv e-print repository

Vidur: A Large-Scale Simulation Framework For LLM Inference

Authors: Amey Agrawal, Nitin Kedia, Jayashree Mohan, Ashish Panwar, Nipun Kwatra, Bhargav Gulavani, Ramachandran Ramjee, Alexey Tumanov

Abstract: Optimizing the deployment of Large language models (LLMs) is expensive today since it requires experimentally running an application workload against an LLM implementation while exploring large configuration space formed by system knobs such as parallelization strategies, batching techniques, and scheduling policies. To address this challenge, we present Vidur - a large-scale, high-fidelity, easil… ▽ More Optimizing the deployment of Large language models (LLMs) is expensive today since it requires experimentally running an application workload against an LLM implementation while exploring large configuration space formed by system knobs such as parallelization strategies, batching techniques, and scheduling policies. To address this challenge, we present Vidur - a large-scale, high-fidelity, easily-extensible simulation framework for LLM inference performance. Vidur models the performance of LLM operators using a combination of experimental profiling and predictive modeling, and evaluates the end-to-end inference performance for different workloads by estimating several metrics of interest such as latency and throughput. We validate the fidelity of Vidur on several LLMs and show that it estimates inference latency with less than 9% error across the range. Further, we present Vidur-Search, a configuration search tool that helps optimize LLM deployment. Vidur-Search uses Vidur to automatically identify the most cost-effective deployment configuration that meets application performance constraints. For example, Vidur-Search finds the best deployment configuration for LLaMA2-70B in one hour on a CPU machine, in contrast to a deployment-based exploration which would require 42K GPU hours - costing ~218K dollars. Source code for Vidur is available at https://github.com/microsoft/vidur. △ Less

Submitted 21 May, 2024; v1 submitted 8 May, 2024; originally announced May 2024.

arXiv:2405.04437 [pdf, other]

vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention

Authors: Ramya Prabhu, Ajay Nayak, Jayashree Mohan, Ramachandran Ramjee, Ashish Panwar

Abstract: Efficient use of GPU memory is essential for high throughput LLM inference. Prior systems reserved memory for the KV-cache ahead-of-time, resulting in wasted capacity due to internal fragmentation. Inspired by OS-based virtual memory systems, vLLM proposed PagedAttention to enable dynamic memory allocation for KV-cache. This approach eliminates fragmentation, enabling high-throughput LLM serving w… ▽ More Efficient use of GPU memory is essential for high throughput LLM inference. Prior systems reserved memory for the KV-cache ahead-of-time, resulting in wasted capacity due to internal fragmentation. Inspired by OS-based virtual memory systems, vLLM proposed PagedAttention to enable dynamic memory allocation for KV-cache. This approach eliminates fragmentation, enabling high-throughput LLM serving with larger batch sizes. However, to be able to allocate physical memory dynamically, PagedAttention changes the layout of KV-cache from contiguous virtual memory to non-contiguous virtual memory. This change requires attention kernels to be rewritten to support paging, and serving framework to implement a memory manager. Thus, the PagedAttention model leads to software complexity, portability issues, redundancy and inefficiency. In this paper, we propose vAttention for dynamic KV-cache memory management. In contrast to PagedAttention, vAttention retains KV-cache in contiguous virtual memory and leverages low-level system support for demand paging, that already exists, to enable on-demand physical memory allocation. Thus, vAttention unburdens the attention kernel developer from having to explicitly support paging and avoids re-implementation of memory management in the serving framework. We show that vAttention enables seamless dynamic memory management for unchanged implementations of various attention kernels. vAttention also generates tokens up to 1.97x faster than vLLM, while processing input prompts up to 3.92x and 1.45x faster than the PagedAttention variants of FlashAttention and FlashInfer. △ Less

Submitted 7 May, 2024; originally announced May 2024.

Comments: 15 pages, 12 figures, 8 tables

arXiv:2403.02310 [pdf, other]

Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve

Authors: Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Alexey Tumanov, Ramachandran Ramjee

Abstract: Each LLM serving request goes through two phases. The first is prefill which processes the entire input prompt and produces the first output token and the second is decode which generates the rest of output tokens, one-at-a-time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt. In contrast, decode iterations have low latency but also low… ▽ More Each LLM serving request goes through two phases. The first is prefill which processes the entire input prompt and produces the first output token and the second is decode which generates the rest of output tokens, one-at-a-time. Prefill iterations have high latency but saturate GPU compute due to parallel processing of the input prompt. In contrast, decode iterations have low latency but also low compute utilization because a decode iteration processes only a single token per request. This makes batching highly effective for decodes and consequently for overall throughput. However, batching multiple requests leads to an interleaving of prefill and decode iterations which makes it challenging to achieve both high throughput and low latency. We introduce an efficient LLM inference scheduler, Sarathi-Serve, to address this throughput-latency tradeoff. Sarathi-Serve introduces chunked-prefills which splits a prefill request into near equal sized chunks and creates stall-free schedules that adds new requests in a batch without pausing ongoing decodes. Stall-free scheduling unlocks the opportunity to improve throughput with large batch sizes while minimizing the effect of batching on latency. Furthermore, uniform batches in Sarathi-Serve ameliorate the imbalance between iterations resulting in minimal pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware under tail latency constraints. For Mistral-7B on single A100 GPUs, we achieve 2.6x higher serving capacity and up to 3.7x higher serving capacity for the Yi-34B model on two A100 GPUs as compared to vLLM. When used with pipeline parallelism on Falcon-180B, Sarathi-Serve provides up to 5.6x gain in the end-to-end serving capacity. The source code for Sarathi-Serve is available at https://github.com/microsoft/sarathi-serve. △ Less

Submitted 17 June, 2024; v1 submitted 4 March, 2024; originally announced March 2024.

arXiv:2312.16362 [pdf]

Kee** Teams in the Game: Predicting Dropouts in Online Problem-Based Learning Competition

Authors: Aditya Panwar, Ashwin T S, Ramkumar Rajendran, Kavi Arya

Abstract: Online learning and MOOCs have become increasingly popular in recent years, and the trend will continue, given the technology boom. There is a dire need to observe learners' behavior in these online courses, similar to what instructors do in a face-to-face classroom. Learners' strategies and activities become crucial to understanding their behavior. One major challenge in online courses is predict… ▽ More Online learning and MOOCs have become increasingly popular in recent years, and the trend will continue, given the technology boom. There is a dire need to observe learners' behavior in these online courses, similar to what instructors do in a face-to-face classroom. Learners' strategies and activities become crucial to understanding their behavior. One major challenge in online courses is predicting and preventing dropout behavior. While several studies have tried to perform such analysis, there is still a shortage of studies that employ different data streams to understand and predict the drop rates. Moreover, studies rarely use a fully online team-based collaborative environment as their context. Thus, the current study employs an online longitudinal problem-based learning (PBL) collaborative robotics competition as the testbed. Through methodological triangulation, the study aims to predict dropout behavior via the contributions of Discourse discussion forum 'activities' of participating teams, along with a self-reported Online Learning Strategies Questionnaire (OSLQ). The study also uses Qualitative interviews to enhance the ground truth and results. The OSLQ data is collected from more than 4000 participants. Furthermore, the study seeks to establish the reliability of OSLQ to advance research within online environments. Various Machine Learning algorithms are applied to analyze the data. The findings demonstrate the reliability of OSLQ with our substantial sample size and reveal promising results for predicting the dropout rate in online competition. △ Less

Submitted 26 December, 2023; originally announced December 2023.

ACM Class: K.3

Journal ref: 31st International Conference on Computers in Education, Volume 1, 2023

arXiv:2308.16369 [pdf, other]

SARATHI: Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills

Authors: Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, Ramachandran Ramjee

Abstract: Large Language Model (LLM) inference consists of two distinct phases - prefill phase which processes the input prompt and decode phase which generates output tokens autoregressively. While the prefill phase effectively saturates GPU compute at small batch sizes, the decode phase results in low compute utilization as it generates one token at a time per request. The varying prefill and decode times… ▽ More Large Language Model (LLM) inference consists of two distinct phases - prefill phase which processes the input prompt and decode phase which generates output tokens autoregressively. While the prefill phase effectively saturates GPU compute at small batch sizes, the decode phase results in low compute utilization as it generates one token at a time per request. The varying prefill and decode times also lead to imbalance across micro-batches when using pipeline parallelism, resulting in further inefficiency due to bubbles. We present SARATHI to address these challenges. SARATHI employs chunked-prefills, which splits a prefill request into equal sized chunks, and decode-maximal batching, which constructs a batch using a single prefill chunk and populates the remaining slots with decodes. During inference, the prefill chunk saturates GPU compute, while the decode requests 'piggyback' and cost up to an order of magnitude less compared to a decode-only batch. Chunked-prefills allows constructing multiple decode-maximal batches from a single prefill request, maximizing coverage of decodes that can piggyback. Furthermore, the uniform compute design of these batches ameliorates the imbalance between micro-batches, significantly reducing pipeline bubbles. Our techniques yield significant improvements in inference performance across models and hardware. For the LLaMA-13B model on A6000 GPU, SARATHI improves decode throughput by up to 10x, and accelerates end-to-end throughput by up to 1.33x. For LLaMa-33B on A100 GPU, we achieve 1.25x higher end-to-end-throughput and up to 4.25x higher decode throughput. When used with pipeline parallelism on GPT-3, SARATHI reduces bubbles by 6.29x, resulting in an end-to-end throughput improvement of 1.91x. △ Less

Submitted 30 August, 2023; originally announced August 2023.

arXiv:2107.11845 [pdf, other]

On-Device Content Moderation

Authors: Anchal Pandey, Sukumar Moharana, Debi Prasanna Mohanty, Archit Panwar, Dewang Agarwal, Siva Prasad Thota

Abstract: With the advent of internet, not safe for work(NSFW) content moderation is a major problem today. Since,smartphones are now part of daily life of billions of people,it becomes even more important to have a solution which coulddetect and suggest user about potential NSFW content present ontheir phone. In this paper we present a novel on-device solutionfor detecting NSFW images. In addition to conve… ▽ More With the advent of internet, not safe for work(NSFW) content moderation is a major problem today. Since,smartphones are now part of daily life of billions of people,it becomes even more important to have a solution which coulddetect and suggest user about potential NSFW content present ontheir phone. In this paper we present a novel on-device solutionfor detecting NSFW images. In addition to conventional porno-graphic content moderation, we have also included semi-nudecontent moderation as it is still NSFW in a large demography.We have curated a dataset comprising of three major categories,namely nude, semi-nude and safe images. We have created anensemble of object detector and classifier for filtering of nudeand semi-nude contents. The solution provides unsafe body partannotations along with identification of semi-nude images. Weextensively tested our proposed solution on several public datasetand also on our custom dataset. The model achieves F1 scoreof 0.91 with 95% precision and 88% recall on our customNSFW16k dataset and 0.92 MAP on NPDI dataset. Moreover itachieves average 0.002 false positive rate on a collection of safeimage open datasets. △ Less

Submitted 25 July, 2021; originally announced July 2021.

arXiv:2105.08463 [pdf, other]

Unsupervised Compound Domain Adaptation for Face Anti-Spoofing

Authors: Ankush Panwar, Pratyush Singh, Suman Saha, Danda Pani Paudel, Luc Van Gool

Abstract: We address the problem of face anti-spoofing which aims to make the face verification systems robust in the real world settings. The context of detecting live vs. spoofed face images may differ significantly in the target domain, when compared to that of labeled source domain where the model is trained. Such difference may be caused due to new and unknown spoof types, illumination conditions, scen… ▽ More We address the problem of face anti-spoofing which aims to make the face verification systems robust in the real world settings. The context of detecting live vs. spoofed face images may differ significantly in the target domain, when compared to that of labeled source domain where the model is trained. Such difference may be caused due to new and unknown spoof types, illumination conditions, scene backgrounds, among many others. These varieties of differences make the target a compound domain, thus calling for the problem of the unsupervised compound domain adaptation. We demonstrate the effectiveness of the compound domain assumption for the task of face anti-spoofing, for the first time in this work. To this end, we propose a memory augmentation method for adapting the source model to the target domain in a domain aware manner. The adaptation process is further improved by using the curriculum learning and the domain agnostic source network training approaches. The proposed method successfully adapts to the compound target domain consisting multiple new spoof types. Our experiments on multiple benchmark datasets demonstrate the superiority of the proposed method over the state-of-the-art. △ Less

Submitted 18 May, 2021; originally announced May 2021.

Comments: 9 pages, 6 figures

arXiv:2011.12092 [pdf, other]

Leveraging Architectural Support of Three Page Sizes with Trident

Authors: Venkat Sri Sai Ram, Ashish Panwar, Arkaprava Basu

Abstract: Large pages are commonly deployed to reduce address translation overheads for big-memory workloads. Modern x86-64 processors from Intel and AMD support two large page sizes -- 1GB and 2MB. However, previous works on large pages have primarily focused on 2MB pages, partly due to lack of substantial evidence on the profitability of 1GB pages to real-world applications. We argue that in fact, inadequ… ▽ More Large pages are commonly deployed to reduce address translation overheads for big-memory workloads. Modern x86-64 processors from Intel and AMD support two large page sizes -- 1GB and 2MB. However, previous works on large pages have primarily focused on 2MB pages, partly due to lack of substantial evidence on the profitability of 1GB pages to real-world applications. We argue that in fact, inadequate system software support is responsible for a decade of underutilized hardware support for 1GB pages. Through extensive experimentation on a real system, we demonstrate that 1GB pages can improve performance over 2MB pages, and when used in tandem with 2MB pages for an important set of applications; the support for the latter is crucial but missing in current systems. Our design and implementation of \trident{} in Linux fully exploit hardware supported large pages by dynamically and transparently allocating 1GB, 2MB, and 4KB pages as deemed suitable. \trident{} speeds up eight memory-intensive applications by {$18\%$}, on average, over Linux's use of 2MB pages. We also propose \tridentpv{}, an extension to \trident{} that effectively virtualizes 1GB pages via copy-less promotion and compaction in the guest OS. Overall, this paper shows that even GB-sized pages have considerable practical significance with adequate software enablement, in turn motivating architects to continue investing/innovating in large pages. △ Less

Submitted 24 November, 2020; originally announced November 2020.

Comments: 13 pages, 16 figures, 5 tables

ACM Class: D.4

arXiv:1910.05398 [pdf, other]

Mitosis: Transparently Self-Replicating Page-Tables for Large-Memory Machines

Authors: Reto Achermann, Ashish Panwar, Abhishek Bhattacharjee, Timothy Roscoe, Jayneel Gandhi

Abstract: Multi-socket machines with 1-100 TBs of physical memory are becoming prevalent. Applications running on multi-socket machines suffer non-uniform bandwidth and latency when accessing physical memory. Decades of research have focused on data allocation and placement policies in NUMA settings, but there have been no studies on the question of how to place page-tables amongst sockets. We make the case… ▽ More Multi-socket machines with 1-100 TBs of physical memory are becoming prevalent. Applications running on multi-socket machines suffer non-uniform bandwidth and latency when accessing physical memory. Decades of research have focused on data allocation and placement policies in NUMA settings, but there have been no studies on the question of how to place page-tables amongst sockets. We make the case for explicit page-table allocation policies and show that page-table placement is becoming crucial to overall performance. We propose Mitosis to mitigate NUMA effects on page-table walks by transparently replicating and migrating page-tables across sockets without application changes. This reduces the frequency of accesses to remote NUMA nodes when performing page-table walks. Mitosis uses two components: (i) a mechanism to enable efficient page-table replication and migration; and (ii) policies for processes to efficiently manage and control page-table replication and migration. We implement Mitosis in Linux and evaluate its benefits on real hardware. Mitosis improves performance for large-scale multi-socket workloads by up to 1.34x by replicating page-tables across sockets. Moreover, it improves performance by up to 3.24x in cases when the OS migrates a process across sockets by enabling cross-socket page-table migration. △ Less

Submitted 8 November, 2019; v1 submitted 11 October, 2019; originally announced October 2019.

Showing 1–9 of 9 results for author: Panwar, A