Search | arXiv e-print repository

doi 10.1145/3620666.3651383

Optimal Kernel Orchestration for Tensor Programs with Korch

Authors: Muyan Hu, Ashwin Venkatram, Shreyashri Biswas, Balamurugan Marimuthu, Bohan Hou, Gabriele Oliaro, Haojie Wang, Liyan Zheng, Xupeng Miao, Jidong Zhai

Abstract: Kernel orchestration is the task of map** the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms. Prior approaches optimize kernel orchestration by greedily applying operator fusion, which fuses the computation of multiple operators into a single kernel, and miss a variety of optimization opportunities in kernel… ▽ More Kernel orchestration is the task of map** the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms. Prior approaches optimize kernel orchestration by greedily applying operator fusion, which fuses the computation of multiple operators into a single kernel, and miss a variety of optimization opportunities in kernel orchestration. This paper presents Korch, a tensor program optimizer that discovers optimal kernel orchestration strategies for tensor programs. Instead of directly fusing operators, Korch first applies operator fission to decompose tensor operators into a small set of basic tensor algebra primitives. This decomposition enables a diversity of fine-grained, inter-operator optimizations. Next, Korch optimizes kernel orchestration by formalizing it as a constrained optimization problem, leveraging an off-the-shelf binary linear programming solver to discover an optimal orchestration strategy, and generating an executable that can be directly deployed on modern GPU platforms. Evaluation on a variety of DNNs shows that Korch outperforms existing tensor program optimizers by up to 1.7x on V100 GPUs and up to 1.6x on A100 GPUs. Korch is publicly available at https://github.com/humuyan/Korch. △ Less

Submitted 13 June, 2024; originally announced June 2024.

Comments: Fix some typos in the ASPLOS version

Journal ref: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems 3 (2024) 755-769

arXiv:2402.18789 [pdf, other]

FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

Authors: Xupeng Miao, Gabriele Oliaro, Xinhao Cheng, Mengdi Wu, Colin Unger, Zhihao Jia

Abstract: Parameter-efficient finetuning (PEFT) is a widely used technique to adapt large language models for different tasks. Service providers typically create separate systems for users to perform PEFT model finetuning and inference tasks. This is because existing systems cannot handle workloads that include a mix of inference and PEFT finetuning requests. As a result, shared GPU resources are underutili… ▽ More Parameter-efficient finetuning (PEFT) is a widely used technique to adapt large language models for different tasks. Service providers typically create separate systems for users to perform PEFT model finetuning and inference tasks. This is because existing systems cannot handle workloads that include a mix of inference and PEFT finetuning requests. As a result, shared GPU resources are underutilized, leading to inefficiencies. To address this problem, we present FlexLLM, the first system that can serve inference and parameter-efficient finetuning requests in the same iteration. Our system leverages the complementary nature of these two tasks and utilizes shared GPU resources to run them jointly, using a method called co-serving. To achieve this, FlexLLM introduces a novel token-level finetuning mechanism, which breaks down the finetuning computation of a sequence into smaller token-level computations and uses dependent parallelization and graph pruning, two static compilation optimizations, to minimize the memory overhead and latency for co-serving. Compared to existing systems, FlexLLM's co-serving approach reduces the activation GPU memory overhead by up to 8x, and the end-to-end GPU memory requirement of finetuning by up to 36% while maintaining a low inference latency and improving finetuning throughput. For example, under a heavy inference workload, FlexLLM can still preserve more than 80% of the peak finetuning throughput, whereas existing systems cannot make any progress with finetuning. The source code of FlexLLM is publicly available at https://github.com/flexflow/FlexFlow. △ Less

Submitted 28 February, 2024; originally announced February 2024.

arXiv:2401.07159 [pdf, other]

Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models

Authors: Zhengxin Zhang, Dan Zhao, Xupeng Miao, Gabriele Oliaro, Qing Li, Yong Jiang, Zhihao Jia

Abstract: Finetuning large language models (LLMs) has been empirically effective on a variety of downstream tasks. Existing approaches to finetuning an LLM either focus on parameter-efficient finetuning, which only updates a small number of trainable parameters, or attempt to reduce the memory footprint during the training phase of the finetuning. Typically, the memory footprint during finetuning stems from… ▽ More Finetuning large language models (LLMs) has been empirically effective on a variety of downstream tasks. Existing approaches to finetuning an LLM either focus on parameter-efficient finetuning, which only updates a small number of trainable parameters, or attempt to reduce the memory footprint during the training phase of the finetuning. Typically, the memory footprint during finetuning stems from three contributors: model weights, optimizer states, and intermediate activations. However, existing works still require considerable memory and none can simultaneously mitigate memory footprint for all three sources. In this paper, we present Quantized Side Tuing (QST), which enables memory-efficient and fast finetuning of LLMs by operating through a dual-stage process. First, QST quantizes an LLM's model weights into 4-bit to reduce the memory footprint of the LLM's original weights; QST also introduces a side network separated from the LLM, which utilizes the hidden states of the LLM to make task-specific predictions. Using a separate side network avoids performing backpropagation through the LLM, thus reducing the memory requirement of the intermediate activations. Furthermore, QST leverages several low-rank adaptors and gradient-free downsample modules to significantly reduce the trainable parameters, so as to save the memory footprint of the optimizer states. Experiments show that QST can reduce the total memory footprint by up to 2.3 $\times$ and speed up the finetuning process by up to 3 $\times$ while achieving competent performance compared with the state-of-the-art. When it comes to full finetuning, QST can reduce the total memory footprint up to 7 $\times$. △ Less

Submitted 13 January, 2024; originally announced January 2024.

ACM Class: I.2.7

arXiv:2312.15234 [pdf, other]

Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

Authors: Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi **, Tianqi Chen, Zhihao Jia

Abstract: In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computational intensity and memory consumption of deploying these models present substantial challenges in terms of serving efficiency, particularly in scenarios demanding low latency and high throughput. This… ▽ More In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computational intensity and memory consumption of deploying these models present substantial challenges in terms of serving efficiency, particularly in scenarios demanding low latency and high throughput. This survey addresses the imperative need for efficient LLM serving methodologies from a machine learning system (MLSys) research perspective, standing at the crux of advanced AI innovations and practical system optimizations. We provide in-depth analysis, covering a spectrum of solutions, ranging from cutting-edge algorithmic modifications to groundbreaking changes in system designs. The survey aims to provide a comprehensive understanding of the current state and future directions in efficient LLM serving, offering valuable insights for researchers and practitioners in overcoming the barriers of effective LLM deployment, thereby resha** the future of AI. △ Less

Submitted 23 December, 2023; originally announced December 2023.

arXiv:2305.09781 [pdf, other]

doi 10.1145/3620666.3651335

SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification

Authors: Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia

Abstract: This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is leveraging small speculative models to predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token… ▽ More This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is leveraging small speculative models to predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token sequences represented by a token tree is verified against the LLM in parallel using a novel tree-based parallel decoding mechanism. SpecInfer uses an LLM as a token tree verifier instead of an incremental decoder, which significantly reduces the end-to-end latency and computational requirement for serving generative LLMs while provably preserving model quality. Our evaluation shows that SpecInfer outperforms existing LLM serving systems by 1.5-2.8x for distributed LLM inference and by 2.6-3.5x for offloading-based LLM inference, while preserving the same generative performance. SpecInfer is publicly available at https://github.com/flexflow/FlexFlow/ △ Less

Submitted 31 March, 2024; v1 submitted 16 May, 2023; originally announced May 2023.

Comments: ASPLOS'24

arXiv:2202.02270 [pdf, other]

Direct Telemetry Access

Authors: Jonatan Langlet, Ran Ben Basat, Gabriele Oliaro, Michael Mitzenmacher, Minlan Yu, Gianni Antichi

Abstract: Fine-grained network telemetry is becoming a modern datacenter standard and is the basis of essential applications such as congestion control, load balancing, and advanced troubleshooting. As network size increases and telemetry gets more fine-grained, there is a tremendous growth in the amount of data needed to be reported from switches to collectors to enable network-wide view. As a consequence,… ▽ More Fine-grained network telemetry is becoming a modern datacenter standard and is the basis of essential applications such as congestion control, load balancing, and advanced troubleshooting. As network size increases and telemetry gets more fine-grained, there is a tremendous growth in the amount of data needed to be reported from switches to collectors to enable network-wide view. As a consequence, it is progressively hard to scale data collection systems. We introduce Direct Telemetry Access (DTA), a solution optimized for aggregating and moving hundreds of millions of reports per second from switches into queryable data structures in collectors' memory. DTA is lightweight and it is able to greatly reduce overheads at collectors. DTA is built on top of RDMA, and we propose novel and expressive reporting primitives to allow easy integration with existing state-of-the-art telemetry mechanisms such as INT or Marple. We show that DTA significantly improves telemetry collection rates. For example, when used with INT, it can collect and aggregate over 400M reports per second with a single server, improving over the Atomic MultiLog by up to $16$x. △ Less

Submitted 10 August, 2023; v1 submitted 4 February, 2022; originally announced February 2022.

Comments: As appearing in the proceedings of ACM SIGCOMM'23

arXiv:2110.05438 [pdf, other]

doi 10.1145/3484266.3487366

Zero-CPU Collection with Direct Telemetry Access

Authors: Jonatan Langlet, Ran Ben Basat, Sivaramakrishnan Ramanathan, Gabriele Oliaro, Michael Mitzenmacher, Minlan Yu, Gianni Antichi

Abstract: Programmable switches are driving a massive increase in fine-grained measurements. This puts significant pressure on telemetry collectors that have to process reports from many switches. Past research acknowledged this problem by either improving collectors' stack performance or by limiting the amount of data sent from switches. In this paper, we take a different and radical approach: switches are… ▽ More Programmable switches are driving a massive increase in fine-grained measurements. This puts significant pressure on telemetry collectors that have to process reports from many switches. Past research acknowledged this problem by either improving collectors' stack performance or by limiting the amount of data sent from switches. In this paper, we take a different and radical approach: switches are responsible for directly inserting queryable telemetry data into the collectors' memory, bypassing their CPU, and thereby improving their collection scalability. We propose to use a method we call \emph{direct telemetry access}, where switches jointly write telemetry reports directly into the same collector's memory region, without coordination. Our solution, DART, is probabilistic, trading memory redundancy and query success probability for CPU resources at collectors. We prototype DART using commodity hardware such as P4 switches and RDMA NICs and show that we get high query success rates with a reasonable memory overhead. For example, we can collect INT path tracing information on a fat tree topology without a collector's CPU involvement while achieving 99.9\% query success probability and using just 300 bytes per flow. △ Less

Submitted 11 October, 2021; originally announced October 2021.

Comments: To appear in ACM HotNets 2021

arXiv:cs/0111055 [pdf]

Overview of the NSTX Control System

Authors: P. Sichta, J. Dong, G. Oliaro, P. Roney

Abstract: The National Spherical Torus Experiment (NSTX) is an innovative magnetic fusion device that was constructed by the Princeton Plasma Physics Laboratory (PPPL) in collaboration with the Oak Ridge National Laboratory, Columbia University, and the University of Washington at Seattle. Since achieving first plasma in 1999, the device has been used for fusion research through an international collabora… ▽ More The National Spherical Torus Experiment (NSTX) is an innovative magnetic fusion device that was constructed by the Princeton Plasma Physics Laboratory (PPPL) in collaboration with the Oak Ridge National Laboratory, Columbia University, and the University of Washington at Seattle. Since achieving first plasma in 1999, the device has been used for fusion research through an international collaboration of over twenty institutions. The NSTX is operated through a collection of control systems that encompass a wide range of technology, from hardwired relay controls to real-time control systems with giga-FLOPS of capability. This paper presents a broad introduction to the control systems used on NSTX, with an emphasis on the computing controls, data acquisition, and synchronization systems. △ Less

Submitted 10 December, 2001; v1 submitted 20 November, 2001; originally announced November 2001.

Comments: 3 PDF pages, 8th International Conference on Accelerator and Large Experimental Physics Control Systems (PSN TUBT004), San Jose, CA, USA, November 27-30

ACM Class: C.3; J.7

Journal ref: eConf C011127 (2001) TUBT004

Showing 1–8 of 8 results for author: Oliaro, G