Showing 1–8 of 8 results for author: Oliaro, G

  1. Optimal Kernel Orchestration for Tensor Programs with Korch

    Authors: Muyan Hu, Ashwin Venkatram, Shreyashri Biswas, Balamurugan Marimuthu, Bohan Hou, Gabriele Oliaro, Haojie Wang, Liyan Zheng, Xupeng Miao, Jidong Zhai

    Abstract: Kernel orchestration is the task of mapping the computation defined in different operators of a deep neural network (DNN) to the execution of GPU kernels on modern hardware platforms. Prior approaches optimize kernel orchestration by greedily applying operator fusion, which fuses the computation of multiple operators into a single kernel, and miss a variety of optimization opportunities in kernel…

    Submitted 13 June, 2024; originally announced June 2024.

    Comments: Fix some typos in the ASPLOS version

    Journal ref: Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems 3 (2024) 755-769

  2. arXiv:2402.18789  [pdf, other]

    cs.DC cs.CL cs.LG

    FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning

    Authors: Xupeng Miao, Gabriele Oliaro, Xinhao Cheng, Mengdi Wu, Colin Unger, Zhihao Jia

    Abstract: Parameter-efficient finetuning (PEFT) is a widely used technique to adapt large language models for different tasks. Service providers typically create separate systems for users to perform PEFT model finetuning and inference tasks. This is because existing systems cannot handle workloads that include a mix of inference and PEFT finetuning requests. As a result, shared GPU resources are underutili…

    Submitted 28 February, 2024; originally announced February 2024.

  3. arXiv:2401.07159  [pdf, other]

    cs.LG cs.AI

    Quantized Side Tuning: Fast and Memory-Efficient Tuning of Quantized Large Language Models

    Authors: Zhengxin Zhang, Dan Zhao, Xupeng Miao, Gabriele Oliaro, Qing Li, Yong Jiang, Zhihao Jia

    Abstract: Finetuning large language models (LLMs) has been empirically effective on a variety of downstream tasks. Existing approaches to finetuning an LLM either focus on parameter-efficient finetuning, which only updates a small number of trainable parameters, or attempt to reduce the memory footprint during the training phase of the finetuning. Typically, the memory footprint during finetuning stems from…

    Submitted 13 January, 2024; originally announced January 2024.

    ACM Class: I.2.7

  4. arXiv:2312.15234  [pdf, other]

    cs.LG cs.AI cs.DC cs.PF

    Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems

    Authors: Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Hongyi Jin, Tianqi Chen, Zhihao Jia

    Abstract: In the rapidly evolving landscape of artificial intelligence (AI), generative large language models (LLMs) stand at the forefront, revolutionizing how we interact with our data. However, the computational intensity and memory consumption of deploying these models present substantial challenges in terms of serving efficiency, particularly in scenarios demanding low latency and high throughput. This…

    Submitted 23 December, 2023; originally announced December 2023.

  5. arXiv:2305.09781  [pdf, other]

    cs.CL cs.DC cs.LG

    SpecInfer: Accelerating Generative Large Language Model Serving with Tree-based Speculative Inference and Verification

    Authors: Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, Zhihao Jia

    Abstract: This paper introduces SpecInfer, a system that accelerates generative large language model (LLM) serving with tree-based speculative inference and verification. The key idea behind SpecInfer is leveraging small speculative models to predict the LLM's outputs; the predictions are organized as a token tree, whose nodes each represent a candidate token sequence. The correctness of all candidate token…

    Submitted 31 March, 2024; v1 submitted 16 May, 2023; originally announced May 2023.

    Comments: ASPLOS'24

  6. arXiv:2202.02270  [pdf, other]

    cs.NI

    Direct Telemetry Access

    Authors: Jonatan Langlet, Ran Ben Basat, Gabriele Oliaro, Michael Mitzenmacher, Minlan Yu, Gianni Antichi

    Abstract: Fine-grained network telemetry is becoming a modern datacenter standard and is the basis of essential applications such as congestion control, load balancing, and advanced troubleshooting. As network size increases and telemetry gets more fine-grained, there is a tremendous growth in the amount of data needed to be reported from switches to collectors to enable network-wide view. As a consequence,…

    Submitted 10 August, 2023; v1 submitted 4 February, 2022; originally announced February 2022.

    Comments: As appearing in the proceedings of ACM SIGCOMM'23

  7. arXiv:2110.05438  [pdf, other]

    cs.NI cs.DS eess.SY

    Zero-CPU Collection with Direct Telemetry Access

    Authors: Jonatan Langlet, Ran Ben Basat, Sivaramakrishnan Ramanathan, Gabriele Oliaro, Michael Mitzenmacher, Minlan Yu, Gianni Antichi

    Abstract: Programmable switches are driving a massive increase in fine-grained measurements. This puts significant pressure on telemetry collectors that have to process reports from many switches. Past research acknowledged this problem by either improving collectors' stack performance or by limiting the amount of data sent from switches. In this paper, we take a different and radical approach: switches are…

    Submitted 11 October, 2021; originally announced October 2021.

    Comments: To appear in ACM HotNets 2021

  8. arXiv:cs/0111055  [pdf]

    cs.OH

    Overview of the NSTX Control System

    Authors: P. Sichta, J. Dong, G. Oliaro, P. Roney

    Abstract: The National Spherical Torus Experiment (NSTX) is an innovative magnetic fusion device that was constructed by the Princeton Plasma Physics Laboratory (PPPL) in collaboration with the Oak Ridge National Laboratory, Columbia University, and the University of Washington at Seattle. Since achieving first plasma in 1999, the device has been used for fusion research through an international collabora…

    Submitted 10 December, 2001; v1 submitted 20 November, 2001; originally announced November 2001.

    Comments: 3 PDF pages, 8th International Conference on Accelerator and Large Experimental Physics Control Systems (PSN TUBT004), San Jose, CA, USA, November 27-30

    ACM Class: C.3; J.7

    Journal ref: eConf C011127 (2001) TUBT004