Showing 1–8 of 8 results for author: Kwon, W

Searching in archive cs.

  1. arXiv:2406.14066  [pdf, other]

    cs.AI cs.PF

    Optimizing Speculative Decoding for Serving Large Language Models Using Goodput

    Authors: Xiaoxuan Liu, Cade Daniel, Langxiang Hu, Woosuk Kwon, Zhuohan Li, Xiangxi Mo, Alvin Cheung, Zhijie Deng, Ion Stoica, Hao Zhang

    Abstract: Reducing the inference latency of large language models (LLMs) is crucial, and speculative decoding (SD) stands out as one of the most effective techniques. Rather than letting the LLM generate all tokens directly, speculative decoding employs effective proxies to predict potential outputs, which are then verified by the LLM without compromising the generation quality. Yet, deploying SD in real on…

    Submitted 25 June, 2024; v1 submitted 20 June, 2024; originally announced June 2024.

  2. arXiv:2309.06180  [pdf, other]

    cs.LG cs.DC

    Efficient Memory Management for Large Language Model Serving with PagedAttention

    Authors: Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, Ion Stoica

    Abstract: High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address…

    Submitted 12 September, 2023; originally announced September 2023.

    Comments: SOSP 2023

  3. arXiv:2308.07741  [pdf, other]

    cs.RO cs.LG

    Real Robot Challenge 2022: Learning Dexterous Manipulation from Offline Data in the Real World

    Authors: Nico Gürtler, Felix Widmaier, Cansu Sancaktar, Sebastian Blaes, Pavel Kolev, Stefan Bauer, Manuel Wüthrich, Markus Wulfmeier, Martin Riedmiller, Arthur Allshire, Qiang Wang, Robert McCarthy, Hangyeol Kim, Jongchan Baek, Wookyong Kwon, Shanliang Qian, Yasunori Toshimitsu, Mike Yan Michelis, Amirhossein Kazemipour, Arman Raayatsanati, Hehui Zheng, Barnabas Gavin Cangan, Bernhard Schölkopf, Georg Martius

    Abstract: Experimentation on real robots is demanding in terms of time and costs. For this reason, a large part of the reinforcement learning (RL) community uses simulators to develop and benchmark algorithms. However, insights gained in simulation do not necessarily translate to real robots, in particular for tasks involving complex interactions with the environment. The Real Robot Challenge 2022 therefore…

    Submitted 24 November, 2023; v1 submitted 15 August, 2023; originally announced August 2023.

    Comments: Typo in author list fixed

  4. arXiv:2211.06522  [pdf]

    eess.IV cs.CV q-bio.QM

    Deep Learning Generates Synthetic Cancer Histology for Explainability and Education

    Authors: James M. Dolezal, Rachelle Wolk, Hanna M. Hieromnimon, Frederick M. Howard, Andrew Srisuwananukorn, Dmitry Karpeyev, Siddhi Ramesh, Sara Kochanny, Jung Woo Kwon, Meghana Agni, Richard C. Simon, Chandni Desai, Raghad Kherallah, Tung D. Nguyen, Jefree J. Schulte, Kimberly Cole, Galina Khramtsova, Marina Chiara Garassino, Aliya N. Husain, Huihua Li, Robert Grossman, Nicole A. Cipriani, Alexander T. Pearson

    Abstract: Artificial intelligence methods including deep neural networks (DNN) can provide rapid molecular classification of tumors from routine histology with accuracy that matches or exceeds human pathologists. Discerning how neural networks make their predictions remains a significant challenge, but explainability tools help provide insights into what models have learned when corresponding histologic fea…

    Submitted 9 December, 2022; v1 submitted 11 November, 2022; originally announced November 2022.

  5. arXiv:2204.09656  [pdf, other]

    cs.CL cs.LG

    A Fast Post-Training Pruning Framework for Transformers

    Authors: Woosuk Kwon, Sehoon Kim, Michael W. Mahoney, Joseph Hassoun, Kurt Keutzer, Amir Gholami

    Abstract: Pruning is an effective way to reduce the huge inference cost of Transformer models. However, prior work on pruning Transformers requires retraining the models. This can add high training cost and high complexity to model deployment, making it difficult to use in many practical situations. To address this, we propose a fast post-training pruning framework for Transformers that does not require any…

    Submitted 17 October, 2022; v1 submitted 29 March, 2022; originally announced April 2022.

    Comments: NeurIPS 2022

  6. arXiv:2107.00910  [pdf, other]

    cs.CL

    Learned Token Pruning for Transformers

    Authors: Sehoon Kim, Sheng Shen, David Thorsley, Amir Gholami, Woosuk Kwon, Joseph Hassoun, Kurt Keutzer

    Abstract: Deploying transformer models in practice is challenging due to their inference cost, which scales quadratically with input sequence length. To address this, we present a novel Learned Token Pruning (LTP) method which adaptively removes unimportant tokens as an input sequence passes through transformer layers. In particular, LTP prunes tokens with an attention score below a threshold value which is…

    Submitted 2 June, 2022; v1 submitted 2 July, 2021; originally announced July 2021.

    Comments: KDD 2022 (Research Track)

  7. arXiv:2012.02732  [pdf, other]

    cs.LG cs.DC

    Nimble: Lightweight and Parallel GPU Task Scheduling for Deep Learning

    Authors: Woosuk Kwon, Gyeong-In Yu, Eunji Jeong, Byung-Gon Chun

    Abstract: Deep learning (DL) frameworks take advantage of GPUs to improve the speed of DL inference and training. Ideally, DL frameworks should be able to fully utilize the computation power of GPUs such that the running time depends on the amount of computation assigned to GPUs. Yet, we observe that in scheduling GPU tasks, existing DL frameworks suffer from inefficiencies such as large scheduling overhead…

    Submitted 4 December, 2020; originally announced December 2020.

    Comments: In NeurIPS 2020

  8. Selective Distillation of Weakly Annotated GTD for Vision-based Slab Identification System

    Authors: Sang Jun Lee, Sang Woo Kim, Wookyong Kwon, Gyogwon Koo, Jong Pil Yun

    Abstract: This paper proposes an algorithm for recognizing slab identification numbers in factory scenes. In the development of a deep-learning based system, manual labeling to make ground truth data (GTD) is an important but expensive task. Furthermore, the quality of GTD is closely related to the performance of a supervised learning algorithm. To reduce manual work in the labeling process, we generated we…

    Submitted 13 December, 2018; v1 submitted 9 October, 2018; originally announced October 2018.

    Comments: 10 pages, 12 figures, submitted to a journal

    Journal ref: IEEE Access 7 (2019) 23177-23186