Showing 1–1 of 1 results for author: Chochowski, M

Searching in archive cs.
  1. arXiv:2403.09636  [pdf, other]

    cs.CL

    Dynamic Memory Compression: Retrofitting LLMs for Accelerated Inference

    Authors: Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, Edoardo M. Ponti

    Abstract: Transformers have emerged as the backbone of large language models (LLMs). However, generation remains inefficient due to the need to store in memory a cache of key-value representations for past tokens, whose size scales linearly with the input sequence length and batch size. As a solution, we propose Dynamic Memory Compression (DMC), a method for on-line key-value cache compression at inference…

    Submitted 14 March, 2024; originally announced March 2024.