MLKV: Multi-Layer Key-Value Heads for Memory Efficient Transformer Decoding

Zuhri, Zayd Muhammad Kawakibi; Adilazuarda, Muhammad Farid; Purwarianti, Ayu; Aji, Alham Fikri

arXiv:2406.09297 (cs)

[Submitted on 13 Jun 2024 (v1), last revised 16 Jun 2024 (this version, v2)]

Abstract: Auto-regressive inference of transformers benefits greatly from Key-Value (KV) caching, but the cache can become a major memory bottleneck as model size, batch size, and sequence length grow. We introduce Multi-Layer Key-Value (MLKV) sharing, a novel approach that extends KV sharing across transformer layers to reduce memory usage beyond what is possible with Multi-Query Attention (MQA) and Grouped-Query Attention (GQA). Evaluations on various NLP benchmarks and inference metrics using uptrained Pythia-160M variants demonstrate that MLKV significantly reduces memory usage with minimal performance loss, shrinking the KV cache by up to a factor of 6x compared to MQA. These results highlight MLKV's potential for efficient deployment of transformer models at scale. We provide code at this https URL
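
To make the memory argument concrete, the following is a rough back-of-the-envelope sketch, not the authors' code. It assumes the KV cache stores one key and one value vector per cached KV head per token, and it assumes a Pythia-160M-like shape of 12 layers, 12 query heads, and a head dimension of 64, with fp16 values, a batch size of 1, and a sequence length of 2048. None of these figures appear in the abstract. The function name `kv_cache_bytes`, the GQA group count, and the MLKV configuration with 2 KV heads in total are hypothetical choices, the last one picked only to illustrate the stated 6x ratio against MQA.

```python
def kv_cache_bytes(total_kv_heads, head_dim, seq_len, batch_size, bytes_per_value=2):
    """Estimate KV cache size in bytes.

    total_kv_heads counts KV heads summed over the whole model, so sharing
    heads within a layer (MQA/GQA) or across layers (MLKV) lowers it.
    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * total_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Assumed model shape (not stated in the abstract).
NUM_LAYERS = 12
NUM_QUERY_HEADS = 12
HEAD_DIM = 64

# Total KV heads across the model under each sharing scheme.
total_kv_heads = {
    "MHA": NUM_LAYERS * NUM_QUERY_HEADS,      # every query head has its own KV head
    "GQA (4 KV heads per layer)": NUM_LAYERS * 4,  # assumed group count
    "MQA": NUM_LAYERS * 1,                    # one KV head per layer
    "MLKV (2 KV heads in total)": 2,          # KV heads shared across layers (assumed)
}

sizes = {
    name: kv_cache_bytes(heads, HEAD_DIM, seq_len=2048, batch_size=1)
    for name, heads in total_kv_heads.items()
}

for name, size in sizes.items():
    print(f"{name}: {size / 2**20:.2f} MiB")

mqa_to_mlkv = sizes["MQA"] / sizes["MLKV (2 KV heads in total)"]
print(f"MQA / MLKV ratio: {mqa_to_mlkv:.0f}x")
```

Under these assumptions, MQA still keeps one KV head in every layer, so its cache scales with the layer count. Sharing KV heads across layers removes that floor, which is where a reduction relative to MQA can come from.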

Subjects:	Machine Learning (cs.LG)
Cite as:	arXiv:2406.09297 [cs.LG]
	(or arXiv:2406.09297v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2406.09297

Submission history

From: Zayd Muhammad Kawakibi Zuhri
[v1] Thu, 13 Jun 2024 16:33:44 UTC (8,620 KB)
[v2] Sun, 16 Jun 2024 03:57:51 UTC (8,622 KB)
