Mixture of Tokens: Efficient LLMs through Cross-Example Aggregation

Antoniak, Szymon; Jaszczur, Sebastian; Krutul, Michał; Pióro, Maciej; Krajewski, Jakub; Ludziejewski, Jan; Odrzygóźdź, Tomasz; Cygan, Marek

arXiv:2310.15961 (cs)

[Submitted on 24 Oct 2023]

Abstract: Despite the promise of Mixture of Experts (MoE) models in increasing parameter counts of Transformer models while maintaining training and inference costs, their application carries notable drawbacks. The key strategy of these models is to activate, for each processed token, at most a few experts - subsets of an extensive feed-forward layer. The operation of matching experts and tokens is discrete, which makes MoE models prone to issues like training instability and uneven expert utilization. Existing techniques designed to address these concerns, such as auxiliary losses or balance-aware matching, either lower model performance or make training more difficult. In response to these issues, we propose Mixture of Tokens, a fully-differentiable model that retains the benefits of MoE architectures while avoiding the aforementioned difficulties. Rather than routing tokens to experts, this approach mixes tokens from different examples prior to feeding them to experts, enabling the model to learn from all token-expert combinations. Importantly, this mixing can be disabled to avoid mixing of different sequences during inference. Crucially, this method is fully compatible with both masked and causal Large Language Model training and inference.
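
The abstract gives no formulas, so the following is only a minimal sketch of what "mixing tokens from different examples prior to feeding them to experts" could look like, not the authors' implementation. It assumes that each expert receives a softmax-weighted average of the tokens in a group, and that the expert's output is handed back to each token scaled by the same weight. The names `mix_tokens`, `mixture_of_tokens_layer`, `controller`, and the toy experts are hypothetical and chosen for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mix_tokens(group, controller_row):
    """Softly merge a group of token vectors into one vector for one expert.

    Assumption: the weights come from a softmax over per-token scores,
    so every token contributes and the operation stays differentiable.
    """
    weights = softmax([dot(tok, controller_row) for tok in group])
    dim = len(group[0])
    mixed = [sum(w * tok[i] for w, tok in zip(weights, group)) for i in range(dim)]
    return mixed, weights


def mixture_of_tokens_layer(group, controller, experts):
    """Apply every expert to its own mix of the group, then redistribute.

    Assumption: each token receives the expert output scaled by the weight
    it contributed to that expert's mix.
    """
    dim = len(group[0])
    outputs = [[0.0] * dim for _ in group]
    for controller_row, expert in zip(controller, experts):
        mixed, weights = mix_tokens(group, controller_row)
        processed = expert(mixed)
        for out, w in zip(outputs, weights):
            for i in range(dim):
                out[i] += w * processed[i]
    return outputs


# Toy setup: 3 tokens (one per example in the batch), hidden size 4, 2 experts.
random.seed(0)
group = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
controller = [[random.gauss(0, 1) for _ in range(4)] for _ in range(2)]
experts = [
    lambda v: [max(0.0, x) for x in v],  # stand-in for a small feed-forward expert
    lambda v: [2.0 * x for x in v],      # a second, different stand-in expert
]
outputs = mixture_of_tokens_layer(group, controller, experts)
```

In this sketch no discrete token-to-expert assignment is ever made: every token contributes to every expert's input through continuous weights, which is the property the abstract contrasts with MoE routing. With a group of size one, the softmax weight is 1 and the expert simply sees the token itself.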

Subjects:	Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2310.15961 [cs.CL]
	(or arXiv:2310.15961v1 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2310.15961

Submission history

From: Sebastian Jaszczur
[v1] Tue, 24 Oct 2023 16:03:57 UTC (396 KB)
