Scalable MatMul-free Language Modeling
Authors: Rui-Jie Zhu, Yu Zhang, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Peng Zhou, Jason K. Eshraghian
Abstract:
Matrix multiplication (MatMul) typically dominates the overall computational cost of large language models (LLMs). This cost only grows as LLMs scale to larger embedding dimensions and context lengths. In this work, we show that MatMul operations can be completely eliminated from LLMs while maintaining strong performance at billion-parameter scales. Our experiments show that our proposed MatMul-free models achieve performance on par with state-of-the-art Transformers that require far more memory during inference at a scale up to at least 2.7B parameters. We investigate the scaling laws and find that the performance gap between our MatMul-free models and full-precision Transformers narrows as the model size increases. We also provide a GPU-efficient implementation of this model which reduces memory usage by up to 61% over an unoptimized baseline during training. By utilizing an optimized kernel during inference, our model's memory consumption can be reduced by more than 10x compared to unoptimized models. To properly quantify the efficiency of our architecture, we build a custom hardware solution on an FPGA which exploits lightweight operations beyond what GPUs are capable of. We processed billion-parameter scale models at 13W beyond human-readable throughput, moving LLMs closer to brain-like efficiency. This work not only shows how far LLMs can be stripped back while still performing effectively, but also points at the types of operations future accelerators should be optimized for in processing the next generation of lightweight LLMs. Our code implementation is available at https://github.com/ridgerchu/matmulfreellm.
Submitted 18 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.
FPIA: Field-Programmable Ising Arrays with In-Memory Computing
Authors: George Higgins Hutchinson, Ethan Sifferman, Tinish Bhattacharya, Dmitri B. Strukov
Abstract:
The Ising Machine is a promising computing approach for solving combinatorial optimization problems. It is naturally suited for energy-saving and compact in-memory computing implementations with emerging memories. A naïve in-memory computing implementation of a quadratic Ising Machine requires an array of coupling weights that grows quadratically with problem size. However, the resources in such an approach are used inefficiently due to sparsity in practical optimization problems. We first show that this issue can be addressed by partitioning a coupling array into smaller sub-arrays. This technique, however, requires interconnecting the sub-arrays; hence, we developed an in-memory computing architecture for quadratic Ising Machines inspired by island-type field-programmable gate arrays, which is the main contribution of our paper. We adapt open-source tools to optimize problem embedding and model routing overhead. Modeling results on benchmark problems for the developed architecture show up to 60x area improvement and faster operation than the baseline approach. Finally, we discuss algorithm/circuit co-design techniques for further improvements.
Submitted 29 January, 2024; originally announced January 2024.