SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts
Authors:
Raghu Prabhakar, Ram Sivaramakrishnan, Darshan Gandhi, Yun Du, Mingran Wang, Xiangyu Song, Kejie Zhang, Tianren Gao, Angela Wang, Karen Li, Yongning Sheng, Joshua Brot, Denis Sokolov, Apurv Vivek, Calvin Leung, Arjun Sabnis, Jiayu Bai, Tuowen Zhao, Mark Gottscho, David Jackson, Mark Luttrell, Manish K. Shah, Edison Chen, Kaizhao Liang, Swayambhoo Jain, et al. (5 additional authors not shown)
Abstract:
Monolithic large language models (LLMs) like GPT-4 have paved the way for modern generative AI applications. Training, serving, and maintaining monolithic LLMs at scale, however, remain prohibitively expensive and challenging. The disproportionate increase in the compute-to-memory ratio of modern AI accelerators has created a memory wall, necessitating new methods to deploy AI. Composition of Experts (CoE) is an alternative modular approach that lowers the cost and complexity of training and serving. However, this approach presents two key challenges when using conventional hardware: (1) without fused operations, smaller models have lower operational intensity, which makes high utilization more challenging to achieve; and (2) hosting a large number of models can be either prohibitively expensive or slow when dynamically switching between them.
In this paper, we describe how combining CoE, streaming dataflow, and a three-tier memory system scales the AI memory wall. We describe Samba-CoE, a CoE system with 150 experts and a trillion total parameters. We deploy Samba-CoE on the SambaNova SN40L Reconfigurable Dataflow Unit (RDU) - a commercial dataflow accelerator architecture that has been co-designed for enterprise inference and training applications. The chip introduces a new three-tier memory system with on-chip distributed SRAM, on-package HBM, and off-package DDR DRAM. A dedicated inter-RDU network enables scaling up and out over multiple sockets. We demonstrate speedups ranging from 2x to 13x on various benchmarks running on eight RDU sockets compared with an unfused baseline. We show that for CoE inference deployments, the 8-socket RDU Node reduces machine footprint by up to 19x, speeds up model switching time by 15x to 31x, and achieves an overall speedup of 3.7x over a DGX H100 and 6.6x over a DGX A100.
Submitted 13 May, 2024;
originally announced May 2024.
Rupture of a surfactant-laden draining thin film
Authors:
Atul S Vivek, Ranabir Dey, Harish N Dixit
Abstract:
Surfactant-laden thin liquid films overlaid on solid substrates are encountered in a variety of industrial and biological settings. As these films reach submicron thickness, they tend to become unstable owing to the influence of long-range dispersion forces. In the current study, we investigate how gravitational drainage affects the stability of such thin liquid films. Using scaling arguments, we demonstrate that gravity and dispersion forces can exert their influence simultaneously over a wide range of film thicknesses. In the lubrication limit, we carry out linear stability analysis and nonlinear simulations to understand the evolution of draining thin films. Linear stability analysis indicates the existence of two unstable modes and two cut-off wavenumbers, as opposed to the single unstable mode and unique cut-off wavenumber observed in stationary films. We also find that surfactant-laden flowing films are more stable than both stationary films with surfactants and draining films with clean interfaces. The origin of this stabilization is identified as the enhanced surfactant perturbations generated by drainage. We demonstrate that films with intermediate levels of surfactant activity and significant drainage exhibit the lowest rates of disturbance growth, thereby extending the time to rupture.
Submitted 4 January, 2024; v1 submitted 23 December, 2023;
originally announced December 2023.