Jamba: A Hybrid Transformer-Mamba Language Model
Authors:
Opher Lieber,
Barak Lenz,
Hofit Bata,
Gal Cohen,
Jhonathan Osin,
Itay Dalmedigos,
Erez Safahi,
Shaked Meirom,
Yonatan Belinkov,
Shai Shalev-Shwartz,
Omri Abend,
Raz Alon,
Tomer Asida,
Amir Bergman,
Roman Glozman,
Michael Gokhman,
Avashalom Manevich,
Nir Ratner,
Noam Rozen,
Erez Shwartz,
Mor Zusman,
Yoav Shoham
Abstract:
We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU. Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling. We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license.
Submitted 3 July, 2024; v1 submitted 28 March, 2024;
originally announced March 2024.
Large fluctuations of a Kardar-Parisi-Zhang interface on a half-line: the height statistics at a shifted point
Authors:
Tomer Asida,
Eli Livne,
Baruch Meerson
Abstract:
We consider a stochastic interface $h(x,t)$, described by the $1+1$ Kardar-Parisi-Zhang (KPZ) equation on the half-line $x\geq0$ with the reflecting boundary at $x=0$. The interface is initially flat, $h(x,t=0)=0$. We focus on the short-time probability distribution $\mathcal{P}\left(H,L,t\right)$ of the height $H$ of the interface at point $x=L$. Using the optimal fluctuation method, we determine the (Gaussian) body of the distribution and the strongly asymmetric non-Gaussian tails. We find that the slower-decaying tail scales as $-\sqrt{t}\,\ln\mathcal{P}\simeq\left|H\right|^{3/2}f_{-}\left(L/\sqrt{\left|H\right|t}\right)$, and calculate the function $f_{-}(\dots)$ analytically. Remarkably, this tail exhibits a first-order dynamical phase transition at a critical value of $L$, $L_{c}=0.60223\dots\sqrt{\left|H\right|t}$. The transition results from a competition between two different fluctuation paths of the system. The faster-decaying tail scales as $-\sqrt{t}\,\ln\mathcal{P}\simeq|H|^{5/2}f_{+}\left(L/\sqrt{|H|t}\right)$. We evaluate the function $f_{+}(\dots)$ using a specially developed numerical method, which involves solving a nonlinear second-order elliptic equation in Lagrangian coordinates. The faster-decaying tail also involves a sharp transition, which occurs at a critical value $L_{c}\simeq2\sqrt{2|H|t}/\pi$. This transition is similar to the one recently found for the KPZ equation on a ring, and we believe that it has the same fractional order $5/2$. It is smoothed, however, by small diffusion effects.
Submitted 31 March, 2019; v1 submitted 22 January, 2019;
originally announced January 2019.