MOYU: A Theoretical Study on Massive Over-activation Yielded Uplifts in LLMs

Chi Ma (Meituan), Mincong Huang (Meituan), Chao Wang (Meituan), Yujie Wang (Meituan; contact email: [email protected]), Lei Yu (Meituan)
Abstract

Massive Over-activation Yielded Uplifts (MOYU) is an inherent property of large language models, and dynamic activation (DA) based on the MOYU property is a clever yet under-explored strategy for accelerating inference in these models. Existing methods that exploit MOYU often face a significant "Impossible Trinity": they struggle to simultaneously maintain model performance, enhance inference speed, and extend applicability across various architectures. To address the theoretical ambiguities surrounding MOYU, this paper elucidates the root cause of the MOYU property and outlines the mechanisms behind two primary limitations encountered by current DA methods: 1) history-related activation uncertainty, and 2) semantic-irrelevant activation inertia. Our analysis not only underscores the limitations of current dynamic activation strategies within large-scale LLaMA models but also identifies opportunities for refining the design of future sparsity schemes.

1 Introduction

Large language models (LLMs), including the LLaMA, GPT, and OPT series, have showcased remarkable performance and in-context learning capabilities through the use of extensive parameters. However, their computational and memory demands during inference, especially in latency-sensitive scenarios, are substantial. To address these challenges, various techniques based on Massive Over-activation Yielded Uplifts (MOYU) have been proposed. These methods aim to reduce latency by minimizing the excessive activation of heads, neurons, or weights during inference.

Existing MOYU-based techniques can be categorized into static and dynamic activation methods. Static activation (SA), such as pruning, reduces the excess activated weights in LLMs based on metrics like magnitude, and can be applied either once or iteratively. The resulting configurations remain unchanged for all subsequent inputs and are fully activated during inference. However, a key limitation of SA is that once the process is complete, the deactivated weights cannot be reactivated without a recovery phase, which can lead to diminished performance and a loss of in-context learning capabilities. Furthermore, iterative SA requires substantial additional training effort, which may not yield a proportionate speedup.

On the other hand, MOYU-based dynamic activation (DA) offers adaptability by selectively activating certain heads or neurons during inference, thereby enhancing computational efficiency. This approach capitalizes on the inherent property of massive over-activation found in LLMs to optimize resource utilization. The existing research on DA can be categorized as follows:

  1. Threshold Dynamic Activation (TDA): TDA uses a predefined threshold to determine which activation units to retain or discard, as depicted in Figure 1(a). Units with activation values below this threshold are either set to zero or removed during the current forward propagation, thus reducing computational overhead.

  2. Router-off-the-loop Dynamic Activation (RODA): This method employs a pre-trained router block to dynamically determine which activation units are crucial during the model's forward propagation. The router is trained using the model's historical data. For instance, DejaVu[1] utilizes a predictive router comprising a two-layer linear network, as shown in Figure 1(b).

  3. Router-in-the-loop Dynamic Activation (RIDA): In contrast to RODA, the router in RIDA makes decisions dynamically based on the current input and contextual information. This enables the router to adjust its routing strategy in real time and to manage the complexities of the task at hand, thereby enhancing both efficiency and accuracy. RIDA is primarily implemented within the Mixture of Experts (MoE) structure (Figure 1(c)). For example, DS-MoE[2] employs a TopK router in its framework. Similarly, Griffin[3] utilizes a TopK router to construct an MoE from a dense model in a training-free manner (Figure 1(d)).

Figure 1: Four kinds of DA methods: (a) TDA, (b) DejaVu RODA, (c) MoE RIDA, (d) Sequential RIDA.
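To make the difference between threshold-based and TopK-based selection concrete, the following minimal Python sketch applies both masking rules to a toy activation vector. The function names `tda_mask` and `topk_mask`, the use of absolute values as the selection score, and the toy numbers are illustrative assumptions; they are not taken from DejaVu, DS-MoE, or Griffin.

```python
from typing import List


def tda_mask(activations: List[float], threshold: float) -> List[float]:
    """Threshold rule: zero out units whose magnitude is below the threshold.

    Using the absolute value as the score is an assumption of this sketch.
    """
    return [a if abs(a) >= threshold else 0.0 for a in activations]


def topk_mask(activations: List[float], k: int) -> List[float]:
    """TopK rule: keep the k units with the largest magnitude, zero the rest."""
    ranked = sorted(range(len(activations)),
                    key=lambda i: abs(activations[i]), reverse=True)
    keep = set(ranked[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(activations)]


# Toy activation vector (hypothetical values).
acts = [0.02, -1.3, 0.4, 0.0, 2.1, -0.05]
tda_out = tda_mask(acts, threshold=0.1)
topk_out = topk_mask(acts, k=2)
```

With the threshold rule the number of surviving units depends on the input, whereas the TopK rule fixes the number of active units in advance.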

Given the "Impossible Trinity", TDA struggles to enhance inference speed, while RODA fails to extend its applicability from ReLU to non-ReLU activated architectures. MoE-based RIDA requires training together with the entire model, whereas sequential RIDA, as demonstrated by Griffin, can address all three challenges simultaneously.

However, despite these advancements, current research on MOYU and DA still lacks a comprehensive theoretical framework that adequately explains the MOYU phenomena across various architectures and activation functions, as well as the underlying mechanisms of MOYU within sequences.

In this paper, we first develop a mathematical rationale that elucidates the origins of the MOYU phenomenon. From this perspective, we then analyze the causes of two major limitations of existing DA methods:

  • Limitation 1: the restriction to ReLU activation functions.

    We suggest that at the token level, history-related activation uncertainty (Section 4.1) makes it challenging to predict the importance of weights in non-ReLU models, thereby limiting token-level RODA methods to ReLU models.

  • Limitation 2: the inability to identify active neurons based on semantic similarity.

    We suggest that at the sequence level, neuron activation is semantically irrelevant (Section 4.2). In other words, neurons are more likely to be activated by the most dominant elements within the same sequence than by the semantic content of the input itself, which in turn limits sequence-level DA to RIDA rather than RODA.

  • In short, it is disheartening that, technically, only three DA strategies remain: token-level RODA for ReLU models, token-level RIDA (MoE), and sequence-level TDA and RIDA as discussed in this paper.

The rest of the paper is organized as follows. Related works are reviewed in Section 2. We introduce our universal theoretical framework in Section 3 and Section 4, and draw conclusions and limitations in Section 5.

2 Related Works

2.1 Massive Over-activation

In the study of large language models (LLMs), the term massive over-activation refers to the excessive activation of numerous neurons during task execution, which can lead to computational waste and decreased efficiency[4, 5]. Research[6] indicates that dense deep neural networks often suffer from this issue. By treating the discrete sparsification process as a continuous problem, end-to-end optimization of the model architecture becomes feasible. The Lottery Ticket Hypothesis[7, 8] further emphasizes the significance of pruning techniques in reducing unnecessary connections and mitigating over-activation in dense models.

Additional research[9] introduces this concept through a "sparsely-gated mixture-of-experts (MoE) layer," which increases model capacity while reducing computational costs. Moreover, MC-SMoE[10] tackles the issue of massive over-activation in MoEs by streamlining the model architecture. This is achieved through the merging and low-rank decomposition of redundant experts, guided by the router's information.

2.2 TDA and RODA

Research[6, 11] elucidates the capacity of ReLU to introduce activation sparsity and proposes the concept of dynamic activation. DejaVu[1] identifies that the sparsity introduced by ReLU can be predicted, thus proposing the first viable RODA scheme. On the OPT series, DejaVu achieves a 2-6x inference speedup at 75% sparsity. Building upon the DejaVu approach, ReLU2[12] is the first to apply TDA to non-ReLU models and achieves nearly 70% sparsity with minimal loss in model performance. ProSparse[13] proposes a practical DA inference framework and, building on ReLU2, incurs only a 1% increase in perplexity at approximately 80% sparsity by replacing the activation function and continuing to induce sparsity.

2.3 RIDA

Router-in-the-loop is the predominant method within the MoE framework. Unlike TDA and RODA methods, most RIDA approaches rely on training an expert router to facilitate dynamic activation. LLaMA-MoE[14] transforms feed-forward networks (FFNs) into MoEs by constructing experts and training an additional gating network for expert routing. DS-MoE[2] introduces a framework that uses dense computation during training and switches to sparse computation during inference, significantly enhancing parameter efficiency over traditional sparse MoE methods and reducing the total parameter count. Learn-To-be-Efficient[15] balances sparsity and performance by activating fewer neurons and is applicable to models with both ReLU and non-ReLU activation functions. Lory[16] retains the autoregressive properties of language models by adopting a causally segmented routing strategy and a similarity-based data batching method. This approach enables efficient expert merging operations and promotes specialization among experts in processing similar documents during training.

3 Unveiling MOYU

Section 2 reviewed the literature relevant to MOYU. This section outlines the theoretical foundations of MOYU and then presents a mathematical derivation. Following literature[17], we show how massive over-activation arises and why SwiGLU cannot produce greater sparsity than ReLU.

Assume a neural network as in Equation 1:

$$f(x)=\boldsymbol{V}\sigma(p(\boldsymbol{x};\boldsymbol{\theta})) \tag{1}$$

where $\boldsymbol{V}=[v_{1},\ldots,v_{d_{ff}}]$ is the network parameter for the last layer, drawn from a random distribution, $\sigma(\cdot)$ is the SwiGLU activation function, and $p(\boldsymbol{x};\boldsymbol{\theta})$ denotes all other layers with parameter $\theta$. We write $p=p(\boldsymbol{x};\boldsymbol{\theta})$ for simplicity.

Consider the cross-entropy (CE) loss $\ell_{CE}(f(\boldsymbol{x}),\boldsymbol{y})$, where $\boldsymbol{y}$ is an arbitrary vector that sums to one and is independent of $\boldsymbol{V}$. Assume that the entries of $\boldsymbol{V}$ are drawn from independent distributions, that the probability of any entry of $\boldsymbol{V}$ being 0 is less than 1, and that $E[\boldsymbol{V}]=0$. If there exists an $i^{*}$ such that $p_{i^{*}}>0$, then we have Equation 2:

$$\frac{\partial\ell}{\partial p_{i^{*}}}=\left\langle\frac{\partial\ell}{\partial f},\frac{\partial f}{\partial p_{i^{*}}}\right\rangle=\left\langle\frac{\partial\ell}{\partial f},v_{i^{*}}\right\rangle \tag{2}$$

Substituting the CE loss function into Equation 2 yields Equation 3:

$$\begin{aligned}
\frac{\partial\ell_{CE}}{\partial f} &= \frac{\exp(f(x))}{\left\langle \exp(f(x)),\boldsymbol{1}\right\rangle}-y \\
&= \frac{\exp\left(\sum_{i}\sigma(p_{i})\cdot\boldsymbol{v_{i}}\right)}{\left\langle \exp\left(\sum_{i}\sigma(p_{i})\cdot\boldsymbol{v_{i}}\right),\boldsymbol{1}\right\rangle}-y
\end{aligned} \tag{3}$$

By substituting Equation 3 back into Equation 2, we obtain Equation 4:

$$\frac{\partial\ell_{CE}}{\partial p_{i^{*}}}=\frac{\left\langle \exp\left(\sum_{i}\sigma(p_{i})\cdot\boldsymbol{v_{i}}\right),\boldsymbol{v_{i^{*}}}\right\rangle}{\left\langle \exp\left(\sum_{i}\sigma(p_{i})\cdot\boldsymbol{v_{i}}\right),\boldsymbol{1}\right\rangle}-\left\langle\boldsymbol{v_{i^{*}}},y\right\rangle \tag{4}$$

Expanding the numerator of Equation 4 yields Equation 5. In Equation 5, we assume that the parameters $\theta$ and $\tau$ have no negative features. If we have $p_{i^{*}}^{0}=Swish_{1}(x\theta)\odot(x\tau)$ and $p_{i^{*}}^{1}=ReLU(x)$ respectively, it is easy to see that $Swish_{1}(x\theta)<x\theta$ when $x>0$, and that $p_{i^{*}}^{0}<x\theta=p_{i^{*}}^{1}$ and $p_{i^{*}}^{0}<x\tau$ hold.

$$\begin{aligned}
\left\langle \exp\left(\sum_{i}\sigma(p_{i})\cdot\boldsymbol{v_{i}}\right),\boldsymbol{v_{i^{*}}}\right\rangle &= \sum_{m}\left(v_{i^{*},m}\cdot \exp\left(\sum_{i}\sigma(p_{i})\cdot v_{im}\right)\right) \\
&= \sum_{m}\left(v_{i^{*},m}\cdot \exp(p_{i^{*}}\cdot v_{i^{*}m})\cdot \exp\left(\sum_{i\neq i^{*}}\sigma(p_{i})\cdot v_{im}\right)\right)
\end{aligned} \tag{5}$$

Similar to literature[17], we also have $\mathrm{E}\left[\frac{\partial\ell_{CE}}{\partial p_{i^{*}}}\right]>0$, since the expectation of $V$ is zero and the transformation of the activation function does not change the non-negative property of the loss expectations.

$$\mathrm{E}\left[\frac{C_{1}V\cdot \exp(pV)}{C_{2}\exp(pV)+C_{3}}\right]=\mathrm{E}\left[\frac{C_{1}V}{C_{2}+C_{3}\exp(-pV)}\right] \tag{6}$$

The first term on the right-hand side (RHS) of the loss gradient in Equation 4 can be simplified to the form presented in Equation 6, while the expectation of the second term on the RHS is zero. Given $p_{i^{*}}^{0}<p_{i^{*}}^{1}$, Equation 6 shows that switching the activation function from ReLU to SwiGLU decreases the expected value of the loss gradient.

This implies that if there exists an $i^{*}$ such that $p_{i^{*}}>0$, the gradient of the cross-entropy loss with respect to any positive activation $p_{i^{*}}>0$ is positive in expectation. Consequently, any training algorithm that follows the negative gradient direction tends to reduce the magnitude of such positive activations, leading to a smaller training loss and thus promoting sparsity.
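The gradient expression in Equation 4 can be checked numerically on a toy network. The sketch below is a minimal check, under the assumption that $\sigma$ is ReLU so that $\partial f/\partial p_{i^{*}}=v_{i^{*}}$ holds exactly for a positive $p_{i^{*}}$. It evaluates the right-hand side of Equation 4 and compares it with a central finite difference of the CE loss. All function names and numeric values are hypothetical and chosen only for illustration.

```python
import math
from typing import List

Vector = List[float]


def relu(z: float) -> float:
    return max(z, 0.0)


def logits(p: Vector, V: List[Vector]) -> Vector:
    """f = sum_i relu(p_i) * v_i, where V[i] holds the vector v_i."""
    return [sum(relu(p[i]) * V[i][m] for i in range(len(p)))
            for m in range(len(V[0]))]


def ce_loss(p: Vector, V: List[Vector], y: Vector) -> float:
    """Cross-entropy between softmax(f) and a target y that sums to one."""
    f = logits(p, V)
    log_z = math.log(sum(math.exp(fm) for fm in f))
    return sum(ym * (log_z - fm) for ym, fm in zip(y, f))


def grad_eq4(p: Vector, V: List[Vector], y: Vector, i_star: int) -> float:
    """Right-hand side of Equation 4 (valid when p[i_star] > 0)."""
    f = logits(p, V)
    z = sum(math.exp(fm) for fm in f)
    softmax_term = sum(math.exp(fm) * vm for fm, vm in zip(f, V[i_star])) / z
    return softmax_term - sum(vm * ym for vm, ym in zip(V[i_star], y))


def finite_diff(p: Vector, V: List[Vector], y: Vector, i_star: int,
                eps: float = 1e-6) -> float:
    """Central finite difference of the CE loss with respect to p[i_star]."""
    hi, lo = list(p), list(p)
    hi[i_star] += eps
    lo[i_star] -= eps
    return (ce_loss(hi, V, y) - ce_loss(lo, V, y)) / (2.0 * eps)


# Toy values (hypothetical); p[0] > 0 plays the role of p_{i*}.
p = [0.7, -0.3, 1.2]
V = [[0.5, -1.0, 0.2], [0.3, 0.8, -0.6], [-0.9, 0.4, 1.1]]
y = [0.2, 0.5, 0.3]
analytic = grad_eq4(p, V, y, i_star=0)
numeric = finite_diff(p, V, y, i_star=0)
```

For a single fixed draw of $\boldsymbol{V}$ the gradient may take either sign; the positivity claim concerns the expectation over $\boldsymbol{V}$.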

In this process, the ReLU activation function causes a greater reduction in magnitude compared to SwiGLU.
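The comparison between the two activations can be illustrated with a small numerical sketch of Equation 6. It assumes positive constants $C_{1}, C_{2}, C_{3}$, takes $V$ uniform on the symmetric zero-mean set $\{-1, +1\}$, and omits the SwiGLU gate factor $x\tau$ (equivalently, sets it to one). These choices, and all names below, are assumptions made for illustration and are not part of the original derivation.

```python
import math


def swish1(z: float) -> float:
    """Swish with beta = 1, i.e. z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def eq6_lhs(v: float, p: float, c1: float, c2: float, c3: float) -> float:
    """Integrand on the left-hand side of Equation 6."""
    return c1 * v * math.exp(p * v) / (c2 * math.exp(p * v) + c3)


def eq6_rhs(v: float, p: float, c1: float, c2: float, c3: float) -> float:
    """Integrand on the right-hand side of Equation 6."""
    return c1 * v / (c2 + c3 * math.exp(-p * v))


def expected_term(p: float, c1: float = 1.0, c2: float = 1.0, c3: float = 1.0,
                  values=(-1.0, 1.0)) -> float:
    """Expectation of the Equation 6 term for V uniform on a zero-mean set."""
    return sum(eq6_rhs(v, p, c1, c2, c3) for v in values) / len(values)


z = 1.5                # hypothetical positive pre-activation (x * theta)
p_relu = max(z, 0.0)   # ReLU keeps the value unchanged
p_swish = swish1(z)    # Swish_1 shrinks it, so p_swish < p_relu
term_relu = expected_term(p_relu)
term_swish = expected_term(p_swish)
```

In this toy setting both expectations are positive, and the smaller activation produced by Swish gives the smaller expected term, in line with the argument above.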

4 Sequencing MOYU

In Section 3, this paper theoretically deduces the root causes of the MOYU phenomenon and explores how non-ReLU activation functions might mitigate it. The literature[18, 19, 20] has highlighted that the current level of activation map sparsity is insufficient to fully exploit the performance of DA methods. In this section, we identify two limitations associated with choosing DA methods as discussed in Sections 4.1 and 4.2.

4.1 History-related Activation Uncertainty

RODA schemes excel in models that utilize ReLU as the activation function[11, 1, 12, 13]. However, in models employing non-ReLU activation functions, the offline-trained router struggles to accurately select which heads and neurons will be activated[21, 3].

We suggest in this section that the failure of RODA in non-ReLU scenarios is closely linked to shifts in weight importance under different historical inputs: a router trained on diverse historical activation data may find it challenging to accurately identify the weights that are most crucial for the current input.

Similarly, we assume a ReLU-activated model as described in Equation 1. The simplified loss of an input token $x_{i}$ can be written as Equation 7:

$$L_{i}=\left(\frac{\partial f}{\partial x_{i}}\mathrm{d}x_{i}+\frac{\partial f}{\partial\theta_{i}}\mathrm{d}\theta_{i}\right)^{T}\left(\frac{\partial f}{\partial x_{i}}\mathrm{d}x_{i}+\frac{\partial f}{\partial\theta_{i}}\mathrm{d}\theta_{i}\right) \tag{7}$$

The sensitivity to weight changes (gradients) in model training is given by Equation 8:

$$\frac{\partial L_{i}}{\partial\mathrm{d}\theta_{i}}=2\left(\frac{\partial f}{\partial x_{i}}\mathrm{d}x_{i}+\frac{\partial f}{\partial\theta_{i}}\mathrm{d}\theta_{i}\right)\frac{\partial f}{\partial\theta_{i}} \tag{8}$$

By summing the gradients, we have Equation 9:

$$\begin{aligned}
\nabla_{\mathrm{d}\theta_{i}}L &= \sum_{i}2\left(\frac{\partial f}{\partial x_{i}}\mathrm{d}x_{i}+\frac{\partial f}{\partial\theta_{i}}\mathrm{d}\theta_{i}\right)\frac{\partial f}{\partial\theta_{i}} \\
&= \nabla_{\mathrm{d}\theta_{i}}L_{i}+\sum_{j=0:i-1}\nabla_{\mathrm{d}\theta_{j}}L_{j}
\end{aligned} \tag{9}$$

The importance of model weights can then be described by Equation 10:

$$\begin{aligned}
\Theta_{i} &= \sum_{i}|V\cdot\nabla_{\mathrm{d}\theta_{i}}L_{i}| \\
&= |V|\cdot\sum_{i}|\nabla_{\mathrm{d}\theta_{i}}L_{i}| \\
&= |V|\cdot\left(\nabla_{\mathrm{d}\theta_{i}}L_{i}+\sum_{j=0:i-1}\nabla_{\mathrm{d}\theta_{j}}L_{j}\right) \\
&= |V|\cdot\nabla_{\mathrm{d}\theta_{i}}L_{i}+\Theta_{i-1}
\end{aligned} \tag{10}$$

which means that the weight importance of a model is related not only to the current input along the direction of $\theta$, but also to the cumulative gradient information from all previous data.

For models utilizing ReLU activation, Equation 10 simplifies to the sum of the weights corresponding to positive inputs, which linearly correlates with the magnitude of the current weights themselves. However, for models employing non-ReLU activations, the significance of the current weights becomes considerably more complex.
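The recursion in Equation 10 can be illustrated with a short sketch that accumulates per-weight importance over a token sequence and compares the resulting ranking with one based on the current token alone. The helper names, the element-wise treatment of $|V|$, the use of absolute gradient values (as in the first two lines of Equation 10), and the toy gradients are all assumptions made for illustration.

```python
from typing import List


def update_importance(prev: List[float], abs_v: List[float],
                      grad: List[float]) -> List[float]:
    """One step of Theta_i = |V| * |grad_i| + Theta_{i-1}, element-wise."""
    return [t + av * abs(g) for t, av, g in zip(prev, abs_v, grad)]


def top_k(scores: List[float], k: int) -> List[int]:
    """Sorted indices of the k largest scores."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


abs_v = [1.0, 1.0, 1.0, 1.0]                 # |V| per weight (hypothetical)
history_grads = [[0.9, 0.1, 0.0, 0.2],       # gradients of earlier tokens
                 [0.8, 0.0, 0.1, 0.3]]
current_grad = [0.0, 0.7, 0.6, 0.1]          # gradient of the current token

# Importance seen by a rule that looks at the current token only.
current_only = update_importance([0.0] * 4, abs_v, current_grad)

# Importance accumulated over the whole history, following Equation 10.
cumulative = [0.0] * 4
for g in history_grads + [current_grad]:
    cumulative = update_importance(cumulative, abs_v, g)

selected_current = top_k(current_only, k=2)
selected_cumulative = top_k(cumulative, k=2)
```

In this toy case the two rankings disagree: weight 0 dominates the accumulated importance although the current token does not use it, which is the kind of history dependence discussed above.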

4.2 Semantic-irrelevant Activation Inertia

Using a simplified loss function, Section 4.1 demonstrated that models with non-ReLU activation rely on historical information to accurately decide which neurons to activate. This section shows that historical information is significantly influenced by the Heavy Hitter ($H_{2}$), and that the occurrence of $H_{2}$ is not related to semantics[22].

Following literature[23] we have H2:S[m]:subscript𝐻2superscript𝑆delimited-[]𝑚H_{2}:S^{*}\subset[m]italic_H start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT : italic_S start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT ⊂ [ italic_m ], and k=|S|,τ(0,1)formulae-sequence𝑘superscript𝑆𝜏01k=|S^{*}|,\leavevmode\nobreak\ \tau\in(0,1)italic_k = | italic_S start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT | , italic_τ ∈ ( 0 , 1 ) denote a threshold. α(0,1)𝛼01\alpha\in(0,1)italic_α ∈ ( 0 , 1 ) denote a fraction of mass(larger than τ𝜏\tauitalic_τ) outside Ssuperscript𝑆S^{*}italic_S start_POSTSUPERSCRIPT ∗ end_POSTSUPERSCRIPT.

It is natural that attention with $H_2$ is an $(\alpha, \tau, k)$-good mapping, since for all $x \in \mathbf{R}^d$, $S^* \subset \mathrm{supp}_\tau(\mathrm{Att}(x))$ and $|\mathrm{supp}_\tau(\mathrm{Att}(x)) \setminus S^*| \leq \alpha \cdot k$. Then we have $S^* \subseteq \cap_{i \in [n]} \mathrm{supp}_\tau(x_i)$, and $|(\cup_{i \in [n]} \mathrm{supp}_\tau(\mathrm{Att}(x))) \setminus S^*| \leq \alpha k n$ for $x_i$ drawn uniformly at random from an $(\alpha, \tau, k)$-good distribution.
That is to say, $H_2$ in a sequence largely determines the activation pattern.
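
The two set conditions above can be checked on toy data. The following is a minimal sketch, not code from the paper: the function names `supp_tau` and `check_good_sequence`, the toy vectors, and the chosen values of $\alpha$ and $\tau$ are all illustrative assumptions. It computes the thresholded support of each vector, then tests whether $S^*$ lies in the intersection of the supports and whether the number of indices outside $S^*$ stays within $\alpha k n$.

```python
def supp_tau(x, tau):
    """Return the set of indices whose value exceeds the threshold tau."""
    return {j for j, v in enumerate(x) if v > tau}


def check_good_sequence(xs, s_star, alpha, tau):
    """Check the two sequence-level conditions for a heavy-hitter set s_star.

    Returns (contained, extra, within_bound) where
      contained    : s_star is a subset of the intersection of all supports
      extra        : number of indices in the union of supports outside s_star
      within_bound : extra <= alpha * k * n, with k = |s_star| and n = len(xs)
    """
    k = len(s_star)
    supports = [supp_tau(x, tau) for x in xs]
    shared = set.intersection(*supports)
    union = set.union(*supports)
    extra = len(union - s_star)
    return s_star <= shared, extra, extra <= alpha * k * len(xs)


# Toy example (hypothetical values): 3 vectors over m = 6 indices.
S_STAR = {0, 3}
XS = [
    [0.9, 0.1, 0.0, 0.8, 0.6, 0.0],
    [0.7, 0.0, 0.6, 0.9, 0.1, 0.0],
    [0.8, 0.2, 0.1, 0.7, 0.0, 0.0],
]
RESULT = check_good_sequence(XS, S_STAR, alpha=0.5, tau=0.5)
```

In this toy case indices 0 and 3 exceed the threshold in every vector, while indices 2 and 4 each exceed it only once, so the shared set is exactly $S^*$ and the stray mass stays inside the bound.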

(a) Neuron activation pattern of tokens sampled from a single sentence, input in parallel
(b) Neuron activation pattern of tokens sampled from a single sentence, input sequentially
(c) Neuron activation pattern of random tokens, input in parallel
(d) Neuron activation pattern of random tokens, input sequentially
Figure 2: Neuron Activation Pattern Comparisons Across Different Sampling and Input Manners

Figures 2(a) through 2(d) illustrate the phenomenon of activation inertia and its lack of semantic relevance. Figures 2(a) and 2(b) depict the neuron activation pattern when tokens from a single sentence are input either in parallel (each token on its own) or sequentially. Figures 2(c) and 2(d) show the neuron activation pattern when tokens from a random word list are input in the same two manners.

The horizontal axis in Figure 2(a) to Figure 2(d) represents the neuron index, while the vertical axis represents the token index. The colors in the figures indicate the relative activation values for each token by each neuron. To make the data between different neurons and tokens comparable and to clarify the images, the activation values have been normalized in this study. From these images, we can get the following insights:

  1. Sequential input induces activation inertia.

    Compared to Figure 2(a), the narrow vertical lines in Figure 2(b) are more pronounced. A similar pattern is observed when comparing Figure 2(c) with Figure 2(d). This suggests that, for tokens from the same source, when the inputs are fed into the model as a sequence (a "prompt", in the terminology of Griffin [3]), the neurons activated by these tokens exhibit clear clustering. Neurons activated by the previous token demonstrate greater activation inertia and are thus more likely to be activated by subsequent tokens.

  2. Activation inertia is semantic-irrelevant.

    The narrow vertical lines in Figure 2(d) are more pronounced than those in Figure 2(b). This comparison indicates that activation inertia is more intense with tokens from a random word list, which supports the claim that activation inertia is semantically irrelevant, as highlighted in the title of this section.

    The intuitive reason, as this paper posits, follows the theoretical analysis at the beginning of this section: activation patterns are caused by the earliest heavy hitters and maintained by semantically irrelevant activation inertia. The proportion of words with actual semantics is higher in a random word list than in an English sentence, which contains many prepositions and articles, so a random word list causes and maintains activation patterns more strongly.

  3. There is no significant difference in the narrow vertical line patterns between Figure 2(a) and Figure 2(c). Unlike the previous two observations, this comparison does not involve inputting tokens as a sequence. We suggest that this phenomenon aligns with the theoretical derivation at the beginning of this subsection. When activation inertia is at the token level, each token acts as its own heavy hitter, leading to a more diversified activation pattern.

Through Figures 2(a) to 2(d), we have confirmed the claim that sequential input strengthens activation inertia. It may seem self-evident that activation inertia occurs with sequential input rather than parallel input, given that the attention mapping inherently processes all tokens in a sequence.
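
One way to put a number on the "narrow vertical lines" is to measure how much the set of most-activated neurons overlaps between consecutive tokens. The sketch below is an illustrative assumption rather than the measurement used in this paper: `top_k_neurons` and `inertia_score` are hypothetical helper names, the Jaccard overlap is one possible choice of metric, and the activation matrices are toy values (rows are tokens, columns are neurons).

```python
def top_k_neurons(row, k):
    """Indices of the k largest activations in one token's activation row."""
    order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    return set(order[:k])


def inertia_score(acts, k):
    """Mean Jaccard overlap of top-k neuron sets between consecutive tokens.

    A score near 1 means consecutive tokens keep activating the same neurons
    (strong inertia); a score near 0 means the active set changes every token.
    """
    if len(acts) < 2:
        raise ValueError("need at least two tokens to compare")
    sets = [top_k_neurons(row, k) for row in acts]
    overlaps = [len(a & b) / len(a | b) for a, b in zip(sets, sets[1:])]
    return sum(overlaps) / len(overlaps)


# Toy activation matrices (hypothetical values).
CLUSTERED = [
    [0.9, 0.8, 0.1, 0.0],
    [0.8, 0.9, 0.0, 0.1],
    [0.7, 0.9, 0.2, 0.1],
]
DIVERSE = [
    [0.9, 0.8, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.8],
    [0.9, 0.0, 0.8, 0.1],
]
CLUSTERED_SCORE = inertia_score(CLUSTERED, k=2)
DIVERSE_SCORE = inertia_score(DIVERSE, k=2)
```

Under this assumed metric, a pattern like Figure 2(b) or 2(d) would be expected to score higher than a pattern like Figure 2(a) or 2(c).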

However, it is crucial to recognize that this activation pattern, initiated by the earliest heavy hitter and sustained by activation inertia, can persist into subsequent generative processes. By combining sequential input with the RIDA method, it is feasible to precisely identify neurons that require activation during subsequent generation phases, with minimal impact on model performance, thereby realizing a training-free RIDA approach.
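
The idea of picking neurons from the prompt phase and reusing them during generation can be sketched as follows. This is a simplified illustration loosely modeled on prompt-based selection, not the exact procedure of Griffin [3] or of RIDA: the function names, the per-token normalization, and the sum-of-squares score are assumptions made for the example.

```python
import math


def select_neurons_from_prompt(prompt_acts, k):
    """Pick k neurons to keep active, using only prompt-phase activations.

    Each token's activation row is scaled to unit L2 norm so that no single
    token dominates, then each neuron is scored by its summed squared
    activation across tokens (same ranking as the per-neuron L2 norm).
    """
    n_neurons = len(prompt_acts[0])
    scores = [0.0] * n_neurons
    for row in prompt_acts:
        norm = math.sqrt(sum(v * v for v in row))
        if norm == 0.0:
            continue  # an all-zero row carries no information
        for j, v in enumerate(row):
            scores[j] += (v / norm) ** 2
    order = sorted(range(n_neurons), key=lambda j: scores[j], reverse=True)
    return sorted(order[:k])


def mask_activations(row, kept):
    """Zero out every neuron not in the kept set (generation-phase step)."""
    kept = set(kept)
    return [v if j in kept else 0.0 for j, v in enumerate(row)]


# Toy prompt activations (hypothetical): 3 tokens, 4 neurons.
PROMPT_ACTS = [
    [3.0, 0.0, 4.0, 0.0],
    [0.0, 0.0, 5.0, 0.0],
    [1.0, 0.0, 1.0, 0.0],
]
KEPT = select_neurons_from_prompt(PROMPT_ACTS, k=2)
MASKED = mask_activations([1.0, 2.0, 3.0, 4.0], KEPT)
```

Because the selection uses only activations that are already computed while processing the prompt, no predictor needs to be trained, which is what makes such a scheme training-free.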

Griffin’s experiments have already confirmed the correctness of this speculation. Future ablation experiments could further test the correctness of the theory in this subsection by examining the impact on model performance of removing the first heavy hitter in a sentence (rather than all, which would be equivalent to excluding all neurons that need to be activated).

5 Conclusion and Limitations

Massive Over-activation Yielded Uplifts (MOYU) is an intrinsic property of large language models, and leveraging this property through Dynamic Activation (DA) is a promising yet underutilized strategy to enhance inference speed in these models. Traditional methods that exploit MOYU often encounter significant challenges, including maintaining model performance, speeding up inference, and extending their use to various architectures. This paper has developed a mathematical framework that elucidates the origins of the MOYU phenomenon. Through this framework, we have identified two primary limitations of current DA methods: 1) their reliance on ReLU activation functions; 2) their inability to detect active neurons based on semantic similarities.

This paper has the following limitations. First, the mathematical rationale and implementation of the proposed DA methods could introduce complexities that might impede their practical application. Second, this paper highlights that sequence-level activation is predominantly influenced by heavy hitters within the same sequence; however, due to effort constraints, no ablation experiment was conducted. It is anticipated that future research will undertake more extensive experiments.

References

  • [1] Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Re, and Beidi Chen. Deja vu: Contextual sparsity for efficient llms at inference time, 2023.
  • [2] Bowen Pan, Yikang Shen, Haokun Liu, Mayank Mishra, Gaoyuan Zhang, Aude Oliva, Colin Raffel, and Rameswar Panda. Dense training, sparse inference: Rethinking training of mixture-of-experts language models, 2024.
  • [3] Harry Dong, Beidi Chen, and Yuejie Chi. Prompt-prompted mixture of experts for efficient llm generation, 2024.
  • [4] Rishi Bommasani et al. On the opportunities and risks of foundation models, 2022.
  • [5] Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, and Kurt Keutzer. Llm inference unveiled: Survey and roofline model insights, 2024.
  • [6] Ziang Liu, Genggeng Zhou, Jeff He, Tobia Marcucci, Li Fei-Fei, Jiajun Wu, and Yunzhu Li. Model-based control with sparse neural dynamics, 2023.
  • [7] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks, 2019.
  • [8] Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, and Ohad Shamir. Proving the lottery ticket hypothesis: Pruning is all you need, 2020.
  • [9] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, 2017.
  • [10] Pingzhi Li, Zhenyu Zhang, Prateek Yadav, Yi-Lin Sung, Yu Cheng, Mohit Bansal, and Tianlong Chen. Merge, then compress: Demystify efficient smoe with hints from its routing policy, 2024.
  • [11] Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, and Mehrdad Farajtabar. Relu strikes back: Exploiting activation sparsity in large language models, 2023.
  • [12] Zhengyan Zhang, Yixin Song, Guanghui Yu, Xu Han, Yankai Lin, Chaojun Xiao, Chenyang Song, Zhiyuan Liu, Zeyu Mi, and Maosong Sun. Relu2 wins: Discovering efficient activation functions for sparse llms, 2024.
  • [13] Chenyang Song, Xu Han, Zhengyan Zhang, Shengding Hu, Xiyu Shi, Kuai Li, Chen Chen, Zhiyuan Liu, Guangli Li, Tao Yang, and Maosong Sun. Prosparse: Introducing and enhancing intrinsic activation sparsity within large language models, 2024.
  • [14] LLaMA-MoE Team. Llama-moe: Building mixture-of-experts from llama with continual pre-training, Dec 2023.
  • [15] Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Z. Morley Mao, Beidi Chen, Fan Lai, and Atul Prakash. Learn to be efficient: Build structured sparsity in large language models, 2024.
  • [16] Zexuan Zhong, Mengzhou Xia, Danqi Chen, and Mike Lewis. Lory: Fully differentiable mixture-of-experts for autoregressive language model pre-training, 2024.
  • [17] Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, and Sanjiv Kumar. The lazy neuron phenomenon: On emergence of activation sparsity in transformers, 2023.
  • [18] Georgios Georgiadis. Accelerating convolutional neural networks via activation map compression, 2019.
  • [19] Mark Kurtz, Justin Kopinsky, Rati Gelashvili, Alexander Matveev, John Carr, Michael Goin, William Leiserson, Sage Moore, Nir Shavit, and Dan Alistarh. Inducing and exploiting activation sparsity for fast inference on deep neural networks. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5533–5543. PMLR, 13–18 Jul 2020.
  • [20] Zeqi Zhu, Arash Pourtaherian, Luc Waeijen, Egor Bondarev, and Orlando Moreira. Star: Sparse thresholded activation under partial-regularization for activation sparsity exploration. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 4554–4563, 2023.
  • [21] Chi Ma, Mincong Huang, Chao Wang, Yujie Wang, and Lei Yu. Dynamic activation pitfalls in llama models: An empirical study, 2024.
  • [22] Mingjie Sun, Xinlei Chen, J. Zico Kolter, and Zhuang Liu. Massive activations in large language models, 2024.
  • [23] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H2o: Heavy-hitter oracle for efficient generative inference of large language models, 2023.