MOYU: A Theoretical Study on Massive Over-activation Yielded Uplifts in LLMs

Chi Ma (Meituan), Mincong Huang (Meituan), Chao Wang (Meituan), Yujie Wang (Meituan, contact author), Lei Yu (Meituan)

Abstract

Massive Over-activation Yielded Uplifts (MOYU) is an inherent property of large language models, and dynamic activation (DA) based on the MOYU property is a clever yet under-explored strategy for accelerating inference in these models. Existing methods that utilize MOYU often face a significant 'Impossible Trinity': they struggle to simultaneously maintain model performance, enhance inference speed, and extend applicability across various architectures. To address the theoretical ambiguities surrounding MOYU, this paper elucidates the root cause of the MOYU property and outlines the mechanisms behind two primary limitations encountered by current DA methods: 1) history-related activation uncertainty, and 2) semantic-irrelevant activation inertia. Our analysis not only underscores the limitations of current dynamic activation strategies within large-scale LLaMA models but also identifies opportunities for refining the design of future sparsity schemes.

1 Introduction

Large language models(LLMs), including LLaMA, GPT, and OPT series, have showcased remarkable performance and in-context learning capabilities through the utilization of extensive parameters. However, their computational and memory demands during inference, especially in scenarios sensitive to latency, are substantial. To address these challenges, various techniques based on Massive Over-activation Yielded Uplifts(MOYU) have been proposed. These methods aim to reduce latency in these models by minimizing the excessive activation of heads, neurons, or weights during inference.

Existing MOYU-based techniques can be categorized into static and dynamic activation methods. Static activation(SA), such as pruning, reduces the excess activated weights in LLMs based on metrics like magnitude, and can be applied either once or iteratively. These configurations remain unchanged for all subsequent inputs and are fully activated during inference. However, a key limitation of SA is that once the process is complete, the deactivated weights cannot be reactivated without undergoing a recovery phase, which could lead to diminished performance and a loss of in-context learning capabilities. Furthermore, the iterative process of SA requires substantial additional training efforts, which may not proportionately enhance the speedup.

On the other hand, MOYU-based dynamic activation(DA) offers adaptability by selectively activating certain heads or neurons during inference, thereby enhancing computational efficiency. This approach capitalizes on the inherent property of massive over-activation found in LLMs to optimize resource utilization. The existing research on DA can be categorized as follows:

1. Threshold Dynamic Activation (TDA): TDA uses a predefined threshold to determine which activation units to retain or discard, as depicted in Figure 1(a). Units with activation values below this threshold are either set to zero or removed during the current forward propagation, thus reducing computational overhead.

2. Router-off-the-loop Dynamic Activation (RODA): This method employs a pre-trained router block to dynamically determine which activation units are crucial during the model's forward propagation. The router is trained using the model's historical data. For instance, DejaVu[1] utilizes a predictive router comprising a two-layer linear network, as shown in Figure 1(b).

3. Router-in-the-loop Dynamic Activation (RIDA): In contrast to RODA, the router in RIDA makes decisions based on the current input and contextual information. This enables the router to adjust its routing strategy in real time, adeptly managing the complexities of the task at hand and thereby enhancing both efficiency and accuracy. RIDA is primarily implemented within the Mixture of Experts (MoE) structure (Figure 1(c)). For example, DS-MoE[2] employs a TopK router in its framework. Similarly, Griffin[3] also utilizes a TopK router to construct an MoE from a dense model in a training-free manner (Figure 1(d)).
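The difference between threshold-based and TopK-based selection can be illustrated with a minimal sketch in plain Python. The function names, the toy pre-activations, and the choice of absolute activation value as the routing score are assumptions made for illustration; they are not taken from DejaVu, DS-MoE, or Griffin.

```python
import math

def silu(z):
    """SiLU / Swish-1 activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def threshold_mask(activations, threshold):
    """TDA-style mask: keep units whose |activation| reaches the threshold."""
    return [abs(a) >= threshold for a in activations]

def topk_mask(scores, k):
    """TopK-router-style mask: keep the k units with the largest scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(scores))]

def apply_mask(activations, mask):
    """Zero out the units that the mask deactivates."""
    return [a if m else 0.0 for a, m in zip(activations, mask)]

# Toy pre-activations for five hypothetical neurons.
pre_acts = [-2.0, -0.1, 0.05, 0.8, 3.0]
acts = [silu(z) for z in pre_acts]

# TDA: the number of surviving units depends on the data.
tda = apply_mask(acts, threshold_mask(acts, threshold=0.1))

# TopK: the number of surviving units is fixed in advance (here k = 2).
topk = apply_mask(acts, topk_mask([abs(a) for a in acts], k=2))

print(threshold_mask(acts, 0.1))  # [True, False, False, True, True]
print(topk_mask([abs(a) for a in acts], 2))  # [False, False, False, True, True]
```

Under these assumptions, the threshold keeps three of the five units while TopK keeps exactly two, which shows why a fixed threshold gives no guarantee on the achieved sparsity level.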

Given the ’Impossible Trinity’, TDA struggles to enhance inference speed, while RODA fails to extend its applicability between ReLU and non-ReLU activated architectures. MoE-based RIDA necessitates training with the entire model, whereas sequential RIDA, as demonstrated by Griffin, can simultaneously address these three challenges.

However, despite these advancements, current research on MOYU and DA still lacks a comprehensive theoretical framework that adequately explains the MOYU phenomena across various architectures and activation functions, as well as the underlying mechanisms of MOYU within sequences.

Firstly, we have developed a mathematical rationale that elucidates the origins of the MOYU phenomenon. Then, from this perspective, we have analyzed the causes of two major limitations of existing DA methods:

• Limitation 1: the restriction to ReLU activation functions. We suggest that at the token level, history-information-related activation uncertainty (Section 4.1) makes it challenging to predict the importance of weights in non-ReLU models, thereby limiting token-level RODA methods to ReLU models.

• Limitation 2: the inability to identify active neurons based on semantic similarity. We suggest that at the sequence level, neuron activation is semantically irrelevant (Section 4.2). In other words, neurons are more likely to be activated by the most dominant elements within the same sequence than by the semantic content of the input itself, which in turn limits sequence-level DA to RIDA instead of RODA.

In short, it is disheartening that technically, we only have three DA strategies: token-level RODA for ReLU models, token-level RIDA (MoE), and sequence-level TDA and RIDA as discussed in this paper.

The rest of the paper is organized as follows. Related works are reviewed in Section 2. We introduce our universal theoretical framework in Section 3 and Section 4, and draw conclusions and limitations in Section 5.

2 Related Works

2.1 Massive Over-activation

In the study of Large Language Models (LLMs), the term massive over-activation refers to the excessive activation of numerous neurons during task execution, which can lead to computational waste and decreased efficiency[4, 5]. Research[6] indicates that dense deep neural networks often suffer from this issue. By treating the discrete sparsification process as a continuous problem, end-to-end optimization of the model architecture becomes feasible. The Lottery Ticket Hypothesis[7, 8] further emphasizes the significance of pruning techniques in reducing unnecessary connections and mitigating over-activation in dense models.

Additional research[9] introduces this concept through a "sparsely-gated mixture-of-experts (MoE) layer", which increases model capacity while reducing computational costs. Moreover, MC-SMoE[10] tackles the issue of massive over-activation in MoEs by streamlining the model architecture. This is achieved through the merging and low-rank decomposition of redundant experts, guided by the router's information.

2.2 TDA and RODA

Research[6, 11] elucidates the capacity of the ReLU to introduce activation sparsity and proposes the concept of dynamic activation. DejaVu[1] identifies that the sparsity introduced by ReLU can be predicted, thus proposing the first viable RODA scheme. On the OPT series, DejaVu facilitates a 2-6x acceleration in inference latency at 75% sparsity. Building upon the DejaVu approach, ReLU²[12] first applies TDA to non-ReLU models and achieves nearly 70% sparsity with minimal loss to model performance. ProSparse[13] proposes a practical DA inference framework and, building on ReLU², achieves only a 1-percent increase in perplexity at approximately 80% sparsity by replacing the activation function and continuing to induce sparsity.

2.3 RIDA

Router-in-the-loop is the predominant method within the MoE framework. Unlike TDA and RODA methods, most RIDA approaches rely on training an expert router to facilitate dynamic activation. LLaMA-MoE[14] transforms feed-forward networks(FFNs) into MoEs by constructing experts and training an additional gating network for expert routing. DS-MoE[2] introduces a framework that utilizes dense computation during training and switches to sparse computation during inference, significantly enhancing parameter efficiency over traditional sparse MoE methods and reducing the total parameter count. Learn-To-be-Efficient[15] achieves an optimal balance between sparsity and performance by activating fewer neurons and is applicable to models with both ReLU and non-ReLU activation functions. Lory[16] retains the autoregressive properties of language models by adopting a causally segmented routing strategy and a similarity-based data batching method. This approach enables efficient expert merging operations and promotes specialization among experts in processing similar documents during training sessions.

3 Unveiling MOYU

Section 2 reviewed the literature relevant to MOYU. This section outlines the theoretical foundations of MOYU and then presents a mathematical derivation. Following literature[17], we show how massive over-activation arises and why SwiGLU cannot produce greater sparsity than ReLU.

Assuming a neural network as in Equation 1:

$$f(\boldsymbol{x})=\boldsymbol{V}\sigma(p(\boldsymbol{x};\boldsymbol{\theta})) \tag{1}$$

where $\boldsymbol{V}=[v_{1},...,v_{d_{ff}}]$ is the network parameter of the last layer, drawn from a random distribution, $\sigma(\cdot)$ is the SwiGLU activation function, and $p(\boldsymbol{x};\boldsymbol{\theta})$ denotes all other layers with parameter $\boldsymbol{\theta}$. We write $p=p(\boldsymbol{x};\boldsymbol{\theta})$ for simplicity.

Consider the cross-entropy (CE) loss function $\ell_{CE}(f(\boldsymbol{x}),\boldsymbol{y})$, where $\boldsymbol{y}$ is an arbitrary vector that sums to one and is independent of $\boldsymbol{V}$. Assume that the entries of $\boldsymbol{V}$ are drawn from independent distributions, that the probability of any entry of $\boldsymbol{V}$ being 0 is less than 1, and that $E[\boldsymbol{V}]=0$. If there exists an $i^{*}$ such that $p_{i^{*}}>0$, then we have Equation 2:

$$\frac{\partial\ell}{\partial p_{i^{*}}}=\left\langle\frac{\partial\ell}{\partial f},\frac{\partial f}{\partial p_{i^{*}}}\right\rangle=\left\langle\frac{\partial\ell}{\partial f},v_{i^{*}}\right\rangle \tag{2}$$

Substituting the CE loss function into Equation 2 yields Equation 3:

$$\begin{aligned}\frac{\partial\ell_{CE}}{\partial f}&=\frac{\exp(f(\boldsymbol{x}))}{\left\langle \exp(f(\boldsymbol{x})),\boldsymbol{1}\right\rangle}-\boldsymbol{y}\\&=\frac{\exp(\sum_{i}\sigma(p_{i})\cdot\boldsymbol{v_{i}})}{\left\langle \exp(\sum_{i}\sigma(p_{i})\cdot\boldsymbol{v_{i}}),\boldsymbol{1}\right\rangle}-\boldsymbol{y}\end{aligned} \tag{3}$$

By substituting Equation 3 back into Equation 2, we can obtain Equation 4:

$$\frac{\partial\ell_{CE}}{\partial p_{i^{*}}}=\frac{\left\langle \exp(\sum_{i}\sigma(p_{i})\cdot\boldsymbol{v_{i}}),\boldsymbol{v_{i^{*}}}\right\rangle}{\left\langle \exp(\sum_{i}\sigma(p_{i})\cdot\boldsymbol{v_{i}}),\boldsymbol{1}\right\rangle}-\left\langle\boldsymbol{v_{i^{*}}},\boldsymbol{y}\right\rangle \tag{4}$$

Expanding the numerator of Equation 4 yields Equation 5. In Equation 5, we assume that the parameters $\theta$ and $\tau$ have no negative features. Let $p_{i^{*}}^{0}=Swish_{1}(x\theta)\odot(x\tau)$ and $p_{i^{*}}^{1}=ReLU(x)$ denote the SwiGLU and ReLU cases, respectively. It is easy to see that $Swish_{1}(x\theta)<x\theta$ when $x>0$, so that both $p_{i^{*}}^{0}<x\theta=p_{i^{*}}^{1}$ and $p_{i^{*}}^{0}<x\tau$ hold.

$$\begin{aligned}\left\langle \exp(\sum_{i}\sigma(p_{i})\cdot\boldsymbol{v_{i}}),\boldsymbol{v_{i^{*}}}\right\rangle&=\sum_{m}v_{i^{*},m}\cdot \exp(\sum_{i}\sigma(p_{i})\cdot v_{i,m})\\&=\sum_{m}v_{i^{*},m}\cdot \exp(p_{i^{*}}\cdot v_{i^{*},m})\cdot \exp(\sum_{i\neq i^{*}}\sigma(p_{i})\cdot v_{i,m})\end{aligned} \tag{5}$$

Similar to literature[17], $\mathrm{E}[\frac{\partial\ell_{CE}}{\partial p_{i^{*}}}]>0$ also holds here, since the expectation of $\boldsymbol{V}$ is zero and the transformation of the activation function does not change the non-negative property of the loss expectations.

$$\mathrm{E}\left[\frac{C_{1}V\cdot \exp(pV)}{C_{2}\exp(pV)+C_{3}}\right]=\mathrm{E}\left[\frac{C_{1}V}{C_{2}+C_{3}\exp(-pV)}\right] \tag{6}$$

The first term on the right-hand side (RHS) of Equation 4 can be simplified to the form presented in Equation 6, while the expectation of the second term on the RHS is zero. Given $p_{i^{*}}^{0}<p_{i^{*}}^{1}$, Equation 6 demonstrates that switching the activation function from ReLU to SwiGLU decreases the expected value of the loss function.

This implies that if there exists an $i^{*}$ such that $p_{i^{*}}>0$ , the gradient of the cross-entropy loss with respect to any positive activation $p_{i^{*}}>0$ is positive in expectation. Consequently, any training algorithm that follows the negative gradient direction tends to reduce the magnitude of such positive activation, leading to a smaller training loss and thus promoting sparsity.

In this process, the ReLU activation function causes a greater reduction in magnitude compared to SwiGLU.
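The elementary inequality used in this section, $Swish_{1}(z)<z$ for $z>0$, and the fact that only ReLU produces exact zeros, can be checked numerically. The following is a minimal sketch in plain Python; the sample points are arbitrary and chosen only for illustration.

```python
import math

def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)

def silu(z):
    """SiLU / Swish-1 activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

# Arbitrary sample points on both sides of zero.
zs = [-3.0, -1.0, -0.25, 0.25, 1.0, 3.0]

# For positive inputs, Swish-1 is strictly smaller than the identity (and than ReLU).
swish_below_relu = all(silu(z) < relu(z) for z in zs if z > 0)

# ReLU maps every negative input to an exact zero; Swish-1 never does.
exact_zeros_relu = sum(1 for z in zs if relu(z) == 0.0)
exact_zeros_silu = sum(1 for z in zs if silu(z) == 0.0)

print(swish_below_relu)   # True
print(exact_zeros_relu)   # 3
print(exact_zeros_silu)   # 0
```

On these sample points, half of the ReLU outputs are exactly zero while none of the Swish-1 outputs are, which is consistent with ReLU yielding the sparser activation map.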

4 Sequencing MOYU

In Section 3, this paper theoretically deduces the root causes of the MOYU phenomenon and explores how non-ReLU activation functions might mitigate it. The literature[18, 19, 20] has highlighted that the current level of activation map sparsity is insufficient to fully exploit the performance of DA methods. In this section, we identify two limitations associated with choosing DA methods as discussed in Sections 4.1 and 4.2.

4.1 History-related Activation Uncertainty

RODA schemes excel in models that utilize ReLU as the activation function[11, 1, 12, 13]. However, in models employing non-ReLU activation functions, the offline-trained router struggles to accurately select which heads and neurons will be activated[21, 3].

We suggest in this section that the failure of RODA in non-ReLU scenarios is closely linked to shifts in weight importance under different historical inputs: a router trained on diverse historical activation data may find it challenging to accurately identify the weights that are most crucial for the current input.

As before, we assume a ReLU-activated model of the form described in Equation 1. The simplified loss for input token $x_{i}$ can be written as Equation 7:

$$L_{i}=\left(\frac{\partial f}{\partial x_{i}}\mathrm{d}x_{i}+\frac{\partial f}{\partial\mathbf{\theta}_{i}}\mathrm{d}\mathbf{\theta}_{i}\right)^{T}\left(\frac{\partial f}{\partial x_{i}}\mathrm{d}x_{i}+\frac{\partial f}{\partial\mathbf{\theta}_{i}}\mathrm{d}\mathbf{\theta}_{i}\right) \tag{7}$$

The sensitivity to weight changes (gradients) in model training is given by Equation 8:

$$\frac{\partial L_{i}}{\partial\mathrm{d}\mathbf{\theta}_{i}}=2\left(\frac{\partial f}{\partial x_{i}}\mathrm{d}x_{i}+\frac{\partial f}{\partial\mathbf{\theta}_{i}}\mathrm{d}\mathbf{\theta}_{i}\right)\frac{\partial f}{\partial\mathbf{\theta}_{i}} \tag{8}$$

By summing gradients, we have Equation 9:

$$\begin{aligned}\nabla_{\mathrm{d}\theta_{i}}L&=\sum_{i}2\left(\frac{\partial f}{\partial x_{i}}\mathrm{d}x_{i}+\frac{\partial f}{\partial\mathbf{\theta}_{i}}\mathrm{d}\mathbf{\theta}_{i}\right)\frac{\partial f}{\partial\mathbf{\theta}_{i}}\\&=\nabla_{\mathrm{d}\theta_{i}}L_{i}+\sum_{j=0}^{i-1}\nabla_{\mathrm{d}\theta_{j}}L_{j}\end{aligned} \tag{9}$$

The importance of the model weights can then be described by Equation 10:

$$\begin{aligned}\Theta_{i}&=\sum_{i}\|V\cdot\nabla_{\mathrm{d}\theta_{i}}L_{i}\|\\&=\|V\|\cdot\sum_{i}\|\nabla_{\mathrm{d}\theta_{i}}L_{i}\|\\&=\|V\|\cdot\left(\nabla_{\mathrm{d}\theta_{i}}L_{i}+\sum_{j=0}^{i-1}\nabla_{\mathrm{d}\theta_{j}}L_{j}\right)\\&=\|V\|\cdot\nabla_{\mathrm{d}\theta_{i}}L_{i}+\Theta_{i-1}\end{aligned} \tag{10}$$

This means that the weight importance of a model is related not only to the current input along the direction of $\theta$, but also to the cumulative gradient information from all previous data.

For models utilizing ReLU activation, Equation 10 simplifies to the sum of the weights corresponding to positive inputs, which linearly correlates with the magnitude of the current weights themselves. However, for models employing non-ReLU activations, the significance of the current weights becomes considerably more complex.
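The recursion in the last line of Equation 10, $\Theta_{i}=\|V\|\cdot\nabla_{\mathrm{d}\theta_{i}}L_{i}+\Theta_{i-1}$, can be illustrated with a minimal scalar sketch. The function name and the toy gradient values are hypothetical; each gradient is reduced to a single non-negative number purely for illustration.

```python
def cumulative_importance(v_norm, grad_terms):
    """Return Theta_i for every step i, using Theta_i = v_norm * g_i + Theta_{i-1}.

    v_norm     -- scalar stand-in for ||V||
    grad_terms -- scalar stand-ins for the per-token gradient terms g_0, g_1, ...
    """
    thetas = []
    running = 0.0  # Theta before any token has been seen
    for g in grad_terms:
        running = v_norm * g + running
        thetas.append(running)
    return thetas

# Two hypothetical histories that end with the SAME current-token gradient (1.0).
history_a = [0.5, 0.25, 1.0]
history_b = [2.0, 0.25, 1.0]

theta_a = cumulative_importance(2.0, history_a)
theta_b = cumulative_importance(2.0, history_b)

print(theta_a)  # [1.0, 1.5, 3.5]
print(theta_b)  # [4.0, 4.5, 6.5]
```

Although the current-token term is identical in both runs, the final importance differs (3.5 versus 6.5) because of the accumulated history. This is the dependence on past inputs that an offline-trained router would have to account for.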

4.2 Semantic-irrelevant Activation Inertia

Using a simplified loss function, Section 4.1 demonstrated that models with non-ReLU activation rely on historical information to accurately decide which neurons to activate. This section reveals that historical information is significantly influenced by the Heavy Hitter ( $H_{2}$ ), and the occurrence of $H_{2}$ is not related to semantics[22].

Following literature[23], let $H_{2}:S^{*}\subset[m]$ with $k=|S^{*}|$, let $\tau\in(0,1)$ denote a threshold, and let $\alpha\in(0,1)$ denote a fraction of mass (larger than $\tau$) outside $S^{*}$.

It is natural that attention with $H_{2}$ is an $(\alpha,\tau,k)$-good mapping, since for all $x\in\mathbb{R}^{d}$, $S^{*}\subset supp_{\tau}(Att(x))$ and $|supp_{\tau}(Att(x))\setminus S^{*}|\leq\alpha\cdot k$. Then we have $S^{*}\subseteq\cap_{i\in[n]}supp_{\tau}(x_{i})$ and $|(\cup_{i\in[n]}supp_{\tau}(Att(x)))\setminus S^{*}|\leq\alpha kn$ for $x_{i}$ drawn uniformly at random from an $(\alpha,\tau,k)$-good distribution. That is to say, the $H_{2}$ in a sequence largely determines the activation pattern.
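The two conditions of an $(\alpha,\tau,k)$-good mapping can be checked on toy data with the minimal sketch below. It assumes that $supp_{\tau}(y)$ is the set of indices whose magnitude exceeds $\tau$ and it treats the subset condition as non-strict; the attention rows and the heavy-hitter set are invented for illustration.

```python
def supp(values, tau):
    """Indices whose magnitude exceeds tau (assumed definition of supp_tau)."""
    return {i for i, v in enumerate(values) if abs(v) > tau}

def is_good_mapping(att_rows, heavy, alpha, tau):
    """Check both conditions for every row of attention scores.

    1. The heavy-hitter set is contained in the tau-support of the row.
    2. At most alpha * k indices outside the heavy-hitter set exceed tau.
    """
    k = len(heavy)
    for row in att_rows:
        s = supp(row, tau)
        if not heavy <= s:
            return False
        if len(s - heavy) > alpha * k:
            return False
    return True

# Toy attention scores for n = 3 inputs over m = 4 positions (hypothetical values).
att_rows = [
    [0.50, 0.30, 0.10, 0.10],
    [0.40, 0.30, 0.25, 0.05],
    [0.45, 0.35, 0.05, 0.15],
]
heavy = {0, 1}           # assumed heavy-hitter set S*, so k = 2
alpha, tau = 0.5, 0.2

supports = [supp(row, tau) for row in att_rows]
common = set.intersection(*supports)        # indices active for every input
extra = set.union(*supports) - heavy        # indices active outside S*

print(is_good_mapping(att_rows, heavy, alpha, tau))  # True
print(common)                                        # {0, 1}
print(len(extra) <= alpha * len(heavy) * len(att_rows))  # True
```

In this toy case the heavy hitters are active for every input, and only one additional index ever exceeds the threshold, well within the $\alpha kn$ bound stated above.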

Figures 2(a) through 2(d) showcase the phenomenon of activation inertia and its lack of semantic relevance. Figures 2(a) and 2(b) depict the activation pattern of neurons when tokens from a single sentence are input either individually or sequentially. Conversely, Figures 2(c) and 2(d) reveal the neuron activation when tokens from a random word list are introduced in the same formats.

The horizontal axis in Figure 2(a) to Figure 2(d) represents the neuron index, while the vertical axis represents the token index. The colors in the figures indicate the relative activation values for each token by each neuron. To make the data between different neurons and tokens comparable and to clarify the images, the activation values have been normalized in this study. From these images, we can get the following insights:

1. Sequential input induces activation inertia. Compared to Figure 2(a), the narrow vertical lines in Figure 2(b) are more pronounced. A similar pattern is observed when comparing Figure 2(c) with Figure 2(d). This suggests that for tokens from the same source, when they are fed into the model as a sequence (which Griffin[3] refers to as a "prompt"), the neurons activated by these input tokens exhibit a clear clustering phenomenon. Neurons activated by the previous token demonstrate greater activation inertia and are thus more likely to be activated by subsequent tokens.

2. Activation inertia is semantically irrelevant. The narrow vertical lines in Figure 2(d) are more pronounced than those in Figure 2(b). This comparison indicates that the observed activation inertia is more intense with tokens from a random word list, confirming that activation inertia is semantically irrelevant, as highlighted in the title of this section. The intuitive reason behind this, we posit, follows the theoretical analysis at the beginning of this section: activation patterns are caused by the earliest heavy hitters and maintained by semantically irrelevant activation inertia. The proportion of words with actual semantics is higher in a random word list than in an English sentence with many prepositions and articles, so the random word list causes and maintains activation patterns more strongly.

3. There is no significant difference in the narrow vertical line patterns between Figure 2(a) and Figure 2(c). Unlike the previous two observations, this comparison does not involve inputting tokens as a sequence. We suggest that this phenomenon aligns with the theoretical derivation at the beginning of this subsection. When activation inertia is at the token level, each token acts as its own heavy hitter, leading to a more diversified activation pattern.

Through Figures 2(a) to 2(d), we have confirmed the claim that sequential input strengthens activation inertia. It may seem evident that activation inertia occurs with sequential input rather than parallel input, given that attention mapping inherently processes all words in sequence.

However, it is crucial to recognize that this activation pattern, initiated by the earliest heavy hitter and sustained by activation inertia, can persist into subsequent generative processes. By combining sequential input with the RIDA method, it is feasible to precisely identify neurons that require activation during subsequent generation phases, with minimal impact on model performance, thereby realizing a training-free RIDA approach.

Griffin’s experiments have already confirmed the correctness of this speculation. Future ablation experiments could further test the correctness of the theory in this subsection by examining the impact on model performance of removing the first heavy hitter in a sentence (rather than all, which would be equivalent to excluding all neurons that need to be activated).
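The idea of reusing a prompt-induced activation pattern during generation can be sketched as follows. This is a minimal illustration in plain Python, loosely modeled on the sequence-level TopK selection described above and not the Griffin authors' implementation. The scoring rule (per-token normalization followed by a per-neuron norm across tokens), the function name, and the toy activations are all assumptions.

```python
import math

def select_neurons(prompt_acts, k):
    """Pick k neurons from prompt-phase activations (rows = tokens, columns = neurons)."""
    # Normalize each token's activation vector so no single token dominates.
    normed = []
    for row in prompt_acts:
        norm = math.sqrt(sum(a * a for a in row)) or 1.0  # guard against all-zero rows
        normed.append([a / norm for a in row])

    # Score each neuron by its aggregated relative activation across the prompt.
    n_neurons = len(prompt_acts[0])
    scores = [math.sqrt(sum(row[j] ** 2 for row in normed)) for j in range(n_neurons)]

    # Keep the k highest-scoring neurons, returned in index order.
    ranked = sorted(range(n_neurons), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

# Hypothetical FFN activations for a 3-token prompt and 4 neurons.
prompt_acts = [
    [0.10, 2.00, 0.00, 1.00],
    [0.20, 1.50, 0.10, 1.20],
    [0.00, 1.80, 0.05, 0.90],
]

# The selected subset would then be the only neurons evaluated during generation.
active = select_neurons(prompt_acts, k=2)
print(active)  # [1, 3]
```

Because the selection is computed once from the prompt and requires no trained router, a scheme of this kind remains training-free, at the cost of assuming that the prompt-phase pattern persists through generation.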

5 Conclusion and Limitations

Massive Over-activation Yielded Uplifts (MOYU) is an intrinsic characteristic of large language models, and leveraging this property through Dynamic Activation (DA) is a promising yet underutilized strategy to enhance inference speed in these models. Traditional methods that exploit MOYU often encounter significant challenges in maintaining model performance, speeding up inference, or extending their use to various architectures. This paper has developed a mathematical framework that elucidates the origins of the MOYU phenomenon. Through this framework, we have identified two primary limitations of current DA methods: 1) their reliance on ReLU activation functions; 2) their inability to detect active neurons based on semantic similarities.

This paper has the following limitations. First, the mathematical rationale and implementation of the proposed DA methods could introduce complexities that might impede their practical application. Additionally, this paper highlights that sequence-level activation is predominantly influenced by heavy hitters within the same sequence; however, due to effort constraints, no ablation experiment has been conducted. It is anticipated that future research will undertake more extensive experiments.

References

[1] Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivastava, Ce Zhang, Yuandong Tian, Christopher Ré, and Beidi Chen. Deja vu: Contextual sparsity for efficient llms at inference time, 2023.
[2] Bowen Pan, Yikang Shen, Haokun Liu, Mayank Mishra, Gaoyuan Zhang, Aude Oliva, Colin Raffel, and Rameswar Panda. Dense training, sparse inference: Rethinking training of mixture-of-experts language models, 2024.
[3] Harry Dong, Beidi Chen, and Yuejie Chi. Prompt-prompted mixture of experts for efficient llm generation, 2024.
[4] Rishi Bommasani et al. On the opportunities and risks of foundation models, 2022.
[5] Zhihang Yuan, Yuzhang Shang, Yang Zhou, Zhen Dong, Zhe Zhou, Chenhao Xue, Bingzhe Wu, Zhikai Li, Qingyi Gu, Yong Jae Lee, Yan Yan, Beidi Chen, Guangyu Sun, and Kurt Keutzer. Llm inference unveiled: Survey and roofline model insights, 2024.
[6] Ziang Liu, Genggeng Zhou, Jeff He, Tobia Marcucci, Li Fei-Fei, Jiajun Wu, and Yunzhu Li. Model-based control with sparse neural dynamics, 2023.
[7] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks, 2019.
[8] Eran Malach, Gilad Yehudai, Shai Shalev-Shwartz, and Ohad Shamir. Proving the lottery ticket hypothesis: Pruning is all you need, 2020.
[9] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer, 2017.
[10] Pingzhi Li, Zhenyu Zhang, Prateek Yadav, Yi-Lin Sung, Yu Cheng, Mohit Bansal, and Tianlong Chen. Merge, then compress: Demystify efficient smoe with hints from its routing policy, 2024.
[11] Iman Mirzadeh, Keivan Alizadeh, Sachin Mehta, Carlo C Del Mundo, Oncel Tuzel, Golnoosh Samei, Mohammad Rastegari, and Mehrdad Farajtabar. Relu strikes back: Exploiting activation sparsity in large language models, 2023.
[12] Zhengyan Zhang, Yixin Song, Guanghui Yu, Xu Han, Yankai Lin, Chaojun Xiao, Chenyang Song, Zhiyuan Liu, Zeyu Mi, and Maosong Sun. Relu² wins: Discovering efficient activation functions for sparse llms, 2024.
[13] Chenyang Song, Xu Han, Zhengyan Zhang, Shengding Hu, Xiyu Shi, Kuai Li, Chen Chen, Zhiyuan Liu, Guangli Li, Tao Yang, and Maosong Sun. Prosparse: Introducing and enhancing intrinsic activation sparsity within large language models, 2024.
[14] LLaMA-MoE Team. Llama-moe: Building mixture-of-experts from llama with continual pre-training, Dec 2023.
[15] Haizhong Zheng, Xiaoyan Bai, Xueshen Liu, Z. Morley Mao, Beidi Chen, Fan Lai, and Atul Prakash. Learn to be efficient: Build structured sparsity in large language models, 2024.
[16] Zexuan Zhong, Mengzhou Xia, Danqi Chen, and Mike Lewis. Lory: Fully differentiable mixture-of-experts for autoregressive language model pre-training, 2024.
[17] Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix Yu, Ruiqi Guo, and Sanjiv Kumar. The lazy neuron phenomenon: On emergence of activation sparsity in transformers, 2023.
[18] Georgios Georgiadis. Accelerating convolutional neural networks via activation map compression, 2019.
[19] Mark Kurtz, Justin Kopinsky, Rati Gelashvili, Alexander Matveev, John Carr, Michael Goin, William Leiserson, Sage Moore, Nir Shavit, and Dan Alistarh. Inducing and exploiting activation sparsity for fast inference on deep neural networks. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 5533–5543. PMLR, 13–18 Jul 2020.
[20] Zeqi Zhu, Arash Pourtaherian, Luc Waeijen, Egor Bondarev, and Orlando Moreira. Star: Sparse thresholded activation under partial-regularization for activation sparsity exploration. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 4554–4563, 2023.
[21] Chi Ma, Mincong Huang, Chao Wang, Yujie Wang, and Lei Yu. Dynamic activation pitfalls in llama models: An empirical study, 2024.
[22] Mingjie Sun, Xinlei Chen, J. Zico Kolter, and Zhuang Liu. Massive activations in large language models, 2024.
[23] Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, Zhangyang Wang, and Beidi Chen. H₂o: Heavy-hitter oracle for efficient generative inference of large language models, 2023.
