Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs

David S. Hippocampus
Department of Computer Science
Cranberry-Lemon University
Pittsburgh, PA 15213
[email protected]
Abstract

The rapid advancement of large language models (LLMs) has led to architectures with billions to trillions of parameters, posing significant deployment challenges due to their substantial demands on memory, processing power, and energy consumption. Sparse Mixture-of-Experts (SMoE) architectures have emerged as a solution, activating only a subset of parameters per token, thereby achieving faster inference while maintaining performance. However, SMoE models still face limitations in broader deployment due to their large parameter counts and significant GPU memory requirements. In this work, we introduce a gradient-free evolutionary strategy named Efficient Expert Pruning (EEP) to enhance the pruning of experts in SMoE models. Specifically, EEP searches for the pruning pattern and uses expert merging as a memory-efficient way of fine-tuning the pruned model. EEP relies solely on model inference (i.e., no gradient computation) and achieves greater sparsity while maintaining or even improving performance on downstream tasks. EEP can be used to reduce both the total number of experts (thus saving GPU memory) and the number of active experts (thus accelerating inference). For example, we demonstrate that pruning up to 75% of experts in Mixtral $8\times 7$B-Instruct results in a substantial reduction in parameters with minimal performance loss. Remarkably, we observe improved performance on certain tasks, such as a significant increase in accuracy on the SQuAD dataset (from 53.4% to 75.4%), when pruning half of the experts. With these results, EEP not only lowers the barrier to deploying SMoE models, but also challenges the conventional understanding of model pruning by showing that fewer experts can lead to better task-specific performance without any fine-tuning. Code is available at https://github.com/imagination-research/EEP.

Correspondence to Yu Wang <[email protected]>, Zinan Lin <[email protected]>, Xuefei Ning <[email protected]>.

1 Introduction

Large language models have advanced significantly, evolving into highly versatile tools [23, 7, 3, 46, 61, 33]. As these models grow in accordance with scaling laws [21], the norm has shifted towards architectures with billions to trillions of parameters. However, the larger scale brings considerable deployment challenges due to increased demands on memory, processing power, and energy consumption [65, 53]. In response to these challenges, there is a notable trend towards adopting sparse Mixture-of-Experts (SMoE) architectures [45, 14, 27, 19], as seen in models such as Mixtral $8\times 7$B and $8\times 22$B [20], Qwen1.5-MoE-A2.7B [4], Qwen 2-57B-A14B [40], DBRX [50], and Grok-1 [57]. SMoE models activate only a subset of parameters for each token, resulting in faster inference while maintaining competitive performance compared to dense models of the same scale. For example, Mixtral $8\times 7$B outperforms or matches Llama-2 70B [51] and GPT-3.5 on many benchmarks, while it only activates 13B parameters to process each token. Although SMoE models require less computation per token, they remain parameter-heavy, e.g., Mixtral $8\times 7$B has 47B parameters in total while Grok-1 reaches 314B (see Tab. 6 for other models). This limits their broader deployment due to the substantial GPU memory requirements. Additionally, their throughput may not be ideal as the batch size needs to be restricted to fit the model within the available device memory. Therefore, it is vital to develop methods that can reduce the size of SMoE models without compromising their performance.

Many studies have shown that only a subset of parameters significantly contributes to performance when applying LLMs to downstream tasks [6, 26, 42, 58]. Pruning is a crucial technique for eliminating redundancy in neural networks. It can be unstructured, achieving high sparsity while maintaining performance [6, 15, 47], or structured, removing entire channels or layers to provide computational efficiency and reduced latency [35, 49, 58, 18, 54, 26]. One particularly efficient way is expert pruning in SMoE LLMs, a type of structured pruning with coarse granularity, which enhances overall efficiency. Recent expert pruning methods achieve 25%-50% sparsity and accelerate inference, but struggle to maintain performance [34] or need fine-tuning, requiring substantial GPU memory and resources [8, 37]. Thus, there is a pressing need for efficient pruning methods that operate within the constraints of inference resources for SMoE LLMs.

[Figure 1 panels: (a) A SMoE block before pruning. (b) Parameter space designed for expert pruning and merging.]
Figure 1: (a) the original SMoE block and (b) our implementation of EEP. We introduce the expert merging matrix $\bm{W}_{\text{EM}}$ and the router mapping matrix $\bm{W}_{\text{RM}}$ to enable the search for the optimal pruning configuration. When $\bm{W}_{\text{EM}}$ and $\bm{W}_{\text{RM}}$ have one-hot vectors as their rows, pruning is performed. When their elements are continuous values, routing weights and experts are aggregated to generate new weights and experts. We use an evolutionary strategy to search for the optimal $\bm{W}_{\text{EM}}$ and $\bm{W}_{\text{RM}}$.

In this work, we propose a gradient-free evolutionary strategy that achieves high sparsity while maintaining performance given a small training set on the downstream tasks. Our method is divided into two phases: expert pruning and expert merging. To facilitate the search for optimal pruning configurations, we design a parameter space for router mapping and expert merging, represented by two weight matrices, $\bm{W}_{\text{RM}}$ and $\bm{W}_{\text{EM}}$. These matrices are applied to the router weighting and expert modules, as illustrated in Fig. 1. In the first phase, expert pruning, we search through the weight matrices to retain the most prominent experts without updating any network parameters. In the second phase, expert merging, we retrieve knowledge from the pruned experts and consolidate it into the retained experts. To these ends, the rows of $\bm{W}_{\text{RM}}$ and $\bm{W}_{\text{EM}}$ are constrained to be one-hot in the first phase and take real values in the second phase. Since our method is gradient-free, it can be conducted on devices capable of inference. Our contributions can be summarized as follows:

  • Pruning the total number of experts: smaller memory consumption and better performance. Our approach enables more aggressive pruning of experts compared to current methods [34, 37]. In experiments on Mixtral $8\times 7$B-Instruct, we reduce the number of experts in each SMoE block from 8 to 2, a 72% reduction in parameters, while maintaining comparable performance across various downstream tasks. Surprisingly, we observe that fewer experts can lead to better performance. For instance, on the SQuAD dataset, pruning 4 out of 8 experts results in a performance increase from 53.4% to 75.4% without updating the remaining experts.

  • Pruning the number of active experts: better inference efficiency. We explore the pruning of active experts and find that effective expert merging compensates for the loss of active experts across downstream tasks. This process significantly improves efficiency without compromising the model’s utility on these tasks. For instance, by reducing the active experts in Mixtral $8\times 7$B from two to one, we observe a prefill acceleration of up to $1.63\times$.

  • Generalization ability. We test the performance of our method on datasets with higher diversity and out-of-distribution tasks using MMLU [17]. Specifically, we take 50 of the 57 datasets included in MMLU and conduct EEP using data from a small subset of each of the 50 datasets. We then evaluate the pruned model on the test data of i) the 50 datasets and ii) the 7 unseen datasets. In both evaluation tasks, we observe that EEP consistently outperforms other pruning methods, demonstrating the strong generalization ability of our method.

  • A novel and efficient pruning paradigm. The common pruning paradigm usually consists of two steps. In the first step, parameters are pruned using empirical criteria. This operation often lowers performance. In the second step, retained parameters are fine-tuned through stochastic gradient descent to recover performance. This operation often requires substantial GPU memory and computation time, making it prohibitive for most users of LLMs. EEP adopts a gradient-free evolutionary strategy for both pruning and fine-tuning. As a result, our pruned model significantly outperforms the pruned models of other methods, while our pruning and fine-tuning processes can run on devices affordable for inference, making EEP more widely applicable. In addition, to inherit knowledge from the unpruned model, existing methods either select a subset of weights based on predefined importance criteria [16, 60, 38], or rely on distillation techniques [39, 62, 1]. In contrast, EEP introduces a third paradigm, employing weight merging [56] to transfer knowledge during the model compression process.
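The 72% parameter reduction quoted in the first contribution can be sanity-checked from the rounded figures given in the introduction (47B total parameters, 13B active with two of eight experts). The following is a rough sketch, assuming the parameters split into a shared part plus eight equally sized expert parts; this decomposition and the helper name `total_params_b` are our own illustration, not something taken from the paper.

```python
# Back-of-the-envelope estimate using the rounded figures from the introduction.
TOTAL_B = 47.0     # total parameters of Mixtral 8x7B, in billions
ACTIVE_B = 13.0    # parameters used per token with 2 of 8 experts active
NUM_EXPERTS = 8
TOP_K = 2

# Assumption: total = shared + NUM_EXPERTS * per_expert and
#             active = shared + TOP_K * per_expert (summed over all blocks).
per_expert_b = (TOTAL_B - ACTIVE_B) / (NUM_EXPERTS - TOP_K)
shared_b = ACTIVE_B - TOP_K * per_expert_b


def total_params_b(experts_kept: int) -> float:
    """Estimated total parameters (billions) when keeping `experts_kept` experts per block."""
    return shared_b + experts_kept * per_expert_b


for kept in (8, 4, 2):
    reduction = 1.0 - total_params_b(kept) / TOTAL_B
    print(f"{kept} experts kept: {total_params_b(kept):.1f}B total, {reduction:.0%} fewer parameters")
```

Under these assumptions, keeping two experts per block leaves roughly the 13B parameters that were already active per token, i.e., a reduction of about 72% relative to 47B, which matches the figure stated above.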

2 Related work

Sparse Mixture-of-Experts LLMs. Shazeer et al. [45] introduced the sparse MoE layer, which consists of multiple experts, each being a simple feed-forward network (FFN), and a trainable router network that selects a sparse combination of the experts to process each input. Such SMoE models can significantly increase model capacity while maintaining computational efficiency. However, this utility is ideally achieved when the router accurately and evenly assigns experts to each token during training and inference. Many works address these challenges [14, 28, 12, 64]. Recently, many SOTA LLMs adopt the SMoE structure to achieve high performance and computational efficiency simultaneously [20, 4, 50, 57]. Additionally, Zhang et al. [63] propose transforming non-MoE models into SMoE models to accelerate inference, and Komatsuzaki et al. [25] upcycle pretrained models by reusing the parameters to initialize SMoE models, where all experts are replicates of the original FFNs, and then fine-tune the SMoE models.

Pruning for LLMs. Pruning techniques have emerged as a crucial strategy for optimizing LLMs by reducing model size and computational costs while maintaining performance. Unstructured pruning [6, 15, 47, 48] entails the removal of individual weights according to specific criteria, creating sparse networks that demand specialized hardware for efficient execution. In contrast, structured pruning [35, 49, 58, 18, 54, 26, 10, 59, 5] eliminates entire structures, such as neurons or attention heads, facilitating more straightforward implementation on standard hardware. Within structured pruning, specific focus areas include attention mechanisms, where redundant heads are pruned to streamline the self-attention layers, and FFNs where unnecessary neurons are removed to enhance computational efficiency. Additionally, expert pruning for SMoE models selectively prunes the expert networks [34, 37, 8, 24].

Evolutionary Strategy for Optimization. Evolutionary Strategies (ES) have been increasingly recognized for their robustness and flexibility in various optimization tasks, particularly where gradient-based methods fall short [55]. Notably, ES is highly effective for optimizing non-differentiable objective functions, offering a powerful alternative in scenarios where gradients are unavailable or unreliable [43, 22, 32, 52, 29]. Furthermore, ES excels in discrete optimization spaces, making it suitable for a wide range of combinatorial problems [2, 31, 30]. Recent advancements have extended the application of ES to the domain of LLMs, enabling memory-efficient fine-tuning without the need for backpropagation [36].

Figure 2: We leverage EEP for two purposes: reducing the total number of experts, which lowers the memory footprint (use case 1), and reducing the number of active experts, thereby accelerating inference (use case 2).

3 Background on sparse Mixture-of-Experts language models

In this section, we discuss the general concept of sparse Mixture-of-Experts (SMoE) implementation in modern decoder-only models, using the Mixtral family [20] as a specific focus. A schematic illustration is provided in Fig. 1(a).

Notations. Let $\bm{X}\in\mathbb{R}^{n\times d}$ represent the input to a SMoE block, where $n$ is the sequence length and $d$ is the hidden dimension. The output of the attention block is denoted by $\bm{Z}\in\mathbb{R}^{n\times d}$. The main parameters in the attention block are the weight matrices for computing query, key, and value: $\bm{W}_{Q},\bm{W}_{K},\bm{W}_{V}$. In the SMoE structure, there are $E$ experts, each represented by a feed-forward network (FFN) with parameters $\bm{\theta}_{i}$ for the $i$-th expert. The router network, denoted by $\bm{W}_{G}$, produces routing weights $\bm{G}\in\mathbb{R}^{n\times E}$ for the sparse activation of the experts. For clarity, we omit the normalization layers and biases.

Self-Attention Mechanism. The self-attention mechanism computes the query, key, and value matrices as follows: $\bm{Q}=\bm{X}\bm{W}_{Q},\,\bm{K}=\bm{X}\bm{W}_{K},\,\bm{V}=\bm{X}\bm{W}_{V}$. The attention scores and the output $\bm{Z}$ are then computed as:

$$\text{Attention}(\bm{Q},\bm{K},\bm{V})=\text{softmax}\left(\frac{\bm{Q}\bm{K}^{\top}}{\sqrt{d_{k}}}\right)\bm{V},\quad \bm{Z}=\text{Attention}(\bm{Q},\bm{K},\bm{V})\bm{W}_{O}, \tag{1}$$

where $\text{softmax}(\cdot)$ denotes a row-wise softmax function. The attention mechanism produces a weighted sum of the values $\bm{V}$, where the weights are derived from the dot product of the queries $\bm{Q}$ and keys $\bm{K}$, scaled by the square root of the key/query dimension, $\sqrt{d_{k}}$. The weighted-average values are then mapped by the output matrix $\bm{W}_{O}$ to $\bm{Z}$.
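As a concrete reading of Eq. 1, the following NumPy sketch computes single-head attention exactly as written. It is a simplification under stated assumptions: one head, no causal mask, no normalization layers or biases, and random weights; the function names and toy dimensions are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Row-wise softmax with the usual max-subtraction for numerical stability."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(X, W_Q, W_K, W_V, W_O):
    """Eq. 1 for a single head; masking and multi-head splitting are omitted."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = Q.shape[-1]
    scores = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (n, n) attention weights
    return scores @ V @ W_O                             # (n, d) output Z


rng = np.random.default_rng(0)
n, d, d_k = 5, 16, 8                      # toy sizes, chosen only for illustration
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
W_O = rng.normal(size=(d_k, d))
Z = self_attention(X, W_Q, W_K, W_V, W_O)
print(Z.shape)
```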

Router Network in SMoE Structure.

The router network determines which experts to activate and how to scale their outputs. The routing weights $\bm{G}\in\mathbb{R}^{n\times E}$ are computed as:

$$\bm{G}=\text{softmax}(\bm{Z}\bm{W}_{G}). \tag{2}$$

Sparse activation of the experts is achieved by selecting the top-$k$ routing weights for each input token. The output of the activated experts is scaled by the routing weights and aggregated to form the output of the SMoE layer $\bm{H}$ (the top-$k$ routing weights may be further normalized to sum to 1; this nuance is omitted here):

$$\forall j=1\dots n,\quad \bm{H}_{j}=\sum_{i\in\text{TopK}(\bm{G}_{j})}\bm{G}_{ji}\cdot\text{FFN}_{i}(\bm{Z}_{j}), \tag{3}$$

where $\text{TopK}(\bm{G}_{j})$ denotes the indices of the top-$k$ routing weights for the $j$-th input token, and $\text{FFN}_{i}$ denotes the function of the $i$-th expert, as defined below.

FFN as Expert. Each expert in the SMoE structure is an independent FFN with two fully-connected layers, denoted by $\bm{W}_{1i}$ and $\bm{W}_{2i}$. When applying SwiGLU [44], an additional weight matrix $\bm{W}_{3i}$ is introduced for the activation function. The $i$-th expert processes the input as follows:

$$\text{FFN}_{i}(\bm{Z}_{sub})=\text{SwiGLU}(\bm{Z}_{sub},\bm{W}_{1i},\bm{W}_{3i})\bm{W}_{2i}, \tag{4}$$

where $\bm{Z}_{sub}$ denotes the subset of rows in $\bm{Z}$ that activates the $i$-th expert. Depending on the activation function, the parameters of the $i$-th expert are either $\bm{\theta}_{i}=\{\bm{W}_{1i},\bm{W}_{2i}\}$ or $\bm{\theta}_{i}=\{\bm{W}_{1i},\bm{W}_{2i},\bm{W}_{3i}\}$.
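Eqs. 2 to 4 can be put together into a small forward pass. The sketch below is illustrative only: it assumes that SwiGLU takes the form $\text{SiLU}(\bm{Z}\bm{W}_{1})\odot(\bm{Z}\bm{W}_{3})$ (the convention used in Mixtral-style FFNs, not spelled out in the text), loops over tokens for readability rather than speed, and uses hypothetical names and toy sizes. The `renormalize` flag corresponds to the optional top-$k$ normalization mentioned above.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def silu(x):
    return x / (1.0 + np.exp(-x))


def ffn(z, W1, W2, W3):
    """Eq. 4, assuming SwiGLU(z, W1, W3) = silu(z W1) * (z W3)."""
    return (silu(z @ W1) * (z @ W3)) @ W2


def smoe_forward(Z, W_G, experts, k=2, renormalize=False):
    """Eqs. 2-3: route each token to its top-k experts and sum the scaled outputs."""
    G = softmax(Z @ W_G, axis=-1)              # (n, E) routing weights
    topk = np.argsort(-G, axis=-1)[:, :k]      # (n, k) indices of the largest weights
    H = np.zeros_like(Z)
    for j in range(Z.shape[0]):
        weights = G[j, topk[j]]
        if renormalize:                        # optional: make the k weights sum to 1
            weights = weights / weights.sum()
        for w, i in zip(weights, topk[j]):
            H[j] += w * ffn(Z[j], *experts[i])
    return H


rng = np.random.default_rng(0)
n, d, h, E = 4, 8, 16, 8                       # toy sizes for illustration
Z = rng.normal(size=(n, d))
W_G = rng.normal(size=(d, E))
# Each expert is a tuple (W1, W2, W3) with shapes (d, h), (h, d), (d, h).
experts = [(rng.normal(size=(d, h)), rng.normal(size=(h, d)), rng.normal(size=(d, h)))
           for _ in range(E)]
H = smoe_forward(Z, W_G, experts, k=2)
print(H.shape)
```

Only `k` of the `E` expert FFNs are evaluated per token, which is where the inference savings of the SMoE design come from; all `E` experts must still be held in memory.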

4 Method

Figure 3: Performance from a single expert to an ensemble of experts.

In this section, we introduce our proposed approach for optimizing SMoE LLMs through expert pruning and merging. We aim to enhance the efficiency and performance of SMoE architectures by leveraging evolutionary strategies. Our method addresses the challenges of large and complex search spaces without incurring the prohibitive computational costs associated with gradient-based optimization. The subsequent subsections elaborate on our motivation (Sec. 4.1), the configuration of the parameter space (Sec. 4.2), the evolutionary optimization strategy employed to achieve our objectives (Sec. 4.3), and the use cases to which we apply EEP (Sec. 4.4).

4.1 Motivation

LLMs based on the SMoE architecture have shown remarkable performance across various natural language processing tasks [20, 50, 57]. These models leverage multiple experts, activating only a subset for any given input, thus balancing computational efficiency and model capacity. Typically, top-2 experts are activated, striking a balance between performance and computational cost.

Fig. 3 presents our investigation into the activation of different numbers of experts on Mixtral $8\times 7$B-Instruct, revealing the following observations: i) Activating only a single expert does not lead to model collapse and may result in only a minimal performance drop compared to the default setting of using two experts. This suggests that individual experts possess redundant knowledge, enabling them to maintain reasonable performance independently. This redundancy indicates potential for expert pruning. ii) Conversely, activating all 8 experts leads to a noticeable performance gain, highlighting the benefits of expert ensemble. However, the computational cost of such an ensemble is substantially higher. Wortsman et al. [56] have shown that merging differently fine-tuned models can efficiently substitute their ensemble, achieving similar performance with reduced computational overhead.

Building on these insights, we propose a two-step approach involving expert pruning followed by expert merging. Initially, we search for the optimal subset of experts given a fixed size. Subsequently, we employ expert merging to consolidate the knowledge from the pruned experts into the remaining ones. This approach not only restores the knowledge of the pruned experts but also updates the surviving experts to incorporate the collective expertise of the entire SMoE block.

4.2 Parameter space for expert pruning and merging

Expert Pruning and Merging Matrices. To efficiently prune and merge experts in each SMoE block ($l=1\dots L$), we introduce two key matrices: the Router Mapping matrix ($\bm{W}_{\text{RM}}^{l}$) and the Expert Merging matrix ($\bm{W}_{\text{EM}}^{l}$). For clarity, we omit the block index $l$ in this section. A schematic illustration is provided in Fig. 1(b). The router mapping matrix $\bm{W}_{\text{RM}}\in\mathbb{R}^{E'\times E}$, where $E'$ is the reduced number of experts (i.e., $E>E'$), is applied to the routing weights $\bm{G}$ to reduce the dimensionality and handle fewer experts:

$$\bm{G}'=\bm{W}_{\text{RM}}\,\text{softmax}(\bm{Z}\bm{W}_{G}). \tag{5}$$

The expert merging matrix $\bm{W}_{\text{EM}}\in\mathbb{R}^{E'\times E}$ is applied to the expert weights $\{\bm{\theta}_{i}\}_{i=1}^{E}$ to merge $E$ experts into $E'$ experts. Each element in $\bm{W}_{\text{EM}}$ operates blockwise on the parameters of the experts. Denote $\{\omega_{j1},\omega_{j2},\dots,\omega_{jE}\}$ as the $j$-th row of $\bm{W}_{\text{EM}}$ that maps the original $E$ experts to the $j$-th new expert $\bm{\theta}'_{j}$. We define merging as follows:

$$\bm{\theta}'_{j}=\Big\{\sum_{i=1}^{E}\omega_{ji}\bm{W}_{1i},\ \sum_{i=1}^{E}\omega_{ji}\bm{W}_{2i},\ \sum_{i=1}^{E}\omega_{ji}\bm{W}_{3i}\Big\}, \tag{6}$$

where the parameters of the experts are defined in Eq. 4.
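To make Eqs. 5 and 6 concrete, the following NumPy sketch applies a router mapping matrix and an expert merging matrix to toy data. It is a minimal illustration, not the released implementation: the function names, the dictionary layout of the experts, and the example matrix values are hypothetical. Because $\bm{G}$ stores one token per row, applying $\bm{W}_{\text{RM}}$ to each token's routing vector is written here as `G @ W_RM.T`; this is our reading of Eq. 5.

```python
import numpy as np


def map_router(G, W_RM):
    """Eq. 5 per token: G is (n, E), W_RM is (E', E), the result is (n, E')."""
    return G @ W_RM.T


def merge_experts(experts, W_EM):
    """Eq. 6: every new expert is a weighted sum of the original expert matrices.

    `experts` is a list of E dicts with keys "W1", "W2", "W3"; W_EM is (E', E).
    """
    return [
        {name: sum(w * e[name] for w, e in zip(row, experts)) for name in ("W1", "W2", "W3")}
        for row in W_EM
    ]


rng = np.random.default_rng(0)
E, E_new, n, d, h = 4, 2, 3, 6, 10            # toy sizes for illustration
experts = [
    {"W1": rng.normal(size=(d, h)), "W2": rng.normal(size=(h, d)), "W3": rng.normal(size=(d, h))}
    for _ in range(E)
]
G = rng.random(size=(n, E))                   # stand-in routing weights

# Hypothetical matrices: new expert 0 averages experts 0 and 1,
# new expert 1 mixes experts 2 and 3; the router keeps columns 0 and 2.
W_EM = np.array([[0.5, 0.5, 0.0, 0.0],
                 [0.0, 0.0, 0.7, 0.3]])
W_RM = np.array([[1.0, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 1.0, 0.0]])

G_new = map_router(G, W_RM)
merged = merge_experts(experts, W_EM)
print(G_new.shape, len(merged), merged[0]["W1"].shape)
```

Since the merge is a one-off weighted sum over existing weights, it can be computed ahead of inference, after which only the $E'$ merged experts need to be kept in memory.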

Expert Pruning Phase. During the expert pruning phase, the low-rank matrices $\bm{W}_{\text{RM}}$ and $\bm{W}_{\text{EM}}$ are initialized with each row as a one-hot vector to ensure that only pruning occurs. Additionally, $\bm{W}_{\text{RM}}$ and $\bm{W}_{\text{EM}}$ are set to be identical, i.e., $\bm{W}_{\text{RM}}=\bm{W}_{\text{EM}}$. Consequently, these matrices only retain the selected expert weights and their corresponding routing weights. During evolutionary search, EEP also maintains the one-hot format of $\bm{W}_{\text{RM}}$ and $\bm{W}_{\text{EM}}$.
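The claim that one-hot rows reduce the two matrices to pure selection can be checked numerically. The sketch below assumes a hypothetical set of retained experts and stacks the first-layer weights of all experts into one array; the helper `one_hot_rows` and all sizes are ours, chosen only for illustration.

```python
import numpy as np


def one_hot_rows(kept, num_experts):
    """Build an (E', E) matrix whose j-th row is one-hot at column kept[j]."""
    W = np.zeros((len(kept), num_experts))
    W[np.arange(len(kept)), kept] = 1.0
    return W


rng = np.random.default_rng(0)
E, n, d, h = 8, 3, 4, 6
kept = [1, 4, 6]                              # hypothetical choice of retained experts
W_RM = W_EM = one_hot_rows(kept, E)           # the two matrices are tied in this phase

G = rng.random(size=(n, E))                   # stand-in routing weights
W1 = rng.normal(size=(E, d, h))               # stacked first-layer weights of all experts

G_pruned = G @ W_RM.T                              # keeps the routing columns of retained experts
W1_pruned = np.einsum("je,edh->jdh", W_EM, W1)     # keeps the weights of retained experts

print(np.allclose(G_pruned, G[:, kept]), np.allclose(W1_pruned, W1[kept]))
```

With one-hot rows, both products simply pick out the routing weights and parameters of the retained experts, so no weight values change during this phase.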

Expert Merging Phase. In the expert merging phase, $\bm{W}_{\text{RM}}$ and $\bm{W}_{\text{EM}}$ are decoupled and initialized from their optimal values obtained during the pruning phase. This decoupling allows for a more flexible transformation where multiple experts can be merged, and the router weights can be updated independently. During this phase, the elements of $\bm{W}_{\text{RM}}$ and $\bm{W}_{\text{EM}}$ transition from discrete $0/1$ values to continuous values. This allows the matrices to perform more nuanced transformations.

4.3 Evolutionary search for the router mapping and expert merging matrices

The search space of the router mapping and expert merging matrices is large and complex, making it difficult to design heuristics for determining a solution, as is done in other expert pruning studies [37, 8, 34]. Therefore, an efficient optimization strategy is necessary. Given the substantial size of SMoE LLMs, computing gradients for optimization is computationally prohibitive for most users. As a solution, we employ a gradient-free evolutionary strategy, similar to approaches found in previous works [30, 32]. Our algorithm is detailed in Alg. 1.

Initially, we populate the search space using random initialization. During the evolutionary search, each set of router mapping and expert merging matrices is treated as an individual. In each iteration, only the top-performing individuals are selected as parents to produce the next generation through crossover and mutation. Specifically, during crossover, we randomly combine the entries of the matrices from two parents or select one parent’s matrices entirely. For mutation, we introduce random Gaussian noise to the matrices, ensuring stochastic variations. This process conserves beneficial adaptations while discarding detrimental modifications, enhancing the optimization process.

This evolutionary reproduction process is repeated for a predetermined number of iterations within each search phase, updating the population with newly generated individuals. Upon completion of the search process, the best individual is selected as the output of our search algorithm.
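The loop described above (selection of top individuals, crossover, Gaussian mutation, repeated for a fixed number of iterations) can be sketched generically as follows. This is not Alg. 1: the population sizes, crossover probability, noise scale, and the toy fitness function are all placeholder assumptions, and the sketch covers only the continuous (merging-phase) case, without the one-hot constraint of the pruning phase. In EEP the fitness would be task performance measured by model inference; here it is replaced by a cheap synthetic objective so the example runs on its own.

```python
import numpy as np


def evolutionary_search(fitness, init_population, num_iters=50, num_parents=4,
                        num_children=16, sigma=0.1, rng=None):
    """Gradient-free search: keep the best individuals, recombine, and mutate."""
    if rng is None:
        rng = np.random.default_rng()
    population = list(init_population)
    for _ in range(num_iters):
        # Selection: only the top-performing individuals become parents.
        population.sort(key=fitness, reverse=True)
        parents = population[:num_parents]
        children = []
        for _ in range(num_children):
            a, b = rng.choice(num_parents, size=2, replace=False)
            pa, pb = parents[a], parents[b]
            if rng.random() < 0.5:
                # Crossover: take each entry from one of the two parents at random.
                mask = rng.random(pa.shape) < 0.5
                child = np.where(mask, pa, pb)
            else:
                # Or copy one parent's matrix entirely.
                child = pa.copy()
            # Mutation: additive Gaussian noise.
            child = child + sigma * rng.normal(size=child.shape)
            children.append(child)
        # Parents survive, so the best fitness never decreases across iterations.
        population = parents + children
    return max(population, key=fitness)


# Toy objective (an assumption for illustration): get close to a hidden target matrix.
rng = np.random.default_rng(0)
target = rng.normal(size=(2, 8))


def toy_fitness(W):
    return -float(np.sum((W - target) ** 2))


init = [rng.normal(size=(2, 8)) for _ in range(8)]
best = evolutionary_search(toy_fitness, init, num_iters=30, rng=rng)
print(toy_fitness(best) >= max(toy_fitness(W) for W in init))
```

Each fitness evaluation needs only a forward pass of the model, which is why this style of search fits within inference-level memory budgets.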

4.4 Use Cases

We explore two applications of EEP: expert pruning and expert activation pruning. In expert pruning, EEP searches for optimal router mapping ($\bm{W}_{\text{RM}}$) and expert merging ($\bm{W}_{\text{EM}}$) matrices to minimize the total number of experts while maintaining high performance. For expert activation pruning, the goal is to achieve strong performance with only one active expert per token. Here, we use the same EEP search algorithm to optimize the expert and router networks by updating the $\bm{W}_{\text{RM}}$ and $\bm{W}_{\text{EM}}$ matrices, while activating only one expert during inference. Fig. 2 illustrates these two use cases. Additionally, we investigate the combination of these two approaches, reducing both the total number of experts and the number of active experts simultaneously (see Sec. 5.3).

5 Experiments

In this section, we validate the effectiveness of our method by considering two use cases: expert pruning and expert activation pruning. In Sec. 5.1, we introduce the experimental settings. In Sec. 5.2, we investigate the first use case, expert pruning, by applying EEP to reduce the total number of experts. In Sec. 5.3, we further explore expert activation pruning, applying EEP to maintain performance while reducing the number of active experts by changing the top-2 routing weights to top-1. We also examine a composite case where both the total number of experts and the number of active experts are reduced. In Sec. 5.4, we present the experimental results on larger and more diverse datasets, as well as performance on out-of-distribution datasets, to validate the generalization ability of EEP. In Sec. 5.5, we profile memory usage and inference speed to demonstrate that our method achieves significant improvements compared to the full SMoE models. In Sec. 5.6 we provide insights on the observation of fewer experts but higher performance. More results, including experiments on larger datasets and other models, can be found in App. D.

5.1 Experimental settings

Our main results are based on the popular SMoE model Mixtral 8×7B [20]. We also include a larger model, Mixtral 8×22B [20], to demonstrate that our method generalizes. We use the "Instruct" version of these models for the generation tasks. We select tasks from the SuperGLUE dataset, as well as several other generation tasks, including SQuAD [41] and DROP [13]. For each individual dataset, we randomly sample a subset from the training set to conduct evolutionary search and use the test set for evaluation. Additional details can be found in App. A.

Evaluation. We adopt a generation-based evaluation approach for all datasets. Specifically, we use the instruction fine-tuned model to generate answers directly in response to the given questions and apply template matching to determine the correctness of the answers. Our evaluation protocol primarily follows the implementation of OpenCompass [11] for the design of question prompts, types of templates, and matching criteria, with a few modifications to better suit the Mixtral family of models. All experiments use the same evaluation settings. Examples of prompts and model outputs can be found in App. E and App. F.

Baselines. Since our method aims to compress instruction fine-tuned SMoE models on downstream tasks, we take the zero-shot performance of the full model as our main baseline, to show that EEP can significantly decrease the memory footprint and/or the computation overhead at inference time while maintaining or even improving performance. For the use case of decreasing the total number of experts, we additionally compare EEP with four other baselines to demonstrate the effectiveness of the designed search space and the evolutionary-search-based tuning method: (1) random selection of pruned experts, (2 and 3) pruning the experts with the lowest activation frequency or the lowest soft activation values [37], and (4) NAEE [34], which exhaustively evaluates the loss between the full model and all pruning choices for each layer and selects the one with the lowest loss. For the use case of decreasing the number of active experts, we select the dynamic skipping method proposed by NAEE [34] as an additional baseline. More details are given in App. A.
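As an illustration of the frequency-based baseline, the sketch below counts how often each expert is activated and retains the most frequently used ones. It is a simplified toy version and not the implementation of [37]: the function name `frequency_prune`, the format of `routing_log`, and the tie-breaking rule (lower index first) are assumptions.

```python
from collections import Counter


def frequency_prune(routing_log, num_experts, num_keep):
    """Return the indices of the num_keep most frequently activated experts.

    routing_log: iterable of tuples, each holding the expert indices activated
    for one token (two indices per token under top-2 routing).
    """
    counts = Counter()
    for active in routing_log:
        counts.update(active)
    # Sort by descending count; break ties by expert index (an assumption).
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    return sorted(ranked[:num_keep])


# Toy routing decisions for five tokens over four experts.
log = [(0, 1), (0, 2), (0, 1), (3, 1), (0, 2)]
print(frequency_prune(log, num_experts=4, num_keep=2))  # -> [0, 1]
```

Such a rule looks only at activation statistics of the full model, whereas EEP scores each candidate pruning pattern by the task performance of the pruned model.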

5.2 Reducing the total number of experts

We apply EEP to search for the optimal pruning configuration, parameterized by the router mapping matrix $\bm{W}_{\text{RM}}$ and the expert merging matrix $\bm{W}_{\text{EM}}$, under budgets of 4 and 2 retained experts. EEP (Prune Only) denotes the results of running only the expert pruning phase described in Sec. 4.2. In contrast, EEP (Prune + Merge) denotes the results after the complete evolutionary search process. The results are shown in Tab. 1, and we discuss them below. The Random baseline is run 30 times; we report the mean here and defer the complete results to Sec. D.4.

Table 1: Results of expert pruning on Mixtral 8×7B-Instruct. Bold values indicate the best across all methods; underlined values show the best without parameter updates (i.e., excluding EEP (Prune+Merge)).

| Expert | Method | COPA | MultiRC | WIC | WSC | RTE | BoolQ | CB | ReCoRD | DROP | SQuAD | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Num=8 | Full Model | 89.0 | 83.0 | 51.8 | 63.5 | 73.2 | 77.4 | 51.7 | 50.3 | 30.6 | 53.4 | 62.4 |
| Num=4 | Random | 63.8 | 49.4 | 37.6 | 43.3 | 45.1 | 50.2 | 38.7 | 35.1 | 27.4 | 58.3 | 44.9 |
| Num=4 | Frequency [37] | 63.0 | 74.8 | 36.0 | 34.6 | 18.1 | 71.0 | 30.4 | 41.6 | 29.9 | 58.2 | 45.8 |
| Num=4 | Soft Activation [37] | 73.0 | 30.6 | 51.4 | 37.5 | 41.9 | 40.4 | 17.9 | 36.8 | 33.3 | 10.2 | 37.3 |
| Num=4 | NAEE [34] | 87.0 | 76.0 | 52.6 | 64.5 | 61.7 | 77.2 | 51.7 | 50.4 | 30.6 | 53.0 | 60.5 |
| Num=4 | EEP (Prune Only) | 95.0 | 81.2 | 57.8 | 67.3 | 74.0 | 82.8 | 69.6 | 60.0 | 37.3 | 75.2 | 70.3 |
| Num=4 | EEP (Prune+Merge) | 99.0 | 84.6 | 65.0 | 73.1 | 76.9 | 84.8 | 75.0 | 63.6 | 39.7 | 80.6 | 74.2 |
| Num=2 | Random | 36.8 | 22.3 | 13.6 | 15.0 | 28.4 | 15.5 | 38.6 | 16.9 | 18.3 | 36.9 | 24.2 |
| Num=2 | Frequency [37] | 51.0 | 17.6 | 8.8 | 1.9 | 48.4 | 30.6 | 35.7 | 10.4 | 14.9 | 9.2 | 24.9 |
| Num=2 | Soft Activation [37] | 33.0 | 18.2 | 49.4 | 18.5 | 15.2 | 1.8 | 32.1 | 4.4 | 11.7 | 50.0 | 23.4 |
| Num=2 | NAEE [34] | 75.0 | 42.4 | 48.4 | 49.0 | 54.5 | 49.8 | 19.6 | 42.0 | 31.2 | 58.2 | 47.0 |
| Num=2 | EEP (Prune Only) | 76.0 | 63.8 | 51.8 | 63.5 | 64.3 | 70.6 | 58.9 | 47.2 | 37.1 | 64.0 | 59.7 |
| Num=2 | EEP (Prune+Merge) | 93.0 | 71.6 | 58.6 | 65.4 | 69.0 | 75.6 | 66.1 | 47.2 | 38.4 | 70.2 | 65.6 |

EEP fully exploits expert-wise redundancy on downstream tasks. Based on the results obtained from the pruning phase of EEP, retaining only 4 experts allows the model to achieve better performance and lower computational costs simultaneously on all datasets except MultiRC. Even with a particularly low budget of only 2 retained experts, EEP still achieves comparable or even better performance than the full model on five datasets, with some datasets showing significant improvements over the full model (e.g., 58.9 vs. 51.7 on CB and 64.0 vs. 53.4 on SQuAD). For the remaining datasets, model collapse is avoided.

EEP is more effective than other baseline methods for selecting pruned experts. Comparing the results of other methods, we find that EEP is more effective for identifying the optimal pruning pattern. Random sampling of experts results in low mean accuracy and high variance. Pruning experts based on selection frequency also performs poorly on most datasets and has a high probability of collapse under high sparsity. NAEE can nearly maintain the performance of the full model when retaining four experts. However, EEP surpasses all methods by a large margin across all datasets.

Expert merging brings significant improvements after pruning. As shown in the last row for each pruning rate in Tab. 1, the results after expert merging exceed those obtained through the expert pruning phase alone. Specifically, expert merging achieves a general improvement on almost all datasets. On WIC, CB, and SQuAD under both pruning rates, and on WSC when four experts are retained, the accuracy improvement reaches 5%∼7%, demonstrating its effectiveness in restoring the knowledge of pruned experts and enhancing individual experts. Additionally, we find expert merging to be an effective method for fine-tuning SMoE LLMs (i.e., keeping the number of total and active experts unchanged); these results are presented in Tab. 9.

Generality across models. Given the promising results on the Mixtral 8×7B-Instruct model, we further apply EEP to three other models: the larger Mixtral 8×22B-Instruct [20], Qwen1.5-MoE-A2.7B-Chat [4], and Qwen2-MoE-A14B-Chat [40]. We conduct experiments on fewer datasets due to limited computational resources. Results are shown in Tab. 2, Tab. 7, and Tab. 8, respectively. EEP again achieves strong improvements and the above observations still hold, which indicates that EEP scales to large SMoE models.

Table 2: Results of expert pruning on Mixtral 8×22B-Instruct. Bold values indicate the best across all methods; underlined values show the best without parameter updates (i.e., excluding EEP (Prune+Merge)).

| Budget | Method | WIC | WSC | BoolQ | CB | SQuAD | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Num=8 | Full Model | 68.2 | 81.7 | 90.2 | 46.5 | 45.8 | 66.5 |
| Num=4 | Random | 27.0 | 30.2 | 37.8 | 34.6 | 37.2 | 33.4 |
| Num=4 | Frequency [37] | 0.0 | 38.5 | 76.6 | 57.1 | 50.6 | 30.6 |
| Num=4 | Soft Activation [37] | 25.2 | 60.6 | 6.4 | 60.7 | 54.2 | 41.4 |
| Num=4 | NAEE [34] | 64.0 | 68.3 | 78.4 | 33.9 | 52.4 | 59.4 |
| Num=4 | EEP (Prune Only) | 70.2 | 84.2 | 89.6 | 75.0 | 71.4 | 78.1 |
| Num=4 | EEP (Prune+Merge) | 72.2 | 87.5 | 89.6 | 78.6 | 74.0 | 80.4 |
| Num=2 | Random | 13.9 | 10.1 | 11.0 | 24.9 | 15.6 | 15.1 |
| Num=2 | Frequency [37] | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| Num=2 | Soft Activation [37] | 2.4 | 1.9 | 3.6 | 19.6 | 52.6 | 16.0 |
| Num=2 | NAEE [34] | 34.0 | 32.7 | 45.0 | 16.1 | 50.0 | 30.6 |
| Num=2 | EEP (Prune Only) | 57.8 | 63.5 | 76.0 | 50.0 | 71.0 | 63.7 |
| Num=2 | EEP (Prune+Merge) | 59.6 | 65.4 | 76.4 | 58.9 | 75.0 | 67.1 |

5.3 Reducing the number of active experts

Next, we present the experimental results for the second use case: decreasing the number of active experts. We modify the number of active experts by changing the top-$k$ routing from $k=2$ to $k=1$ while applying EEP to restore model performance. We evaluate our method with two different total numbers of experts (8 and 4). The results are presented in Tab. 3. We summarize the observations below.

EEP can improve individual experts through expert merging, allowing a single expert to handle the inference. Keeping the total number of experts at 8 and reducing the number of active experts to 1 consistently leads to a decline in baseline performance. However, by optimizing the model with EEP, we introduce a reliable improvement that mitigates this gap, resulting in comparable or even better performance than the full model. It is important to note that when the total number of experts is maintained, there is no expert pruning phase; only expert merging is applied for EEP.

The two use cases can be combined through EEP. By retaining fewer experts and simultaneously reducing the number of active experts, we achieve significant savings in both GPU memory and inference time (see Sec. 5.5). EEP can be directly applied in this scenario. Results show that with 4 total experts and 1 active expert, EEP achieves performance comparable to or even better than the full model.
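The switch from top-2 to top-1 routing studied in this subsection can be sketched as follows. The sketch assumes a Mixtral-style gate that selects the $k$ largest router logits and normalizes the selected weights to sum to one; the function name `top_k_gate` and the example logits are hypothetical.

```python
import math


def top_k_gate(logits, k):
    """Return {expert_index: weight} for the k largest logits.

    The weights are a softmax over the selected logits, so they sum to one.
    """
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    max_logit = max(logits[i] for i in top)  # subtract for numerical stability
    exps = {i: math.exp(logits[i] - max_logit) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


logits = [0.1, 2.0, 1.0, -1.0]    # router logits of one token over four experts
print(top_k_gate(logits, k=2))    # two active experts share the weight
print(top_k_gate(logits, k=1))    # -> {1: 1.0}
```

With $k=1$, the layer output for a token is the output of a single expert with weight one, so only one expert feed-forward pass is computed per token.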

Table 3: Results of active expert pruning on Mixtral 8×7B. Bold values show the best performance. "Active" indicates the average number of experts active per token. Avg. stands for average.

| Total | Active | Method | WIC | WSC | BoolQ | CB | SQuAD | Avg. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 8 | 2 | Full Model | 51.8 | 63.5 | 77.4 | 51.7 | 53.4 | 59.6 |
| 8 | 1 | Full Model | 50.8 | 48.1 | 66.0 | 48.2 | 43.8 | 51.4 |
| 8 | 1.4∼1.5 | Dyn [34] | 50.0 | 59.6 | 72.8 | 46.4 | 44.8 | 54.7 |
| 8 | 1 | EEP | 59.2 | 70.2 | 79.0 | 66.1 | 51.8 | 65.3 |
| 4 | 1 | NAEE [34] | 48.6 | 20.2 | 56.2 | 33.9 | 51.8 | 42.1 |
| 4 | 1.4∼1.5 | NAEE+Dyn [34] | 43.4 | 61.5 | 36.2 | 53.6 | 53.4 | 49.6 |
| 4 | 1 | EEP | 55.8 | 70.2 | 74.4 | 64.3 | 72.0 | 67.3 |

5.4 In-distribution and out-of-distribution generalization on diverse datasets

In this section, we further test EEP on a larger dataset, MMLU, to validate its generalization ability. We randomly split the 57 datasets in MMLU into two subsets of 50 and 7 datasets, which serve as the base dataset and the out-of-distribution (OOD) test dataset, respectively. We further divide each dataset in the larger subset into training and validation sets. We run the EEP search on the training sets and evaluate the searched patterns on both the validation sets and the OOD test dataset. Results are shown in Tab. 4. We find that EEP outperforms baseline methods on both the base dataset and the OOD test dataset. This indicates that EEP can handle large and diverse datasets and exhibits a certain level of generalization capability.
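The data split used in this experiment can be sketched as below. The subject names, the random seed, and the validation fraction are placeholders chosen for illustration; the text specifies only the 50/7 split and the further division of the 50 base datasets into training and validation sets.

```python
import random


def split_subjects(subjects, n_ood=7, seed=0):
    """Randomly hold out n_ood subjects as the OOD test set."""
    rng = random.Random(seed)
    shuffled = subjects[:]
    rng.shuffle(shuffled)
    return shuffled[n_ood:], shuffled[:n_ood]  # (base subjects, OOD subjects)


def split_examples(examples, val_fraction=0.2, seed=0):
    """Split the examples of one base subject into training and validation parts."""
    rng = random.Random(seed)
    shuffled = examples[:]
    rng.shuffle(shuffled)
    n_val = int(len(shuffled) * val_fraction)  # val_fraction is an assumption
    return shuffled[n_val:], shuffled[:n_val]  # (train, validation)


subjects = [f"subject_{i:02d}" for i in range(57)]  # placeholder names
base, ood = split_subjects(subjects)
print(len(base), len(ood))  # -> 50 7
```

The search sees only the training parts of the base subjects; the validation parts measure in-distribution performance, and the 7 held-out subjects measure OOD performance.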

Table 4: Results of expert pruning on Mixtral 8×7B-Instruct on MMLU dataset. Bold values indicate the best performance; underlined values show the best without updating remaining parameters.

| Budget | Method | IID (50 val. sets) | OOD (7 unseen datasets) |
| --- | --- | --- | --- |
| Num=8 | Full Model | 60.7 | 72.6 |
| Num=6 | Random | 53.0±9.6 | 64.6±10.0 |
| Num=6 | Frequency [37] | 35.2 | 35.0 |
| Num=6 | Soft Activation [37] | 54.3 | 65.6 |
| Num=6 | NAEE [34] | 57.5 | 69.4 |
| Num=6 | EEP (Prune Only) | 59.6 | 71.4 |
| Num=6 | EEP (Prune+Merge) | 61.8 | 71.3 |
| Num=4 | Random | 45.1±6.1 | 50.3±10.7 |
| Num=4 | Frequency [37] | 26.6 | 25.2 |
| Num=4 | Soft Activation [37] | 46.7 | 53.1 |
| Num=4 | NAEE [34] | 53.5 | 63.6 |
| Num=4 | EEP (Prune Only) | 55.4 | 62.4 |
| Num=4 | EEP (Prune+Merge) | 56.9 | 64.6 |

5.5 Improvements in memory usage and inference speed

We profile the memory overhead and inference speed of the Mixtral 8×7B model for the two use cases. We conduct tests on SQuAD with a batch size of 256 using two NVIDIA A100 GPUs. We report the peak memory usage and the wall-time acceleration ratio in Tab. 5. Retaining only 4 and 2 experts from the whole model decreases the memory overhead by 47% and 71%, respectively. Additionally, reducing the total number of experts improves inference speed due to higher parallelism, achieving speedups of 1.11× and 1.18× with 4 and 2 experts, respectively. In the use case of reducing active experts, an acceleration ratio of 1.24× is achieved. Finally, when combining the two use cases with 4 total experts and 1 active expert per token, EEP saves 47% of GPU memory and achieves a 1.41× speedup. The profiling results indicate that EEP can significantly reduce the computational cost and memory consumption of SMoE LLMs.
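The memory savings quoted here follow directly from the peak memory figures in Tab. 5 (88.6 GB for the full model, 46.6 GB with 4 experts, 25.6 GB with 2 experts). The short check below reproduces the arithmetic; the helper name `memory_saving` is introduced only for this example.

```python
def memory_saving(full_gb, pruned_gb):
    """Fraction of peak GPU memory saved relative to the full model."""
    return 1 - pruned_gb / full_gb


FULL_GB = 88.6  # full model with 8 experts, from Tab. 5
for num_experts, mem_gb in [(4, 46.6), (2, 25.6)]:
    print(f"{num_experts} experts: {memory_saving(FULL_GB, mem_gb):.0%} saved")
# -> 4 experts: 47% saved
# -> 2 experts: 71% saved
```

The saving is smaller than the fraction of experts removed (50% and 75%) because the measured peak memory also covers everything that pruning leaves untouched, such as the non-expert parameters.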

Table 5: Profiling the memory footprint and inference speedup of Mixtral 8×7B.

| Total | Active | Method | Speedup | GPU Mem (GB) |
| --- | --- | --- | --- | --- |
| 8 | 2 | Full Model | 1.0× | 88.6 |
| 8 | 1 | EEP | 1.24× | 88.6 |
| 4 | 2 | EEP | 1.11× | 46.6 |
| 4 | 1 | EEP | 1.41× | 46.6 |
| 2 | 2 | EEP | 1.18× | 25.6 |

(a) Activation (1/0 means activated/not activated) correlation before and after pruning.
(b) Accumulated activation times before and after pruning.
(c) Accumulated routing weights before and after pruning.
Figure 4: Statistics of the expert activation patterns before and after the Expert Pruning Phase. The data represents the first transformer block of Mixtral 8×7B-Instruct on the SQuAD dataset. In (a), four retained experts are re-indexed from 0 to 3 for clarity.

5.6 Why fewer experts lead to better performance

At first glance, it may seem counterintuitive that reducing the number of experts can improve performance as shown in Tabs. 1 and 2, especially when the remaining parameters are not retrained. Our hypothesis is that the router network operates differently after expert pruning, leading to this improvement. Typically, the router network is implemented as a smaller network, such as a one-layer perceptron. This makes it challenging to accurately partition the high-dimensional hidden space among experts. The issue of imbalanced activation has been identified in several works [14, 9]. If the router network does not function optimally before pruning, there may be potential for improvement by enabling the router to focus on a smaller subset of experts.

Although it is difficult to directly evaluate the router network's performance, we have observed that its behavior changes significantly after pruning. This change occurs because the pruning process eliminates some experts, and the routing weights for the remaining experts are normalized to sum to one. In Fig. 4, we observe distinct patterns in the accumulated activation times of the experts, their accumulated routing weights, and the activation correlation across experts. Further demonstrations of the expert activation patterns can be found in Sec. D.6.
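The effect of this renormalization can be illustrated with a toy example. The sketch assumes that, for a given token, the router logits of the retained experts are unchanged by pruning, so that the routing weights become a softmax restricted to the retained experts; the logits and the helper names are hypothetical.

```python
import math


def softmax(logits):
    max_logit = max(logits)  # subtract for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def routing_weights(logits, kept):
    """Routing weights over the retained experts only, summing to one."""
    probs = softmax([logits[i] for i in kept])
    return dict(zip(kept, probs))


logits = [2.0, 1.0, 0.5, 0.0]                   # one token, four experts
full = routing_weights(logits, [0, 1, 2, 3])    # before pruning
pruned = routing_weights(logits, [1, 3])        # experts 0 and 2 removed
print(full)
print(pruned)
```

In this example, the weight that expert 0 received before pruning is redistributed to experts 1 and 3, while the ratio between their weights stays the same. A token that previously relied on a pruned expert is therefore routed to a different expert, which changes the accumulated activation statistics.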

6 Conclusion

In this work, we present EEP, a gradient-free evolutionary search method optimized for pruning within an efficient parameter space. Through extensive experiments on various downstream datasets, we demonstrate that EEP achieves superior performance and greater sparsity compared to baseline methods. Additionally, we make a novel observation that the performance of SMoE models on downstream tasks can be enhanced through pruning, even without updating the remaining parameters. We discuss the potential reasons for this phenomenon, suggesting that pruning may lead to a more effective routing mechanism by reducing the complexity the router network needs to manage.

Limitations. Although we demonstrated promising results, our approach still requires a potentially costly search process. We leave the optimization of search cost to future work.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (No. 62325405, 62104128, U19B2019, U21B2031, 61832007, 62204164), the Flemish Government (AI Research Program) and the Research Foundation - Flanders (FWO) through project number G0G2921N, the Tsinghua EE Xilinx AI Research Fund, and the Beijing National Research Center for Information Science and Technology (BNRist). We thank Infinigence-AI for all their support.

References

  • Aghli and Ribeiro [2021] Nima Aghli and Eraldo Ribeiro. Combining weight pruning and knowledge distillation for cnn compression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3191–3198, 2021.
  • Akiba et al. [2024] Takuya Akiba, Makoto Shing, Yujin Tang, Qi Sun, and David Ha. Evolutionary optimization of model merging recipes. arXiv preprint arXiv:2403.13187, 2024.
  • Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikołaj Bińkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karén Simonyan. Flamingo: a visual language model for few-shot learning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 23716–23736. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/960a172bc7fbf0177ccccbb411a7d800-Paper-Conference.pdf.
  • Bai et al. [2023] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023.
  • Beltagy et al. [2020] Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020.
  • Blalock et al. [2020] Davis Blalock, Jose Javier Gonzalez Ortiz, Jonathan Frankle, and John Guttag. What is the state of neural network pruning? Proceedings of machine learning and systems, 2:129–146, 2020.
  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
  • Chen et al. [2022] Tianyu Chen, Shaohan Huang, Yuan Xie, Binxing Jiao, Daxin Jiang, Haoyi Zhou, Jianxin Li, and Furu Wei. Task-specific expert pruning for sparse mixture-of-experts. arXiv preprint arXiv:2206.00277, 2022.
  • Chi et al. [2022] Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal, Payal Bajaj, Xia Song, Xian-Ling Mao, Heyan Huang, and Furu Wei. On the representation collapse of sparse mixture of experts. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=mWaYC6CZf5.
  • Child et al. [2019] Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509, 2019.
  • Contributors [2023] OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023.
  • Dai et al. [2022] Damai Dai, Li Dong, Shuming Ma, Bo Zheng, Zhifang Sui, Baobao Chang, and Furu Wei. StableMoE: Stable routing strategy for mixture of experts. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7085–7095, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.acl-long.489. URL https://aclanthology.org/2022.acl-long.489.
  • Dua et al. [2019] Dheeru Dua, Yizhong Wang, Pradeep Dasigi, Gabriel Stanovsky, Sameer Singh, and Matt Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In Jill Burstein, Christy Doran, and Thamar Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1246. URL https://aclanthology.org/N19-1246.
  • Fedus et al. [2022] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
  • Frantar and Alistarh [2023] Elias Frantar and Dan Alistarh. SparseGPT: Massive language models can be accurately pruned in one-shot. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 10323–10337. PMLR, 23–29 Jul 2023.
  • He et al. [2018] Yihui He, Ji Lin, Zhijian Liu, Hanrui Wang, Li-Jia Li, and Song Han. Amc: Automl for model compression and acceleration on mobile devices. In Proceedings of the European conference on computer vision (ECCV), pages 784–800, 2018.
  • Hendrycks et al. [2021] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. Proceedings of the International Conference on Learning Representations (ICLR), 2021.
  • Hou et al. [2020] Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. Dynabert: Dynamic bert with adaptive width and depth. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 9782–9793. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/6f5216f8d89b086c18298e043bfe48ed-Paper.pdf.
  • Hwang et al. [2023] Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, et al. Tutel: Adaptive mixture-of-experts at scale. Proceedings of Machine Learning and Systems, 5, 2023.
  • Jiang et al. [2024] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024.
  • Kaplan et al. [2020] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  • Kharitonov [2019] Eugene Kharitonov. Federated online learning to rank with evolution strategies. WSDM ’19, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450359405. doi: 10.1145/3289600.3290968. URL https://doi.org/10.1145/3289600.3290968.
  • Kim et al. [2023] Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Advances in Neural Information Processing Systems, volume 36, pages 39648–39677. Curran Associates, Inc., 2023. URL https://proceedings.neurips.cc/paper_files/paper/2023/file/7cc1005ec73cfbaac9fa21192b622507-Paper-Conference.pdf.
  • Koishekenov et al. [2023] Yeskendir Koishekenov, Alexandre Berard, and Vassilina Nikoulina. Memory-efficient NLLB-200: Language-specific expert pruning of a massively multilingual machine translation model. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3567–3585, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.198. URL https://aclanthology.org/2023.acl-long.198.
  • Komatsuzaki et al. [2023] Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=T5nUQDrM4u.
  • Kwon et al. [2022] Woosuk Kwon, Sehoon Kim, Michael W. Mahoney, Joseph Hassoun, Kurt Keutzer, and Amir Gholami. A fast post-training pruning framework for transformers. In Alice H. Oh, Alekh Agarwal, Danielle Belgrave, and Kyunghyun Cho, editors, Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=0GRBKLBjJE.
  • Lepikhin et al. [2021] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. In International Conference on Learning Representations, 2021.
  • Lewis et al. [2021] Mike Lewis, Shruti Bhosale, Tim Dettmers, Naman Goyal, and Luke Zettlemoyer. Base layers: Simplifying training of large, sparse models. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 6265–6274. PMLR, 18–24 Jul 2021. URL https://proceedings.mlr.press/v139/lewis21a.html.
  • Liu et al. [2023a] Enshu Liu, Xuefei Ning, Zinan Lin, Huazhong Yang, and Yu Wang. Oms-dpm: Optimizing the model schedule for diffusion probabilistic models. In International Conference on Machine Learning, pages 21915–21936. PMLR, 2023a.
  • Liu et al. [2023b] Enshu Liu, Xuefei Ning, Zinan Lin, Huazhong Yang, and Yu Wang. OMS-DPM: Optimizing the model schedule for diffusion probabilistic models. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors, Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 21915–21936. PMLR, 23–29 Jul 2023b. URL https://proceedings.mlr.press/v202/liu23ab.html.
  • Liu et al. [2024a] Enshu Liu, Xuefei Ning, Huazhong Yang, and Yu Wang. A unified sampling framework for solver searching of diffusion probabilistic models. In The Twelfth International Conference on Learning Representations, 2024a. URL https://openreview.net/forum?id=W2d3LZbhhI.
  • Liu et al. [2024b] Enshu Liu, Junyi Zhu, Zinan Lin, Xuefei Ning, Matthew B Blaschko, Sergey Yekhanin, Shengen Yan, Guohao Dai, Huazhong Yang, and Yu Wang. Linear combination of saved checkpoints makes consistency and diffusion models better. arXiv preprint arXiv:2404.02241, 2024b.
  • Lu et al. [2022] Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 2507–2521. Curran Associates, Inc., 2022. URL https://proceedings.neurips.cc/paper_files/paper/2022/file/11332b6b6cf4485b84afadb1352d3a9a-Paper-Conference.pdf.
  • Lu et al. [2024] Xudong Lu, Qi Liu, Yuhui Xu, Aojun Zhou, Siyuan Huang, Bo Zhang, Junchi Yan, and Hongsheng Li. Not all experts are equal: Efficient expert pruning and skipping for mixture-of-experts large language models. arXiv preprint arXiv:2402.14800, 2024.
  • Ma et al. [2023] Xinyin Ma, Gongfan Fang, and Xinchao Wang. LLM-pruner: On the structural pruning of large language models. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=J8Ajf9WfXP.
  • Malladi et al. [2023] Sadhika Malladi, Tianyu Gao, Eshaan Nichani, Alex Damian, Jason D. Lee, Danqi Chen, and Sanjeev Arora. Fine-tuning language models with just forward passes. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=Vota6rFhBQ.
  • Muzio et al. [2024] Alexandre Muzio, Alex Sun, and Churan He. Seer-moe: Sparse expert efficiency through regularization for mixture-of-experts. arXiv preprint arXiv:2404.05089, 2024.
  • Ning et al. [2020] Xuefei Ning, Tianchen Zhao, Wenshuo Li, Peng Lei, Yu Wang, and Huazhong Yang. Dsa: More efficient budgeted pruning via differentiable sparsity allocation. In European Conference on Computer Vision, pages 592–607. Springer, 2020.
  • Polino et al. [2018] Antonio Polino, Razvan Pascanu, and Dan Alistarh. Model compression via distillation and quantization. arXiv preprint arXiv:1802.05668, 2018.
  • Qwen Team [2024] Qwen Team. Hello qwen2. https://qwenlm.github.io/blog/qwen2/, 2024. Accessed: 2024-06-20.
  • Rajpurkar et al. [2016] Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Jian Su, Kevin Duh, and Xavier Carreras, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. URL https://aclanthology.org/D16-1264.
  • Sajjad et al. [2023] Hassan Sajjad, Fahim Dalvi, Nadir Durrani, and Preslav Nakov. On the effect of dropping layers of pre-trained transformer models. Computer Speech & Language, 77:101429, 2023.
  • Salimans et al. [2017] Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
  • Shazeer [2020] Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
  • Shazeer et al. [2017] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=B1ckMDqlg.
  • Shen et al. [2023] Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. HuggingGPT: Solving AI tasks with chatGPT and its friends in hugging face. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=yHdTscY6Ci.
  • Sun et al. [2024] Mingjie Sun, Zhuang Liu, Anna Bair, and J Zico Kolter. A simple and effective pruning approach for large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=PxoFut3dWW.
  • Syed et al. [2023] Aaquib Syed, Phillip Huang Guo, and Vijaykaarti Sundarapandiyan. Prune and tune: Improving efficient pruning techniques for massive language models, 2023. URL https://openreview.net/forum?id=cKlgcx7nSZ.
  • Tao et al. [2023] Chaofan Tao, Lu Hou, Haoli Bai, Jiansheng Wei, Xin Jiang, Qun Liu, Ping Luo, and Ngai Wong. Structured pruning for efficient generative pre-trained language models. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023, pages 10880–10895, Toronto, Canada, July 2023. Association for Computational Linguistics.
  • Team [2023] Mosaic Research Team. Introducing DBRX: A new state-of-the-art open LLM, 2023. https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm (Accessed: 2024-05-18).
  • Touvron et al. [2023] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Trofin et al. [2021] Mircea Trofin, Yundi Qian, Eugene Brevdo, Zinan Lin, Krzysztof Choromanski, and David Li. Mlgo: a machine learning guided compiler optimizations framework. arXiv preprint arXiv:2101.04808, 2021.
  • Wan et al. [2023] Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury, et al. Efficient large language models: A survey. arXiv preprint arXiv:2312.03863, 1, 2023.
  • Wang et al. [2019] Chaoqi Wang, Roger Grosse, Sanja Fidler, and Guodong Zhang. EigenDamage: Structured pruning in the Kronecker-factored eigenbasis. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 6566–6575. PMLR, 09–15 Jun 2019.
  • Wierstra et al. [2014] Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. Natural evolution strategies. Journal of Machine Learning Research, 15(27):949–980, 2014. URL http://jmlr.org/papers/v15/wierstra14a.html.
  • Wortsman et al. [2022] Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvari, Gang Niu, and Sivan Sabato, editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 23965–23998. PMLR, 17–23 Jul 2022. URL https://proceedings.mlr.press/v162/wortsman22a.html.
  • xAI team [2024] xAI team. Grok: A new era of ai-powered personal assistance, 2024. https://x.ai/blog/grok (Accessed: 2024-05-18).
  • Xia et al. [2022] Mengzhou Xia, Zexuan Zhong, and Danqi Chen. Structured pruning learns compact and accurate models. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1513–1528, Dublin, Ireland, May 2022. Association for Computational Linguistics.
  • Xiao et al. [2024] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NG7sS51zVF.
  • Yang et al. [2018] Tien-Ju Yang, Andrew Howard, Bo Chen, Xiao Zhang, Alec Go, Mark Sandler, Vivienne Sze, and Hartwig Adam. Netadapt: Platform-aware neural network adaptation for mobile applications. In Proceedings of the European conference on computer vision (ECCV), pages 285–300, 2018.
  • Zeng et al. [2023] Andy Zeng, Maria Attarian, brian ichter, Krzysztof Marcin Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael S Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, and Pete Florence. Socratic models: Composing zero-shot multimodal reasoning with language. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=G2Q2Mh3avow.
  • Zhang et al. [2019] Linfeng Zhang, Jiebo Song, Anni Gao, Jingwei Chen, Chenglong Bao, and Kaisheng Ma. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 3713–3722, 2019.
  • Zhang et al. [2022] Zhengyan Zhang, Yankai Lin, Zhiyuan Liu, Peng Li, Maosong Sun, and Jie Zhou. MoEfication: Transformer feed-forward layers are mixtures of experts. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio, editors, Findings of the Association for Computational Linguistics: ACL 2022, pages 877–890, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.findings-acl.71. URL https://aclanthology.org/2022.findings-acl.71.
  • Zhou et al. [2022] Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, Andrew M Dai, Zhifeng Chen, Quoc V Le, and James Laudon. Mixture-of-experts with expert choice routing. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35, pages 7103–7114. Curran Associates, Inc., 2022.
  • Zhou et al. [2024] Zixuan Zhou, Xuefei Ning, Ke Hong, Tianyu Fu, Jiaming Xu, Shiyao Li, Yuming Lou, Luning Wang, Zhihang Yuan, Xiuhong Li, et al. A survey on efficient inference for large language models. arXiv preprint arXiv:2404.14294, 2024.

Appendix A Additional Details on Experimental Settings

A.1 Our setting

Search Space. As mentioned in Sec. 5, to avoid optimizing too many parameters, we split the weights of all experts into several groups. The merging coefficients $\bm{W}_{\text{EM}}$ and $\bm{W}_{\text{RM}}$ are shared within each group. Most of our main results are obtained by uniformly splitting all weights into four groups based on their depth, except for the experiments on the RTE, ReCoRD, and DROP datasets in Tab. 1. For these datasets, we find that treating each layer as an independent group performs significantly better during the pruning phase than using only four groups. More detailed results can be found in Sec. D.5. For the other datasets, we keep the four-group setting without exploring other configurations, as it consistently yields good performance.
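
The grouping described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and it assumes that "based on their depth" means contiguous blocks of layers. The 32-layer count follows from Sec. D.5, where 32 groups correspond to one group per layer.

```python
def split_layers_into_groups(num_layers, num_groups):
    """Uniformly partition layer indices into contiguous groups ordered by depth."""
    if not 1 <= num_groups <= num_layers:
        raise ValueError("num_groups must be between 1 and num_layers")
    base, extra = divmod(num_layers, num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        size = base + (1 if g < extra else 0)  # spread any remainder over the first groups
        groups.append(list(range(start, start + size)))
        start += size
    return groups


def layer_to_group(groups):
    """Map each layer index to the index of the group whose coefficients it shares."""
    return {layer: g for g, layers in enumerate(groups) for layer in layers}


# Hypothetical usage: 32 layers with 4 shared groups, or one group per layer.
four_groups = split_layers_into_groups(num_layers=32, num_groups=4)
per_layer = split_layers_into_groups(num_layers=32, num_groups=32)
print([len(g) for g in four_groups])    # [8, 8, 8, 8]
print(layer_to_group(four_groups)[20])  # 2
```

With four groups, only four sets of coefficients are searched; with 32 groups, each layer gets its own set, which enlarges the search space by a factor of eight.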

Search Process. We apply a two-stage search method as discussed in Sec. 4.2. The pruning phase consists of 40 iterations, followed by 160 iterations for the expert merging phase. At each iteration, we evaluate the accuracy on the training set and use this metric as the score for all individuals of merging coefficients in the population. Examples of the performance curve over the search iterations are provided in Sec. D.5.

Selected Datasets for OOD Evaluation. In Sec. 5.4, we randomly select 7 datasets for the OOD test. These datasets are: (1) lukaemon_mmlu_electrical_engineering, (2) lukaemon_mmlu_professional_accounting, (3) lukaemon_mmlu_high_school_macroeconomics, (4) lukaemon_mmlu_high_school_computer_science, (5) lukaemon_mmlu_business_ethics, (6) lukaemon_mmlu_miscellaneous, and (7) lukaemon_mmlu_high_school_psychology.

A.2 Baselines

To evaluate the effectiveness of reducing the total number of experts, we compare our method against four baseline approaches: (1) random selection of pruned experts, (2) pruning experts with the lowest frequency of activation, (3) pruning experts with the lowest soft activation values, and (4) NAEE [34], which exhaustively evaluates the discrepancy between the full model and all pruning choices for each layer and selects the one with the lowest discrepancy. For reducing the number of active experts, we adopt the dynamic skipping scheme from NAEE as a baseline approach.

For random selection, we uniformly sample a corresponding number of experts from all 8 experts in each layer. The full results with error margins for random selection are presented in Tab. 11.

For the frequency-based method, we run the model on the training set and count the number of times each expert is activated. We then prune the experts with the lowest frequency in each layer.

For the soft activation method, we run the model on the training set and accumulate the router weighting (soft activation value) for each expert. We then prune the experts with the lowest accumulated values in each layer.
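
The frequency-based and soft-activation baselines differ only in the statistic they accumulate. The sketch below illustrates both for a single layer. The function names and toy routing weights are hypothetical, and it assumes that the soft activation value is summed over all experts for every token (the text does not say whether only the activated experts contribute).

```python
def expert_statistics(routing_weights, top_k=2):
    """Accumulate per-expert activation counts and summed router weights.

    routing_weights: one list of router weights per token (one weight per expert).
    """
    num_experts = len(routing_weights[0])
    counts = [0] * num_experts
    soft = [0.0] * num_experts
    for weights in routing_weights:
        # Experts with the top_k largest weights are the ones activated for this token.
        top = sorted(range(num_experts), key=lambda e: weights[e], reverse=True)[:top_k]
        for e in top:
            counts[e] += 1
        for e in range(num_experts):
            soft[e] += weights[e]
    return counts, soft


def experts_to_keep(scores, num_keep):
    """Keep the num_keep experts with the highest score; the rest are pruned."""
    ranked = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)
    return sorted(ranked[:num_keep])


# Toy router outputs for three tokens and four experts (illustrative values only).
tokens = [
    [0.50, 0.30, 0.15, 0.05],
    [0.40, 0.10, 0.45, 0.05],
    [0.35, 0.05, 0.20, 0.40],
]
counts, soft = expert_statistics(tokens, top_k=2)
print(counts)                    # [3, 1, 1, 1]
print(experts_to_keep(soft, 2))  # [0, 2]
```

In this toy case the activation counts tie for three experts, while the accumulated weights still separate them, which is one way the two baselines can end up pruning different experts.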

For NAEE, we enumerate all pruning choices for each layer and select the one with the smallest output discrepancy compared to the full model. We use a batch of calibration data with a size of 64 to calculate the discrepancy. For the dynamic skipping scheme, we run the model on the entire training set to determine the median value of the ratio between the two largest routing weights for each layer. During validation, we dynamically skip the expert with the second-largest routing weight if the ratio between its weight and the largest weight is below the threshold. This results in an average of approximately 1.5 active experts.
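
The dynamic skipping rule can be sketched for a single layer as below. This is an illustration under stated assumptions rather than the NAEE code: the function names and toy weights are hypothetical, and the ratio is taken as second-largest weight divided by largest weight, which is how the skipping condition above reads.

```python
from statistics import median


def calibrate_threshold(routing_weights):
    """Median ratio (second-largest / largest routing weight) over calibration tokens."""
    ratios = []
    for weights in routing_weights:
        first, second = sorted(weights, reverse=True)[:2]
        ratios.append(second / first)
    return median(ratios)


def active_experts(weights, threshold):
    """Return the expert indices used for one token under dynamic skipping."""
    order = sorted(range(len(weights)), key=lambda e: weights[e], reverse=True)
    top1, top2 = order[0], order[1]
    if weights[top2] / weights[top1] < threshold:
        return [top1]        # second expert contributes little, so skip it
    return [top1, top2]


# Toy calibration tokens for one layer with three experts (illustrative values only).
calibration = [
    [0.625, 0.25, 0.125],     # ratio 0.4
    [0.5, 0.375, 0.125],      # ratio 0.75
    [0.75, 0.1875, 0.0625],   # ratio 0.25
]
threshold = calibrate_threshold(calibration)
print(active_experts([0.5, 0.375, 0.125], threshold))     # [0, 1]
print(active_experts([0.75, 0.1875, 0.0625], threshold))  # [0]
```

Because the threshold is a median, roughly half of the tokens fall below it and use one expert instead of two, which is consistent with the average of about 1.5 active experts reported above.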

Appendix B Size of current SMoE LLMs

Tab. 6 shows the basic parameter information of modern SMoE LLMs.

Table 6: Active Parameters, Total Parameters, and Parameters of the Experts for Various Models
| Model | Active Parameters | Total Parameters | Parameters of Experts |
|---|---|---|---|
| Mixtral 8x7B | 13B | 47B | 45B |
| Mixtral 8x22B | 39B | 141B | 136B |
| Grok-1 | 79B | 314B | 313B |
| DBRX | 36B | 132B | 128B |
| Qwen 1.5-MoE-A2.7B | 2.7B | 14.3B | 13.2B |
| Qwen 2-57B-A14B | 14B | 57B | 49B |
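
As a rough back-of-the-envelope reading of Tab. 6, the following sketch estimates the parameter count left after expert pruning. It assumes that expert parameters shrink linearly with the number of retained experts and that all other parameters are unchanged; the function name is hypothetical, and because the table values are rounded, the results are approximate.

```python
def params_after_pruning(total_b, expert_b, num_experts, num_kept):
    """Estimate total parameters (in billions) when keeping num_kept of num_experts experts."""
    non_expert = total_b - expert_b               # attention, embeddings, etc. stay as they are
    return non_expert + expert_b * num_kept / num_experts


# Mixtral 8x7B from Tab. 6: 47B total parameters, 45B of them in the experts.
print(params_after_pruning(47, 45, num_experts=8, num_kept=4))  # 24.5
print(params_after_pruning(47, 45, num_experts=8, num_kept=2))  # 13.25
```

Since the experts hold most of the parameters in every model listed, the total shrinks almost in proportion to the number of experts removed.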

Appendix C Algorithm Details

Alg. 1 presents the details of EEP. The notations are consistent with those in Sec. 4.2. For the $\mathrm{Crossover}$ operation, we combine the merging coefficients of the parent models along the dimension of the retained experts. For the $\mathrm{Mutate}$ operation, we perturb the merging coefficients. Specifically, during the pruning phase, we randomly replace the pruned experts with other experts and set the router weights accordingly. In the expert merging phase, we perturb the merging coefficients element-wise by adding Gaussian noise.

Algorithm 1 Evolutionary Search of EEP
$\Theta=\{\bm{\theta}_{1}^{l},\bm{\theta}_{2}^{l},\cdots,\bm{\theta}_{E}^{l}\}_{l=1}^{L}$: Full set of expert weights across all $L$ SMoE blocks. $\mathcal{F}$: The metric evaluator.
$P$: The whole Population of matrix configurations. $CP$: The Candidate Parents set of each loop, from which a parent configuration is selected. $NG$: The Next Generation newly mutated from the parent configurations in each loop. $\bm{W}=\{\bm{W}_{\text{EM}}^{l},\bm{W}_{\text{RM}}^{l}\}_{l=1}^{L}$: Full set of the search parameters across all $L$ SMoE blocks.
Epoch: Number of loops for the entire search process. $\textbf{M}_{CP}$: Maximum size of the candidate parents set $CP$. Iter: Maximum number of mutations in each loop.

$P\leftarrow\varnothing$
Initialize a set of random matrices $\bm{W}_{\text{init}}$, ensuring that each row is a one-hot vector.
$P\leftarrow P\cup\{(\bm{W}_{\text{init}},\mathcal{F}(\bm{W}_{\text{init}}))\}$
for $r=$ Expert Pruning Phase, Expert Merging Phase do
    for $t=1,\cdots,\text{Iter}$ do
        $NG\leftarrow\varnothing$
        for $i=1,\cdots,\text{Epoch}$ do
            $CP\leftarrow\{\bm{W}_{i}\mid\mathcal{F}(\bm{W}_{i}\cdot\Theta)$ ranks within the top $\min(\textbf{M}_{CP},|P|)$ in $P\}$
            $\bm{W}_{f},\bm{W}_{m}\xleftarrow{\text{Random Sample}}CP$
            $\bm{W}_{\text{new}}\leftarrow\mathrm{Mutate}(\mathrm{Crossover}(\bm{W}_{f},\bm{W}_{m}))$
            $NG\leftarrow NG\cup\{(\bm{W}_{\text{new}},\mathcal{F}(\bm{W}_{\text{new}}))\}$
        end for
        $P\leftarrow P\cup NG$
    end for
end for
$\bm{W}^{*}\leftarrow\mathop{\arg\min}\limits_{\bm{W}\in P}\mathcal{F}(\bm{W})$
return $\bm{W}^{*}$
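
The loop structure of Alg. 1 can be illustrated with a small, self-contained sketch of the pruning phase for a single layer. Everything here is a simplifying assumption rather than the released implementation: a candidate is a list naming the original expert placed in each retained slot (equivalent to one-hot rows), `toy_score` stands in for the metric evaluator and treats higher values as better (flip the ranking and the final selection if the metric is a loss), and all function names and hyperparameter values are hypothetical.

```python
import random


def crossover(father, mother, rng):
    """For each retained-expert slot, inherit the choice from one of the two parents."""
    return [rng.choice(pair) for pair in zip(father, mother)]


def mutate(candidate, num_experts, rng, rate=0.25):
    """With probability `rate`, point a slot at a randomly chosen original expert."""
    return [rng.randrange(num_experts) if rng.random() < rate else e for e in candidate]


def evolutionary_search(score_fn, num_experts, num_kept, iterations,
                        children_per_iter, max_parents, seed=0):
    rng = random.Random(seed)
    init = [rng.randrange(num_experts) for _ in range(num_kept)]
    population = [(init, score_fn(init))]
    for _ in range(iterations):
        # Candidate parents: the best-scoring configurations found so far.
        ranked = sorted(population, key=lambda p: p[1], reverse=True)
        parents = [w for w, _ in ranked[:min(max_parents, len(population))]]
        next_gen = []
        for _ in range(children_per_iter):
            father, mother = rng.choice(parents), rng.choice(parents)
            child = mutate(crossover(father, mother, rng), num_experts, rng)
            next_gen.append((child, score_fn(child)))
        population.extend(next_gen)
    return max(population, key=lambda p: p[1])


def toy_score(candidate, useful=frozenset({1, 4, 6, 7})):
    """Stand-in metric: number of distinct 'useful' experts that are retained."""
    return len(set(candidate) & useful)


best, best_score = evolutionary_search(toy_score, num_experts=8, num_kept=4,
                                       iterations=40, children_per_iter=8, max_parents=5)
print(best, best_score)
```

In the expert merging phase, the same loop would operate on real-valued coefficient matrices, with mutation adding Gaussian noise element-wise instead of swapping experts.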

Appendix D Additional Results

D.1 Results with other models

In this section, we further apply EEP to the Qwen 1.5 [4] and Qwen 2 [40] SMoE models. Results can be found in Tab. 7 and Tab. 8. The same observations in Sec. 5 hold for these models: (1) EEP selects better pruning patterns than other baseline methods without updating the remaining parameters, and (2) expert merging brings improvements in most cases.

For Qwen1.5-MoE-A2.7B-Chat [4], we notice that the baseline methods are prone to collapse. The opposite holds for Qwen2-MoE-A14B-Chat [40]: most baseline methods maintain the performance of the full model even when an extremely low number of experts is retained. In fact, we observe that the experts in Qwen2-MoE-A14B-Chat are particularly homogeneous, as the model’s performance is largely maintained even when only one random expert is activated per token. However, according to the information provided in their technical reports, both Qwen1.5-MoE-A2.7B and Qwen2-MoE-A14B employ upcycling and 64 experts per layer. We thus speculate that other training configurations, such as sizes and optimizer hyperparameters, lead to different final states. Nevertheless, EEP always achieves comparable or better performance than the full model and outperforms all baseline methods across settings, demonstrating its adaptability to different SMoE models.

Table 7: Results of expert pruning on Qwen1.5-MoE-A2.7B-Chat. Bold values indicate the best performance; underlined values show the best without updating remaining parameters. For NAEE, due to the excessive number of combinatorial possibilities, we only randomly select 5k of them for each layer.

| Budget | Method | WIC | WSC | BoolQ | CB | SQuAD | Avg. |
|---|---|---|---|---|---|---|---|
| Num=60 | Full Model | 51.4 | 46.2 | 73.6 | 32.1 | 68.6 | 54.4 |
| Num=30 | Random | 3.7±12.1 | 7.6±14.3 | 8.1±12.9 | 5.6±8.4 | 19.5±23.0 | 8.9 |
| Num=30 | Frequency [37] | 55.6 | 9.6 | 2.4 | 0.0 | 17.9 | 21.7 |
| Num=30 | Soft Activation [37] | 51.4 | 30.8 | 0.4 | 44.6 | 28.0 | 31.0 |
| Num=30 | NAEE [34] | 0.0 | 0.0 | 1.6 | 0.0 | 34.6 | 7.2 |
| Num=30 | EEP (Prune Only) | 59.8 | 59.6 | 78.0 | 71.4 | 70.6 | 67.9 |
| Num=30 | EEP (Prune+Merge) | 62.6 | 66.3 | 81.4 | 76.9 | 71.4 | 71.7 |
| Num=15 | Random | 1.4±5.9 | 0.5±1.3 | 2.0±4.1 | 4.3±10.6 | 1.1±3.4 | 1.9 |
| Num=15 | Frequency [37] | 0.0 | 0.0 | 7.8 | 16.1 | 0.0 | 4.9 |
| Num=15 | Soft Activation [37] | 26.2 | 3.9 | 0.0 | 0.0 | 25.4 | 11.1 |
| Num=15 | NAEE [34] | 0.0 | 1.0 | 5.2 | 0.0 | 0.0 | 1.2 |
| Num=15 | EEP (Prune Only) | 51.0 | 36.5 | 45.4 | 60.7 | 57.6 | 50.2 |
| Num=15 | EEP (Prune+Merge) | 54.4 | 63.5 | 58.2 | 58.9 | 76.9 | 62.7 |

Table 8: Results of expert pruning on Qwen2-MoE-A14B-Chat. Bold values indicate the best performance; underlined values show the best without updating remaining parameters. For NAEE, due to the excessive number of pruning patterns, we only randomly select 2k of them for each layer.

| Budget | Method | WIC | WSC | BoolQ | CB | SQuAD | Avg. |
|---|---|---|---|---|---|---|---|
| Num=64 | Full Model | 60.2 | 68.3 | 88.8 | 67.9 | 74.4 | 71.9 |
| Num=8 | Random | 55.3±7.1 | 61.6±5.6 | 78.7±7.3 | 35.4±17.6 | 79.7±2.4 | 62.1 |
| Num=8 | Frequency [37] | 58.8 | 59.6 | 79.4 | 46.4 | 78.2 | 64.5 |
| Num=8 | Soft Activation [37] | 60.8 | 64.4 | 82.6 | 14.3 | 75.2 | 59.5 |
| Num=8 | NAEE [34] | 56.6 | 60.6 | 82.6 | 41.1 | 81.2 | 64.4 |
| Num=8 | EEP (Prune Only) | 61.8 | 72.1 | 85.8 | 76.8 | 85.6 | 76.4 |
| Num=8 | EEP (Prune+Merge) | 63.4 | 75.0 | 85.8 | 85.7 | 87.0 | 79.4 |
| Num=4 | Random | 56.5±1.9 | 59.8±5.2 | 79.1±4.0 | 32.1±15.0 | 78.0±2.4 | 61.1 |
| Num=4 | Frequency [37] | 56.8 | 60.6 | 83.2 | 17.9 | 80.0 | 59.7 |
| Num=4 | Soft Activation [37] | 59.2 | 61.5 | 81.6 | 17.9 | 77.6 | 59.6 |
| Num=4 | NAEE [34] | 55.0 | 61.5 | 75.8 | 21.4 | 79.6 | 58.7 |
| Num=4 | EEP (Prune Only) | 62.0 | 65.4 | 84.6 | 69.6 | 80.6 | 72.4 |
| Num=4 | EEP (Prune+Merge) | 63.8 | 72.1 | 85.8 | 80.4 | 84.2 | 77.3 |
| Num=2 | Random | 56.4±1.4 | 58.2±3.7 | 77.8±4.5 | 26.5±9.6 | 76.4±1.9 | 59.1 |
| Num=2 | Frequency [37] | 58.0 | 60.6 | 79.6 | 42.9 | 72.4 | 62.7 |
| Num=2 | Soft Activation [37] | 57.4 | 65.4 | 71.4 | 62.5 | 76.8 | 66.7 |
| Num=2 | NAEE [34] | 55.6 | 56.7 | 73.4 | 16.1 | 75.0 | 55.4 |
| Num=2 | EEP (Prune Only) | 59.2 | 68.3 | 83.4 | 67.9 | 82.0 | 72.2 |
| Num=2 | EEP (Prune+Merge) | 61.0 | 70.2 | 84.4 | 76.8 | 83.8 | 75.2 |
| Num=1 | Random | 56.6±1.3 | 56.3±2.7 | 78.7±1.5 | 23.5±5.9 | 75.2±1.6 | 58.1 |
| Num=1 | Frequency [37] | 52.2 | 62.5 | 78.6 | 35.7 | 77.0 | 61.2 |
| Num=1 | Soft Activation [37] | 57.8 | 63.5 | 77.4 | 42.9 | 76.0 | 63.5 |
| Num=1 | NAEE [34] | 57.6 | 56.7 | 78.6 | 16.1 | 73.6 | 56.5 |
| Num=1 | EEP (Prune Only) | 57.8 | 65.4 | 82.6 | 57.1 | 81.4 | 68.5 |
| Num=1 | EEP (Prune+Merge) | 59.4 | 69.2 | 84.0 | 82.1 | 82.8 | 75.5 |

D.2 Fine-tuning using EEP

EEP can also be applied to fine-tune the model without pruning. As shown in Tab. 9, the effectiveness of EEP in fine-tuning demonstrates the efficiency of expert merging. Notably, EEP does not compute gradients and can therefore be executed on any device capable of running inference.

Table 9: Results of fine-tuning on Mixtral 8×7B using EEP.

| Method | WSC | WIC | RTE | BoolQ | CB | ReCoRD | SQuAD | DROP | Average |
|---|---|---|---|---|---|---|---|---|---|
| Baseline | 63.5 | 51.8 | 73.2 | 77.4 | 51.7 | 50.3 | 53.4 | 30.6 | 56.5 |
| EEP | 78.8 | 69.2 | 78.7 | 86.2 | 80.4 | 63.0 | 78.4 | 51.5 | 73.2 |

D.3 Profiling Results

We notice that the speedup ratio brought by pruning experts is influenced by the batch size. Additionally, the speedup ratio differs across the stages of the generation process. Therefore, we report more detailed profiling results of the Mixtral 8×7B model in Tab. 10.

Table 10: Profiling the inference speedup of Mixtral 8×7B.

| Total | Active | Method | Prefill Speedup (BS=1) | Prefill Speedup (BS=32) | Prefill Speedup (BS=256) | Decode Speedup (BS=1) | Decode Speedup (BS=32) | Decode Speedup (BS=256) |
|---|---|---|---|---|---|---|---|---|
| 8 | 2 | Full Model | 1.0× | 1.0× | 1.0× | 1.0× | 1.0× | 1.0× |
| 8 | 1 | EEP | 1.05× | 1.58× | 1.63× | 1.34× | 1.06× | 1.02× |
| 4 | 2 | EEP | 1.47× | 1.02× | 1.03× | 1.05× | 1.60× | 1.29× |
| 4 | 1 | EEP | 1.75× | 1.77× | 1.72× | 1.37× | 1.60× | 1.33× |
| 2 | 2 | EEP | 2.00× | 1.20× | 1.03× | 1.15× | 2.43× | 1.53× |

D.4 Random search

We report the full results of the random pruning baseline with error margins in Tab. 11 and Tab. 12. The results show that random pruning is extremely unstable, especially under a low expert budget, which indicates how challenging expert pruning is.

Table 11: Error margin of random pruning on Mixtral 8×7B.

| Expert | Method | COPA | MultiRC | WIC | WSC | RTE | BoolQ | CB | ReCoRD | DROP | SQuAD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Num=8 | Full Model | 89.0 | 83.0 | 51.8 | 63.5 | 73.2 | 77.4 | 51.7 | 50.3 | 30.6 | 53.4 |
| Num=4 | Random | 63.8±17.5 | 49.4±18.0 | 37.6±17.9 | 43.3±20.8 | 45.1±11.9 | 50.2±21.3 | 38.7±13.8 | 35.1±12.7 | 27.4±4.6 | 58.3±11.6 |
| Num=2 | Random | 36.8±14.6 | 22.3±8.4 | 13.6±14.8 | 15.0±18.1 | 28.4±13.4 | 15.5±17.1 | 38.6±10.8 | 16.9±7.4 | 18.3±3.2 | 36.9±12.6 |

Table 12: Results of random pruning on Mixtral 8×22B.

| Budget | Method | WIC | WSC | BoolQ | CB | SQuAD |
|---|---|---|---|---|---|---|
| Num=8 | Full Model | 68.2 | 81.7 | 90.2 | 46.5 | 45.8 |
| Num=4 | Random | 27.0±24.7 | 30.2±23.7 | 37.8±32.7 | 34.6±14.1 | 37.2±26.2 |
| Num=2 | Random | 13.9±15.1 | 10.1±13.2 | 11.0±12.9 | 24.9±15.6 | 15.6±20.3 |

D.5 Ablation study

The hyperparameters of EEP include the number of groups that share the same coefficients, and the number of search iterations.

Number of Groups. We uniformly split all expert weights into a number of groups. We evaluate the results with 4 groups (the merging coefficients are shared across layers within the group) and 32 groups (i.e., the merging coefficients of each layer are effectively independent) on RTE, ReCoRD, and DROP. Results are shown in Tab. 13. We observe that more groups achieve much better performance in the pruning phase, especially when the number of experts is extremely low. However, dividing weights into more groups introduces more parameters to optimize, which may be detrimental to the expert merging phase. Indeed, the improvements brought by expert merging with 4 groups are larger than those with 32 groups. Taking all these factors into account, we use 32 groups for these three datasets and keep 4 groups for the rest of the experiments.

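Table 13: Results with different numbers of coefficient groups.

| Group Number | Expert | Method | RTE | DROP | ReCoRD |
|---|---|---|---|---|---|
| 4 | Num=4 | Prune Only | 62.8 | 35.5 | 59.2 |
| 4 | Num=4 | Prune+Merge | 71.5 | 38.9 | 63.2 |
| 4 | Num=2 | Prune Only | 53.8 | 25.3 | 36.0 |
| 4 | Num=2 | Prune+Merge | 61.7 | 27.5 | 38.8 |
| 32 | Num=4 | Prune Only | 74.0 | 37.3 | 60.0 |
| 32 | Num=4 | Prune+Merge | 76.9 | 39.7 | 63.6 |
| 32 | Num=2 | Prune Only | 64.3 | 37.1 | 47.2 |
| 32 | Num=2 | Prune+Merge | 69.0 | 38.4 | 47.2 |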

Search Iterations. We plot the Accuracy-Iteration curve in Fig. 5. We report the best accuracy among all evaluated merging coefficients at each iteration. From the figure, we can see that the evolutionary search in the pruning phase is effective and efficient, finding good pruning configurations from poor initialization within only 40 iterations. The expert merging phase can further improve performance based on the pruning results.

(a) Accuracy-Iteration curve on CB dataset.
(b) Accuracy-Iteration curve on BoolQ dataset.
Figure 5: Accuracy-Iteration curves on different datasets. The model is Mixtral 8×7B and the total number of experts is 4.

D.6 Router Pattern

In Sec. 5.6, we demonstrate the changes in expert activation patterns using the statistics from the first transformer block of the Mixtral 8×7B-Instruct model. In this section, we additionally provide the statistics for the 15th transformer block (Fig. 6) and the 31st transformer block (Fig. 7).

(a) Activation correlation before and after pruning.
(b) Accumulated activation times before and after pruning.
(c) Accumulated routing weights before and after pruning.
Figure 6: Statistics of the expert activation patterns before and after pruning. The data represents the 15th transformer block of Mixtral 8×7B-Instruct on the SQuAD dataset. In (a), four retained experts are re-indexed from 0 to 3 for clarity.
(a) Activation correlation before and after pruning.
(b) Accumulated activation times before and after pruning.
(c) Accumulated routing weights before and after pruning.
Figure 7: Statistics of the expert activation patterns before and after pruning. The data represents the 31st transformer block of Mixtral 8×7B-Instruct on the SQuAD dataset. In (a), four retained experts are re-indexed from 0 to 3 for clarity.

D.7 Demonstration of Searched Patterns

We demonstrate the final searched patterns (pruning + merging) in Fig. 8. There is always one highlighted block in each row, which corresponds to the primarily retained experts in the pruning phase, while other values are close to zero. This shows that the merging matrix does not deviate significantly from the discrete matrix obtained in the pruning phase. However, these slight changes bring significant improvements. Additionally, we observe negative coefficients in some positions, indicating that the knowledge from certain experts may not benefit the downstream task.
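
To make the role of the merging matrix concrete, the sketch below applies a coefficient matrix to a set of toy expert weights. It assumes that each retained expert is a linear combination of the original experts, with one row of coefficients per retained expert; this reading is consistent with the one-hot initialization in Alg. 1, but the function name and all values are hypothetical and this is not the released code.

```python
def merge_experts(w_em, experts):
    """Return one merged weight vector per row of w_em.

    w_em: rows index retained experts, columns index original experts.
    experts: one flattened weight vector per original expert.
    """
    dim = len(experts[0])
    return [
        [sum(coef * expert[d] for coef, expert in zip(row, experts)) for d in range(dim)]
        for row in w_em
    ]


# Three toy experts with two weights each (illustrative values only).
experts = [[1.0, 0.0], [0.0, 2.0], [4.0, 4.0]]

# One-hot rows reproduce plain pruning: keep experts 0 and 2, drop expert 1.
prune_only = [[1, 0, 0], [0, 0, 1]]
print(merge_experts(prune_only, experts))  # [[1.0, 0.0], [4.0, 4.0]]

# A small negative coefficient subtracts part of the pruned expert 1 from expert 2.
merged = [[1.0, 0.0, 0.0], [0.0, -0.25, 1.0]]
print(merge_experts(merged, experts))      # [[1.0, 0.0], [4.0, 3.5]]
```

A matrix that stays close to one-hot, as observed in Fig. 8, therefore changes each retained expert only slightly while still letting the pruned experts contribute.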

(a) Visualization of the searched expert merging matrix.
(b) Visualization of the searched router mapping matrix.
Figure 8: Visualization of the searched patterns on the CB dataset.

Appendix E Prompt

We list the prompt used for each dataset in Tab. 14. We follow the default prompts in the OpenCompass codebase [11].

Table 14: Prompts for all datasets.
Dataset Prompt
WIC Sentence 1: <sentence1>\nSentence 2: <sentence2>
Are ’<word>’ in the above two sentences the same?\nA. Yes\nB. No\nAnswer:
A/B
WSC Passage: <text>\n
Does the pronoun # <span2> # refer to * <span1> *?\nA. Yes\nB. No\nAnswer:
A/B
RTE <premise>\n<hypothesis>\n
Is the sentence below entailed by the sentence above?\nA. Yes\nB. No\nAnswer:
A/B
BoolQ <passage>\n
Question: <question>\nA. Yes\nB. No\nAnswer:
A/B
CB <premise>\n<hypothesis>\n
What is the relation between the two sentences?\nA. Contradiction\nB. Entailment\nC. Neutral\nAnswer:
A/B/C
ReCoRD Passage: <text>\nResult: <question>\n
Question: What entity does ____ refer to in the result? Give me the entity name:
DROP \n\nText: <prompt>\n
Question: <question>\nAnswer:
SQuAD <context>\nAccording to the above passage, answer the following question.
If it is impossible to answer according to the passage, answer ‘impossible to answer‘:\n
Question: <question>

Appendix F Examples of model outputs and metric evaluations

In this section, we provide example outputs from the different approaches in Fig. 9, Fig. 10 and Fig. 11.

User: <s>[INST]There is also a growing number of new forms of procurement that involve relationship contracting where the emphasis is … \nAccording to the above passage, answer the following question. If it is impossible to answer according to the passage, answer ‘impossible to answer‘:\nQuestion: Focus on what is to ameliorate the many problems that arise from the often highly competitive and adversarial practices within the construction industry[/INST]
Full Model: The focus is on cooperation to ameliorate the many problems that arise from the often highly competitive and adversarial practices within the construction industry. Evaluation: ✓
Expert Num=4:
EEP: The focus is on cooperation to ameliorate the many problems that arise from the often highly competitive and adversarial practices within the construction industry. Evaluation: ✓
NAEE: The focus is to ameliorate the many problems that arise from the often highly competitive and adversarial practices within the construction industry. This is achieved through the emphasis on a co-operative relationship between the principal and contractor and other stake. Evaluation: ×
Frequency: The focus on co-operation in new forms of procurement is to ameliorate the many problems that arise from the often highly competitive and adversarial practices within the construction industry. Evaluation: ✓
Soft Activation: The focus in these new forms of procurement is to ameliorate (improve) the many problems that arise from the often highly competitive and adversarial (hostile) practices within the construction industry.\n\nThe problems that Evaluation: ×
Expert Num=2:
EEP: Focus on co-operation is to ameliorate the many problems that arise from the often highly competitive and adversarial practices within the construction industry. Evaluation: ✓
NAEE: The focus is to ameliorate the many problems that arise from the often highly competitive and adversarial practices within the construction industry. is not a clear or concise enough statement to be able to answer the question. Evaluation: ×
Frequency: I ameliorate the many problems that arise from the often highly competitive and adversarial practices within the construction industry.\n\nAccording to the passage, answer:\n\n> Focus on what is to ameliorate the many problems Evaluation: ×
Soft Activation: \n\n -------------------------------------------------------------------------------- Evaluation: ×
Figure 9: Example of Mixtral 8×7B-Instruct on SQuAD.
User: <s>[INST]For the 2012–13 school year annual tuition was $38,000, with a total cost of attendance of $57,000. Beginning 2007, families with incomes below $60,000 pay nothing for their children to attend, including room and board…\nAccording to the above passage, answer the following question. If it is impossible to answer according to the passage, answer ‘impossible to answer‘:\nQuestion: How much did students pay in total to go to Harvard in 2007?[/INST]
Full Model: It is impossible to answer according to the passage. The passage only states that families with incomes below $60,000 pay nothing for their children to attend Harvard starting from the 2007-2008 school year.
Evaluation: ✓
Expert Num=4:
EEP: It is impossible to answer according to the passage. The passage does not provide information about the total amount paid by students in 2007. Evaluation: ✓
NAEE: The passage states that families with incomes below $60,000 pay nothing for their children to attend Harvard in 2007, and families with incomes between $60,000 to $80,0 Evaluation: ×
Frequency: It is not possible to answer this question because the total cost of attendance for the year 2007 is not provided in the passage. The passage only mentions that families with incomes below $60,000 pay nothing for Evaluation: ×*
Soft Activation: It is impossible to answer this question according to the passage. The passage only provides total cost information for one year (2012-13) and the total amount of grants for 2009, but it does not provide information Evaluation: ✓
Expert Num=2:
EEP: Impossible to answer. The passage does not provide information on the total cost of attendance for students in 2007. Evaluation: ✓
NAEE: The passage states that families with incomes below $60,000 pay nothing for their children to attend Harvard in 2007, and families with incomes between $60,000 to $80,0. Evaluation: ×
Frequency: The average cost for the 2007 students is $120,000.\n\n 2007 students paid $120,000 for the 2009 students.\n\n Evaluation: ×
Soft Activation: The average cost of the students’s 2007 is $38,000, with a total cost of $57,000.\n\n 2008: How much did the students’s Evaluation: ×
Figure 10: Example of Mixtral 8×7B-Instruct on SQuAD. * means the answer is actually correct but was marked as wrong because it does not match the answer template.
User: <s>[INST]Economist Simon Kuznets argued that levels of economic inequality are in large part the result of stages of development. According to Kuznets, countries with low levels of development have relatively equal distributions of wealth… \nAccording to the above passage, answer the following question. If it is impossible to answer according to the passage, answer ‘impossible to answer‘:\nQuestion: What profession does Simon Kuznets have?[/INST]
Full Model: The passage does not provide information on Simon Kuznets’ profession. It only mentions his argument about the relationship between economic development and inequality. Evaluation: ×
Expert Num=4:
EEP: Economist Evaluation: ✓
NAEE: Economist\n\nQuestion: What is Simon Kuznets’ view on the relationship between economic development and income inequality?\nAnswer: According to Simon Kuznets, as a country develops and acquires more capital, it leads Evaluation: ✓
Frequency: The passage does not provide information about the profession of Simon Kuznets. It only mentions that he is an economist and has argued that levels of economic inequality are in large part the result of stages of development. Evaluation: ×
Soft Activation: The passage does not mention Simon Kuznets’s profession. It only mentions his argument about economic development and inequality. Evaluation: ×
Expert Num=2:
EEP: Economist Evaluation: ✓
NAEE: Simon Kuznets is not a real economist, so it is impossible to answer this question. Evaluation: ✓*
Frequency: Simon Kuznits has a high level of development and a high level of development. According to the passage, the development of the country is in large part the result of the development of the country. According to the passage, the development of the Evaluation: ×
Soft Activation: Simon Kuznets is a real estate agent.\n\n‘impossible to answer‘ is a real estate agent.\n\n‘impossible to answer‘ is a real estate agent.\n\n‘impossible to answer‘ is a real Evaluation: ×
Figure 11: Example of Mixtral 8×7B-Instruct on SQuAD. * means that the answer is actually incorrect but was marked as correct due to flaws in the evaluation method.