∗Work done during an internship at Kuaishou Technology. †Correspondence to Xi Li.

Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model

Longrong Yang1,∗, Dong Sheng3, Chaoxiang Cai2, Fan Yang3, Size Li3, Di Zhang3, Xi Li1,†
1College of Computer Science and Technology, Zhejiang University
2School of Software Technology, Zhejiang University
3Kuaishou Technology
Abstract

The Mixture-of-Experts (MoE) has gained increasing attention in the study of Large Vision-Language Models (LVLMs). It uses a sparse model to replace the dense model, achieving comparable performance while activating fewer parameters during inference, thus significantly reducing the inference cost. Existing MoE methods in LVLMs encourage different experts to handle different tokens, and thus they employ a router to predict the routing for each token. However, the predictions are based solely on sample features and do not truly reveal the optimization direction of tokens. This can lead to severe optimization conflicts between different tokens within an expert. To address this problem, this paper proposes a novel method based on token-level gradient analysis. Specifically, we first use token-level gradients to identify conflicting tokens in experts. Then, we add a specialized loss tailored to eliminate conflicts among tokens within each expert. Our method can serve as a plug-in for diverse Large Vision-Language Models, and extensive experimental results demonstrate the effectiveness of our method. The code will be publicly available at https://github.com/longrongyang/STGC.

1 Introduction

Large Vision-Language Models (LVLMs) have recently demonstrated significant advancements by integrating visual processing modules into Large Language Models (LLMs), bringing LLMs closer to Artificial General Intelligence. Many recent works (Zhang et al., 2023a; Bai et al., 2023b; Zhang et al., 2023b; Zhao et al., 2023; Chen et al., 2023b) show that large model size and large dataset size are especially important to enhance intelligence, i.e., the scaling law. Moreover, once the size is large enough, models exhibit “emergent abilities,” which are absent from small models and appear only in large ones. Thus, a series of studies (Li et al., 2022; Dai et al., 2023; Liu et al., 2023b) have expanded the capacity of LVLMs to 13 billion parameters, leading to state-of-the-art performance on various tasks.

In realistic applications, deploying such large models requires considerable computational resources, making it extremely expensive. A popular solution for reducing the inference cost is the Mixture-of-Experts (MoE) architecture. The MoE, a form of sparsely activated model, has been verified by many works (Fedus et al., 2022; Zoph et al., 2022; Komatsuzaki et al., 2022) to achieve performance comparable to dense models while activating fewer parameters during inference. This characteristic has recently brought the MoE considerable attention. In the MoE, a fundamental problem is the routing of tokens. To route tokens to different experts, existing methods (Lin et al., 2024; Dai et al., 2024) typically use a router, such as a linear layer, to predict the probability of each token belonging to different experts. Each token is then dispatched to the experts with the Top-$k$ predicted probabilities. Additionally, to prevent load imbalance, existing methods usually incorporate a load-balancing loss, which aims to equalize the distribution of tokens among the experts.

Figure 1: (a) In this work, our goal is to reduce gradient conflicts among tokens within an expert. (b) Our method achieves this goal by reducing the routing scores of the identified conflicting tokens on their corresponding experts, thus encouraging these tokens to be assigned to other experts rather than their current ones. This strategy promotes further specialization of experts and leads to an increase in model performance.

Beyond balancing the load, another important goal of routing tokens to different experts is to reduce interference among diverse datasets. To achieve this, some recent works have performed clustering on instruction embeddings (Gou et al., 2023), grouping similar samples so that they are sent to the same expert for a preliminary sample-level division. However, because routing during training is at the token level, existing methods struggle with conflicts between tokens within samples. Meanwhile, samples with similar features can have distinct optimization objectives, thus leading to conflicts. The gradient directly indicates the optimization direction, so in this work, we explore data interference in MoE through the lens of token-level gradients. As shown in Figure 1 (a), our basic idea is to reduce gradient conflicts among tokens within an expert, to address severe data interference during the learning of LVLMs under complex and real-world scenarios.

To address the token conflict problem within the MoE, we propose a novel regularization loss based on token-level gradients. Our method consists of two steps. (i) Conflicting Token Identification. After processing a batch of data, we perform a backward pass to obtain the token-level gradients for each expert, without updating any model parameters. Within an expert, we take the mean of the gradients of all its tokens as the average gradient, which represents the holistic optimization direction of the expert. Tokens whose gradients have negative cosine similarity to the average gradient are identified as conflicting tokens. (ii) Conflict Elimination Loss. For the conflicting tokens, we record their routing scores predicted by the router. We then reduce the routing scores of these tokens on their current experts. As shown in Figure 1 (b), this strategy encourages routing conflicting tokens to other experts, reducing interference among diverse data.

In conclusion, our contribution can be summarized as:

  • Beyond relying on sample-level cues, we propose using token-level gradients to identify conflicts among tokens within an expert.

  • We propose a novel conflict elimination loss to resolve conflicts among tokens within an expert, promoting the further specialization of experts.

  • Designed as a plug-in, our method can be seamlessly integrated into existing Large Vision-Language Models (LVLMs). Extensive experiments confirm its effectiveness.

2 Related Works

2.1 Large Vision-language Model

Large Language Models (LLMs) have demonstrated strong instruction following and generalization capabilities. To maintain these capabilities while incorporating visual information, Large Vision-Language Models (LVLMs) such as GPT-4 and LLaVA utilize frozen visual encoders and trainable visual projectors to integrate visual data into LLMs. They typically encode visual information into visual tokens and use these tokens to condition the adaptation of language tokens within LLMs (OpenAI, 2023; Touvron et al., 2023a; Wei et al., 2022; Touvron et al., 2023b; Zheng et al., 2023; Team, 2023; Sun et al., 2023; Du et al., 2021; Bai et al., 2023a; Yang et al., 2023; Penedo et al., 2023; Taori et al., 2023). Recent works have focused on improving performance through two types of methods. The first type optimizes training strategies, e.g.,  (Bai et al., 2023b; Chen et al., 2023a). Most works belong to the second type, focusing on enhancing visual components, including expanding visual instruction-tuning datasets (Liu et al., 2023a; Zhang et al., 2023b), improving image encoders (Chen et al., 2023d; Bai et al., 2023b), and aligning the input and projection layers (Lin et al., 2023; Cha et al., 2023; Alayrac et al., 2022; Dai et al., 2023; Ye et al., 2023; Zhao et al., 2023). These efforts, particularly the expansion of visual instruction-tuning datasets and the increase in model scales, have significantly enhanced the visual understanding abilities of LVLMs.

2.2 Mixture-of-Experts (MoE)

The Mixture-of-Experts (MoE) is a hybrid model, consisting of multiple sub-models known as experts, and has shown potential in scaling up models (Shazeer et al., 2017). The key concept of MoE lies in the use of a router to determine the token set that each expert handles, aiming to reduce interference among tokens from different types of samples. Early MoE works utilized the hard routing mode, where each expert is typically assigned a specific role. For example, a series of works (Bao et al., 2022; Long et al., 2023; Satar et al., 2022; Wang et al., 2022; Shen et al., 2023) consider language and vision gaps in multi-modal data (Liang et al., 2022), decoupling experts by modal category and assigning a specific role to each expert. The key feature of hard routers is that they eliminate the need to learn routing assignments. Hard routing has also been widely applied in task-specific MoEs (Li et al., 2023c; Zhu et al., 2022; Ma et al., 2023; Kudugunta et al., 2021).

In contrast, soft routers enable a dynamic allocation of tokens among different experts, allowing each expert to focus on its expertise and achieving model sparsity. Recent LLM (Shazeer et al., 2017; Lepikhin et al., 2020; Fedus et al., 2022; Zoph et al., 2022; Komatsuzaki et al., 2022) and LVLM works have mainly focused on soft routers. For instance, GShard (Lepikhin et al., 2020) incorporates MoE into transformers and achieves excellent performance. Lifelong-MoE (Chen et al., 2023c) uses MoE to mitigate the challenge of catastrophic forgetting in lifelong learning. MoE-LLaVA (Lin et al., 2024) and LLaVA-MoLE (Chen et al., 2024) utilize MoE and its variants to empower LVLMs. The approach of expert segmentation increases the number of experts to achieve greater specialization. DeepSeekMoE (Dai et al., 2024) and QwenMoE (Bai et al., 2023a) segment experts by splitting the FFN intermediate hidden dimension. Cluster-based methods, such as MoCLE (Gou et al., 2023), cluster samples and then route those in the same cluster to the same expert. DEMix (Gururangan et al., 2021) clusters samples according to their task type, ensuring that samples in the same cluster are routed to the same expert. Existing methods mainly operate at the sample level and rely solely on either features or labels, making it challenging to address conflicts between tokens within the same sample. This work aims to use token-level gradients to identify and resolve token optimization conflicts within an expert in the MoE.

3 Methodology

Figure 2: Our pipeline. (a) Conflicting Token Identification. When the gradient of a token has a sufficiently low cosine similarity to the average gradient of its assigned expert, this token is marked as a conflicting token. (b) Conflict Elimination Loss. We propose a loss term aimed at discouraging the routing of conflicting tokens to their assigned experts, achieved by reducing the routing scores for these tokens for their assigned experts.

3.1 Overview

Large Vision-Language Model: A Large Vision-Language Model (LVLM) aims to effectively integrate the capabilities of a pre-trained LLM and a visual model. Specifically, given an RGB image $\mathbf{v}\in\mathbb{R}^{H\times W\times 3}$, where $H$ and $W$ are its height and width, the vision encoder processes the input image to obtain a visual token sequence $\mathcal{Z}=[z_{1},z_{2},\cdots,z_{P}]\in\mathbb{R}^{P\times C}$, where $P$ is the sequence length of visual tokens, calculated as $P=\frac{H\times W}{14^{2}}$. A visual projection layer is then used to map $\mathcal{Z}\in\mathbb{R}^{P\times C}$ to $\mathcal{V}\in\mathbb{R}^{P\times D}$, where $D$ represents the hidden size of the Large Language Model (LLM).
Similarly, the text undergoes word embedding by layer $g$ and is projected to obtain the sequence tokens $\mathcal{T}=[t_{1},t_{2},\cdots,t_{N}]\in\mathbb{R}^{N\times D}$, where $N$ represents the sequence length of text tokens. Subsequently, the visual and text tokens are concatenated together and fed into a large language model. This model consists of stacked multi-head self-attention (MSA) and feed-forward neural networks (FFN), with layer normalization (LN) and residual connections typically used within each block:

$$\mathbf{x}_{0}=[v_{1},v_{2},\cdots,v_{P},\cdots,t_{1},t_{2},\cdots,t_{N}], \tag{1}$$
$$\mathbf{x}_{\ell}^{\prime}=\mathrm{MSA}(\mathrm{LN}(\mathbf{x}_{\ell-1}))+\mathbf{x}_{\ell-1},\quad \ell\in\{1,\ldots,L\}, \tag{2}$$
$$\mathbf{x}_{\ell}=\mathrm{FFN}(\mathrm{LN}(\mathbf{x}_{\ell}^{\prime}))+\mathbf{x}_{\ell}^{\prime},\quad \ell\in\{1,\ldots,L\}, \tag{3}$$
$$\mathcal{Y}=\mathrm{LN}(\mathbf{x}_{L}), \tag{4}$$

where $L$ is the number of layers of the LLM. The LVLM generates an output text sequence $\mathcal{Y}=[y_{1},y_{2},\cdots,y_{K}]\in\mathbb{R}^{K\times D}$ by progressively generating each element, where $K=P+D$ represents the total length of the output text sequence. Then, the outputs are optimized through a generative loss in an auto-regressive manner. The loss is formulated as:

$$\mathcal{L}_{\text{main}}=-\sum_{i=1}^{D}\log p_{\theta}\left(\mathcal{Y}^{[P+i]}\mid\mathcal{V},\mathcal{T}^{[:i-1]}\right), \tag{5}$$

where $\theta$ denotes the trainable parameters. The auto-regressive loss for the token $t_{n}$ is abbreviated as $\mathcal{L}_{n}(\theta)$.
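The decomposition of the main loss into per-token terms is what makes token-level gradients well defined. As a minimal sketch, assuming we already have the probability the model assigns to each supervised target token (the values below are hypothetical), the per-token losses and their sum can be computed as:

```python
import math

def per_token_losses(target_probs):
    """Negative log-likelihood of each supervised token (one term per token)."""
    return [-math.log(p) for p in target_probs]

# Hypothetical probabilities assigned by the model to three target tokens.
target_probs = [0.9, 0.5, 0.25]

token_losses = per_token_losses(target_probs)  # one loss term per token
main_loss = sum(token_losses)                  # the auto-regressive loss is their sum
```

Each entry of `token_losses` plays the role of one per-token loss term, and differentiating a single entry with respect to the parameters of an expert gives the gradient of that token.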

MoE: The Mixture-of-Experts (MoE) layer is used to replace the FFN layer, e.g., (Dai et al., 2024). An MoE layer consists of multiple FFNs, each representing an expert, i.e., $\mathcal{E}=[e_{1},e_{2},\cdots,e_{E}]$, where $E$ is the number of experts. The router is typically a linear layer that predicts the probability of each token being assigned to each expert, and we formulate this process as:

$$p_{\text{moe}}(\mathbf{x})_{i}=\frac{e^{z_{\text{moe}}(\mathbf{x})_{i}}}{\sum_{j=1}^{E}e^{z_{\text{moe}}(\mathbf{x})_{j}}}, \tag{6}$$

where $z_{\text{moe}}(\mathbf{x})=\mathbf{W}\cdot\mathbf{x}$ and $p_{\text{moe}}(\mathbf{x})_{i}$ is the routing score of $\mathbf{x}$ for the $i$-th expert. The matrix $\mathbf{W}\in\mathbb{R}^{D\times E}$ represents the lightweight trainable parameters for routing. We calculate a weighted sum of the outputs from the Top-$k$ experts with the highest softmax probabilities, where the weighting of each expert is related to the routing score:

$$\begin{aligned}
w_{\text{moe}}(\mathbf{x})_{i} &= \frac{e^{z_{\text{moe}}(\mathbf{x})_{i}}}{\sum_{j=1}^{k}e^{z_{\text{moe}}(\mathbf{x})_{j}}}, \\
\mathrm{MoE}(\mathbf{x}) &= \sum_{i=1}^{k}w_{\text{moe}}(\mathbf{x})_{i}\cdot e_{i}(\mathbf{x}),
\end{aligned} \tag{7}$$

where $w_{\text{moe}}(\mathbf{x})_{i}$ represents the weight of the $i$-th expert for $\mathbf{x}$, and $e_{i}(\mathbf{x})$ is the output of the $i$-th expert. We express $\mathcal{L}_{n}(\theta)$ as $\mathcal{L}_{n}(\theta_{e_{i}},\theta^{\prime})$, where $\theta_{e_{i}}$ denotes the parameters of the $i$-th expert, and $\theta^{\prime}$ represents all other parameters except for $\theta_{e_{i}}$.
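The routing described by Eqs. (6) and (7) can be sketched for a single token as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the routing logits are given directly instead of being computed as $\mathbf{W}\cdot\mathbf{x}$, and the four toy experts, which only scale their input, are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(logits, k):
    """Return (routing scores over all experts, {expert index: Top-k weight})."""
    scores = softmax(logits)  # Eq. (6): softmax over all E experts
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    top_weights = softmax([logits[i] for i in top])  # Eq. (7): softmax over the Top-k only
    return scores, dict(zip(top, top_weights))

def moe_output(x, logits, experts, k):
    """Weighted sum of the outputs of the Top-k experts."""
    _, weights = route_top_k(logits, k)
    out = [0.0] * len(x)
    for i, w in weights.items():
        expert_out = experts[i](x)
        out = [o + w * y for o, y in zip(out, expert_out)]
    return out

# Hypothetical toy experts: expert i multiplies its input by a fixed scale.
experts = [lambda x, s=s: [s * xi for xi in x] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]  # stand-in for the router output W . x

scores, weights = route_top_k(logits, k=2)
y = moe_output([1.0, -1.0], logits, experts, k=2)
```

Here experts 1 and 3 share the largest logit, so each receives weight 0.5 and the output is the average of their outputs. Only the selected experts are evaluated, which is where the inference savings of the sparse model come from.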

Our Method: In this work, our goal is to propose a novel learning strategy for the Mixture-of-Experts (MoE) to reduce interference among diverse data. Specifically, as illustrated in Figure 2, we model the interference among tokens within an expert using token-level gradients, and then design a novel loss function that requires tokens with conflicting gradients to be handled by different experts. The details of these modules will be introduced in the subsequent sections.

3.2 Conflicting Token Identification

For the MoE, the key to reducing interference among diverse data is preventing optimization conflicts between tokens within an expert. One approach is to cluster samples based on their features and use the cluster results to decide which expert the samples should be assigned to. Alternatively, decisions are made based on the specific task associated with each sample. However, these methods have two main limitations: (i) They operate at the sample level, whereas the routing is at the token level; routing all tokens within a sample to the same expert does not effectively address conflicts between tokens within the sample. (ii) The optimization direction is jointly influenced by features and labels, but these methods rely on only one of these factors. To address these issues, we propose using token-level gradients, which can accurately depict optimization directions at the token level, to identify optimization conflicts between tokens within an expert.

First, we introduce the negative impact brought by gradient conflicts. Without loss of generality, we discuss two distinct text tokens, $t_{n}$ and $t_{n^{\prime}}$, as shown in Figure 2 (a). Assume that both $t_{n}$ and $t_{n^{\prime}}$ are processed by the expert $e_{i}$. Let $\mathbf{g}_{n}=\nabla_{\theta_{e_{i}}}\mathcal{L}_{n}(\theta_{e_{i}},\theta^{\prime})$ denote the gradient of the token $t_{n}$ with respect to the expert parameters $\theta_{e_{i}}$.
A small change in $\theta_{e_{i}}$ in the direction of $-\mathbf{g}_{n}$ is given by $\theta_{e_{i}}\leftarrow\theta_{e_{i}}-\delta\mathbf{g}_{n}$, with a step size $\delta$. The effect of this change on the performance of another token $t_{n^{\prime}}$ is measured by:

$$\Delta\mathcal{L}_{n^{\prime}}=\mathcal{L}_{n^{\prime}}(\theta_{e_{i}}-\delta\mathbf{g}_{n},\theta^{\prime})-\mathcal{L}_{n^{\prime}}(\theta_{e_{i}},\theta^{\prime})=-\delta\,\mathbf{g}_{n}\cdot\mathbf{g}_{n^{\prime}}+o(\delta), \tag{8}$$

where the second equality is obtained by first-order Taylor approximation. Likewise, the effect of an update of $\theta_{e_{i}}$ in the direction of the negative gradient of token $n^{\prime}$ (i.e., $-\mathbf{g}_{n^{\prime}}$) on the performance of token $n$ is $\Delta\mathcal{L}_{n}=-\delta\,\mathbf{g}_{n}\cdot\mathbf{g}_{n^{\prime}}+o(\delta)$. Thus, the model update for token $n$ is considered to negatively affect token $n^{\prime}$ when $\mathbf{g}_{n}\cdot\mathbf{g}_{n^{\prime}}<0$, since it increases the loss of token $n^{\prime}$, and vice versa.
We define $\mathbf{g}_{n}$ and $\mathbf{g}_{n^{\prime}}$ as conflicting gradients when their cosine similarity $\cos\phi_{nn^{\prime}}<\tau$, where $\tau$ is a threshold and $\phi_{nn^{\prime}}$ is the angle between $\mathbf{g}_{n}$ and $\mathbf{g}_{n^{\prime}}$. Gradient conflicts can cause the optimizer to struggle to converge to a desirable solution, especially when there is a large difference in gradient magnitudes.
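The first-order argument in Eq. (8) can be checked numerically on a toy problem. The sketch below is only an illustration under assumptions: each token has a hypothetical quadratic loss $\frac{1}{2}\lVert\theta-a\rVert^{2}$ with its own target $a$, and a two-dimensional vector stands in for the expert parameters.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def loss(theta, target):
    """Hypothetical per-token loss: 0.5 * ||theta - target||^2."""
    return 0.5 * sum((t - a) ** 2 for t, a in zip(theta, target))

def grad(theta, target):
    """Gradient of the quadratic loss with respect to theta."""
    return [t - a for t, a in zip(theta, target)]

theta = [0.0, 0.0]            # stand-in for the expert parameters
target_n = [1.0, 0.0]         # token n pulls theta one way
target_n_prime = [-1.0, 0.5]  # token n' pulls it roughly the opposite way

g_n = grad(theta, target_n)
g_n_prime = grad(theta, target_n_prime)

delta = 0.01
theta_new = [t - delta * g for t, g in zip(theta, g_n)]  # step along -g_n

# Actual change in the loss of token n' versus the first-order prediction.
actual = loss(theta_new, target_n_prime) - loss(theta, target_n_prime)
predicted = -delta * dot(g_n, g_n_prime)
similarity = cosine(g_n, g_n_prime)
```

In this setup the two gradients have a negative dot product, so a step that lowers the loss of token $n$ raises the loss of token $n^{\prime}$, and the measured increase agrees with the first-order prediction up to a term of order $\delta^{2}$.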

We then define the conflicting token. The parameters of the expert $e_{i}$ are updated using the average gradient. Let the tokens processed by the expert $e_{i}$ be denoted as $\{t_{1},\cdots,t_{N_{e_{i}}}\}$. The average gradient on the expert $e_{i}$ is then:

$$\mathbf{g}_{mean}=\frac{\sum_{n=1}^{N_{e_{i}}}\mathbf{g}_{n}}{N_{e_{i}}}. \tag{9}$$

The average gradient indicates the direction of parameter updates for the expert at each iteration. When the gradient of a token and the average gradient are conflicting gradients, it suggests that the token is detrimental to the learning of the expert $e_{i}$, so this token should be considered for assignment to another expert. A formal definition of a conflicting token is provided as follows:

Definition 1 (Conflicting Token)

The token $t_{n}$ is said to be a conflicting token if $\mathbf{g}_{n}$ and $\mathbf{g}_{mean}$ are conflicting gradients, where $\mathbf{g}_{mean}$ is the average gradient of all tokens in the expert of $t_{n}$.

Lastly, we detail our method for identifying conflicting tokens, as illustrated in Figure 2(a). First, we unfreeze only the expert layer in the MoE and compute the main loss. We then perform back-propagation to calculate the token-level gradient of each token within the expert layer. Subsequently, we calculate the average gradient, as well as the cosine similarity between the gradient of each token and the average gradient. Finally, when the cosine similarity is less than the threshold $\tau$, we mark the token as a conflicting token. Identifying these tokens allows us to use the precise optimization direction represented by the gradients to reduce interference among diverse data in the next section.
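The identification step can be illustrated with a minimal Python sketch. It assumes that the per-token gradients of one expert have already been extracted and flattened into plain lists of floats; the function names (`mean_gradient`, `cosine_similarity`, `find_conflicting_tokens`) and the toy gradients are hypothetical and are not taken from the released code.

```python
import math


def mean_gradient(grads):
    """Average the per-token gradients of one expert, as in Eq. (9)."""
    num_tokens = len(grads)
    dim = len(grads[0])
    return [sum(g[d] for g in grads) / num_tokens for d in range(dim)]


def cosine_similarity(u, v, eps=1e-12):
    """Cosine of the angle between two flattened gradient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)  # eps guards against zero vectors


def find_conflicting_tokens(grads, tau=0.0):
    """Return indices of tokens whose gradient has cosine similarity < tau
    with the average gradient of the expert."""
    g_mean = mean_gradient(grads)
    return [n for n, g in enumerate(grads) if cosine_similarity(g, g_mean) < tau]


# Toy gradients (hypothetical values): three tokens agree, one points the other way.
toy_grads = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.2], [-1.0, 0.0]]
conflicting = find_conflicting_tokens(toy_grads, tau=0.0)
print(conflicting)
```

In this toy case only the last token has a negative cosine similarity with the average gradient, so it is the only one flagged.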

3.3 Conflict Elimination Loss

The learning of a conflicting token tends to increase the loss of most other tokens within its corresponding expert. Thus, once a conflicting token is identified, it should be reassigned to a different expert for processing. To achieve this goal, we propose a simple yet effective regularization loss by constraining the routing scores predicted by the router, as shown in Figure 2(b).

Specifically, for each expert within every layer, we first identify the conflicting tokens using token-level gradients. Then, the router predicts the probability of each token being assigned to different experts. For a conflicting token $t_n$, we record the routing logits $z_{\text{moe}}(t_n)$, the routing scores $p_{\text{moe}}(t_n)$, and the expert ID $id_{\text{moe}}$ it is currently assigned to. Using the expert ID $id_{\text{moe}}$, we calculate the loss:

$$
\begin{aligned}
z'_{\text{moe}}(t_n) &= -z_{\text{moe}}(t_n), \\
p'_{\text{moe}}(t_n)_i &= \frac{e^{z'_{\text{moe}}(t_n)_i}}{\sum_{j=1}^{E} e^{z'_{\text{moe}}(t_n)_j}}, \\
\mathcal{L}_{\text{token}} &= \frac{1}{N \cdot E} \sum_{n=1}^{N} \sum_{i=1}^{E} \log\big(p'_{\text{moe}}(t_n)_i\big) \cdot q_{\text{moe}}(t_n)_i,
\end{aligned}
\tag{10}
$$

where $N$ is the number of conflicting tokens, $E$ is the number of experts, and $p'_{\text{moe}}(t_n)$ represents the inverted routing score for the token $t_n$. $q_{\text{moe}}(t_n)$ is a one-hot vector with $q_{\text{moe}}(t_n)_{id_{\text{moe}}} = 1$. This loss is designed to encourage the reassignment of conflicting tokens to different experts. When $k$ in Top-$k$ exceeds 1, a token may be assigned to multiple experts. Our method considers all tokens within each expert, regardless of whether a token has also been assigned to other experts.
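A small Python sketch may help make the terms of Eq. (10) concrete. It assumes the routing logits of each conflicting token are available as a plain list; the names `inverted_softmax` and `token_term` and the example logits are hypothetical. The function returns the averaged log-probability term exactly as Eq. (10) is printed; reading it as a cross-entropy to be minimized would negate this value, which is an interpretation on our part rather than something stated in the text.

```python
import math


def inverted_softmax(logits):
    """p' = softmax(z') with z' = -z, i.e. the inverted routing scores."""
    neg = [-z for z in logits]
    shift = max(neg)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in neg]
    total = sum(exps)
    return [e / total for e in exps]


def token_term(conflicts, num_experts):
    """Evaluate (1 / (N * E)) * sum_n sum_i log(p'_i) * q_i as printed in Eq. (10).

    conflicts: list of (routing_logits, current_expert_id) pairs, one per
    conflicting token. Since q is one-hot at the current expert, the inner
    sum reduces to log(p') at that expert.
    """
    num_conflicts = len(conflicts)
    if num_conflicts == 0:
        return 0.0
    total = 0.0
    for logits, expert_id in conflicts:
        total += math.log(inverted_softmax(logits)[expert_id])
    return total / (num_conflicts * num_experts)


# Hypothetical logits for one conflicting token currently routed to expert 0.
high_logit = token_term([([2.0, 0.0, 0.0, 0.0], 0)], num_experts=4)
low_logit = token_term([([-2.0, 0.0, 0.0, 0.0], 0)], num_experts=4)
print(high_logit, low_logit)
```

The term moves toward zero as the logit of the current expert decreases, which matches the stated goal of pushing conflicting tokens away from their current expert.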

3.4 Total Loss

To encourage experts to handle tokens in a balanced manner, the differentiable load balancing loss introduced by Fedus et al. (2022) is typically defined for each MoE layer as follows:

$$\mathcal{L}_{\text{aux}} = E \cdot \sum_{i=1}^{E} \mathcal{F}_i \cdot \mathcal{P}_i, \tag{11}$$

where $\mathcal{F}_i$ represents the fraction of tokens processed by the expert $e_i$, and $\mathcal{P}_i$ represents the average routing probability assigned to the expert $e_i$.
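As an illustration of Eq. (11), the following Python sketch computes the load balancing loss for one MoE layer. It assumes Top-1 dispatch when counting the fraction of tokens per expert (how the fraction is counted for Top-2 is not specified here), and the name `load_balancing_loss` and the toy inputs are hypothetical.

```python
def load_balancing_loss(assignments, router_probs, num_experts):
    """L_aux = E * sum_i F_i * P_i for a single MoE layer.

    assignments: expert id chosen for each token (Top-1 dispatch assumed).
    router_probs: per-token routing probabilities, one list of length E per token.
    """
    num_tokens = len(assignments)
    counts = [0] * num_experts
    for expert_id in assignments:
        counts[expert_id] += 1
    frac = [c / num_tokens for c in counts]  # F_i: fraction of tokens per expert
    mean_prob = [
        sum(p[i] for p in router_probs) / num_tokens  # P_i: mean routing probability
        for i in range(num_experts)
    ]
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))


# Perfectly balanced routing over four experts.
balanced = load_balancing_loss([0, 1, 2, 3], [[0.25] * 4] * 4, num_experts=4)
# All tokens collapse onto expert 0.
collapsed = load_balancing_loss([0, 0, 0, 0], [[1.0, 0.0, 0.0, 0.0]] * 4, num_experts=4)
print(balanced, collapsed)
```

Under these assumptions the loss equals 1 for uniform routing and grows to $E$ when every token is sent to a single expert, so minimizing it favors balanced expert usage.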

In conclusion, the total loss is given by:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{main}} + \alpha \cdot \mathcal{L}_{\text{aux}} + \beta \cdot \mathcal{L}_{\text{token}}, \tag{12}$$

where $\alpha$ and $\beta$ are hyper-parameters.

4 Experiments

Table 1: Comparison among different LVLMs on image understanding benchmarks. “Act.”, “V”, “S”, “Q”, “P”, and “M” represent activated parameters, Vicuna (Chiang et al., 2023), StableLM (Team, ), Qwen (Bai et al., 2023a), Phi-2 (Microsoft, 2023), and MobileLLaMA (Chu et al., 2023), respectively. Evaluation benchmarks include VQA$^{\text{v2}}$ (Goyal et al., 2017a); GQA (Hudson & Manning, 2019); VizWiz (Gurari et al., 2018); SQA$^{\text{I}}$: ScienceQA-IMG (Lu et al., 2022); VQA$^{\text{T}}$: TextVQA (Singh et al., 2019b); POPE (Li et al., 2023a); MME (Fu et al., 2023); MMB: MMBench (Liu et al., 2023d); MM-Vet (Yu et al., 2023a). denotes that there is some overlap in the training data. The best results are indicated by boldface.
Image Question Answering: VQA$^{\text{v2}}$, GQA, VizWiz, SQA$^{\text{I}}$, VQA$^{\text{T}}$. Benchmark Toolkit: POPE, MME, MMB, MM-Vet.

| Method | LLM | Act. | VQA$^{\text{v2}}$ | GQA | VizWiz | SQA$^{\text{I}}$ | VQA$^{\text{T}}$ | POPE | MME | MMB | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense Model | | | | | | | | | | | |
| LLaVA-1.5 | V-13B | 13B | 80.0 | 63.3 | 53.6 | 71.6 | 61.3 | 85.9 | 1531.3 | 67.7 | 35.4 |
| Qwen-VL | Q-7B | 6.7B | 78.8 | 59.3 | 35.2 | 67.1 | 63.8 | - | - | 38.2 | - |
| LLaVA-1.5 | V-7B | 6.7B | 78.5 | 62.0 | 50.0 | 66.8 | 58.2 | 85.9 | 1510.7 | 63.4 | 30.5 |
| TinyGPT-V | P-2.7B | 2.7B | - | 33.6 | 33.4 | - | - | - | - | - | - |
| MobileVLM | M-2.7B | 2.7B | - | 59.0 | - | 61.0 | 47.5 | 84.9 | 1288.9 | 59.6 | - |
| LLaVA-Phi | P-2.7B | 2.7B | 71.4 | - | 35.9 | 68.4 | 48.6 | 85.0 | 1335.1 | 59.8 | 28.9 |
| Sparse Model | | | | | | | | | | | |
| MoE-LLaVA | S-1.6B | 2.0B | 76.7 | 60.3 | 36.2 | 62.6 | 50.1 | 85.7 | 1318.2 | 60.2 | 26.9 |
| Our Method | S-1.6B | 2.0B | 76.9 | 60.9 | 37.7 | 62.6 | 50.7 | 85.9 | 1355.1 | 60.7 | 28.2 |
| MoE-LLaVA | P-2.7B | 3.6B | 77.6 | 61.4 | 43.9 | 68.5 | 51.4 | 86.3 | 1423.0 | 65.2 | 34.3 |
| Our Method | P-2.7B | 3.6B | 78.0 | 62.1 | 47.2 | 68.1 | 52.3 | 86.9 | 1429.2 | 66.7 | 33.3 |

Table 2: Zero-shot object hallucination evaluation results. “Yes” means the proportion of positive responses to the given question.
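
| Method | LLM | Act. | Adversarial Acc | Adversarial F1-Score | Adversarial Yes | Popular Acc | Popular F1-Score | Popular Yes | Random Acc | Random F1-Score | Random Yes |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Dense Model | | | | | | | | | | | |
| mPLUG-Owl | L-7B | 6.7B | 82.4 | 81.6 | 45.2 | 85.5 | 84.3 | 42.1 | 86.3 | 85.3 | 42.3 |
| MM-GPT | L-7B | 6.7B | 50.0 | 66.7 | 100.0 | 50.0 | 66.7 | 100.0 | 50.0 | 66.7 | 100.0 |
| LLaVA-1.5 | V-13B | 13B | 85.5 | 84.4 | 43.3 | 87.4 | 86.2 | 41.3 | 88.0 | 87.1 | 41.7 |
| Sparse Model | | | | | | | | | | | |
| MoE-LLaVA | S-1.6B | 2.0B | 86.9 | 85.7 | 41.7 | 85.3 | 84.2 | 43.5 | 88.0 | 87.1 | 41.6 |
| Our Method | S-1.6B | 2.0B | 85.0 | 84.1 | 44.4 | 87.2 | 86.1 | 42.2 | 88.2 | 87.4 | 42.1 |
| MoE-LLaVA | P-2.7B | 3.6B | 85.9 | 84.9 | 43.2 | 87.5 | 86.4 | 41.8 | 88.5 | 87.7 | 41.8 |
| Our Method | P-2.7B | 3.6B | 86.5 | 85.5 | 43.4 | 88.0 | 86.9 | 41.9 | 89.0 | 88.2 | 41.8 |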

4.1 Experimental Setup

Benchmark: In this work, we follow existing works (Liu et al., 2023c; Lin et al., 2024) to evaluate our method. Our method is used only in the instruction tuning stage, which uses the LLaVA 1.5-mix-665k dataset (Liu et al., 2023c). We evaluate on a collection of academic-task-oriented benchmarks and recent benchmarks specifically designed for instruction-following Large Multimodal Models (LMMs). For academic-task-oriented benchmarks, VQA-v2 (Goyal et al., 2017b) and GQA (Hudson & Manning, 2019) assess the model's visual perception capabilities through open-ended short answers. The VizWiz dataset (Gurari et al., 2018), containing 8,000 images, evaluates the model's zero-shot generalization on visual questions asked by visually impaired people. ScienceQA (Lu et al., 2022), a multiple-choice benchmark, evaluates the model's zero-shot generalization on scientific question answering. TextVQA (Singh et al., 2019a) focuses on text-rich visual question answering tasks.

For recent benchmarks proposed for instruction-following LMMs, POPE (Li et al., 2023b) evaluates the degree of hallucination in model responses on three sampled subsets of COCO (Lin et al., 2014): Random, Popular, and Adversarial. MME (Fu et al., 2023) assesses the model's visual perception with yes/no questions. MMBench (Liu et al., 2023d) evaluates the robustness of model answers with all-round shuffling on multiple-choice answers. MM-Vet (Yu et al., 2023b) evaluates the model's capabilities in engaging in visual conversations on a diverse range of tasks, and assesses the correctness and helpfulness of the responses using the GPT-4 evaluation framework.

Baseline: Our main baseline in this work is MoE-LLaVA (Lin et al., 2024). MoE-LLaVA incorporates a Mixture-of-Experts (MoE) into Large Vision-Language Models and proposes a three-stage training scheme for the MoE. It trains only the MoE in the third stage, i.e., the instruction tuning stage. MoE-LLaVA has 4 experts and selects the Top-2 experts to handle tokens, and we refer to this configuration as MoE-4-Top-2. Building on MoE-LLaVA, we add a novel regularization loss $\mathcal{L}_{\text{token}}$ during the instruction tuning stage to enhance the MoE. For the language model backbone, we use StableLM-1.6B and Phi2-2.7B, following MoE-LLaVA (Lin et al., 2024).

4.2 Image Understanding Evaluation

Image Question Answering: We evaluate the performance of our method on five image question-answering benchmarks, as shown in Table 1, and report the number of activated parameters as a measure of efficiency. Compared to MoE-LLaVA (Lin et al., 2024), our method demonstrates superior image understanding capabilities, increasing performance by 0.2%, 0.6%, 1.5%, and 0.6% on VQA$^{\text{v2}}$, GQA, VizWiz, and VQA$^{\text{T}}$, respectively, when using StableLM-1.6B as the language model backbone. When the language model backbone is set to Phi2-2.7B, we also observe a performance increase on four of the five benchmarks (all except SQA$^{\text{I}}$).

Benchmark Toolkit: To comprehensively evaluate the multi-modal understanding capabilities of our method, we assess its performance across four benchmark toolkits. These toolkits typically involve open-ended answers and serve as tools to verify the model's ability to engage in natural language questioning. As shown in Table 1, our method surpasses the baseline MoE-LLaVA (Lin et al., 2024) by 0.2%, 0.5%, and 1.3% on POPE, MMB, and MM-Vet, respectively, when using StableLM-1.6B as the language model backbone. These experimental results further demonstrate the superiority of our method over existing MoE systems.
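The gains quoted in this subsection can be recomputed from the StableLM-1.6B rows of Table 1. The short Python sketch below does this arithmetic; the numbers are copied from Table 1, while the variable names and plain-text benchmark labels are ours.

```python
# Scores copied from the StableLM-1.6B (S-1.6B) rows of Table 1.
benchmarks = ["VQAv2", "GQA", "VizWiz", "SQA-I", "VQA-T", "POPE", "MME", "MMB", "MM-Vet"]
baseline = [76.7, 60.3, 36.2, 62.6, 50.1, 85.7, 1318.2, 60.2, 26.9]  # MoE-LLaVA
ours = [76.9, 60.9, 37.7, 62.6, 50.7, 85.9, 1355.1, 60.7, 28.2]      # Our Method

# Absolute difference per benchmark, rounded to one decimal as in the table.
gains = {name: round(o - b, 1) for name, b, o in zip(benchmarks, baseline, ours)}

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}")
```

The differences are absolute score differences, i.e. percentage points for the accuracy-style benchmarks and raw score points for MME.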

4.3 Object Hallucination Evaluation

We adopt the POPE evaluation pipeline (Li et al., 2023a), a polling-based query method, to assess the object hallucination capabilities of our method. With 2.2 billion activated parameters, our method surpasses MoE-LLaVA (Lin et al., 2024) by 1.0% in adversarial sampling, 1.5% in popular sampling, and 0.8% in random sampling, as presented in Table 2. This demonstrates that our method can provide more accurate feedback relevant to the given questions.

4.4 Ablation Study

In this section, we complete all experiments using StableLM-1.6B as the language model backbone.

Table 3: Ablation study about Conflicting Token Identification. Settings for results in Table 1 are highlighted in blue. The best results are indicated by boldface.

| Strategy | GQA | VizWiz | VQA$^{\text{T}}$ | MMB | MM-Vet |
|---|---|---|---|---|---|
| cluster-based | 56.1 | 35.4 | 48.7 | 60.6 | 25.2 |
| gradient-based | 60.9 | 37.7 | 50.7 | 60.7 | 28.2 |

(a)

| Threshold | GQA | VizWiz | VQA$^{\text{T}}$ | MMB | MM-Vet |
|---|---|---|---|---|---|
| 0.1 | 60.6 | 35.1 | 50.9 | 61.3 | 25.6 |
| 0.0 | 60.9 | 37.7 | 50.7 | 60.7 | 28.2 |
| -0.1 | 60.6 | 34.9 | 50.5 | 61.4 | 25.9 |

(b)

Study about Conflicting Token Identification: Our method uses token-level gradients as the cue to identify conflicting tokens. To verify the superiority of token-level gradients over sample-level features as the cue, we compare the following approaches: (i) sample-level expert labels, based on clustering of instruction embeddings (similar to Chen et al. (2024)); (ii) token-level expert labels, based on the token-level gradients in each expert (our method). Besides, when the gradient $\mathbf{g}_n$ of the token $t_n$ and the average gradient $\mathbf{g}_{mean}$ satisfy the condition $\cos\phi_{nmean} < \tau$, we flag the token as a conflicting token. We discuss different thresholds for identifying conflicting tokens: $\tau \in \{0.1, 0.0, -0.1\}$.

As shown in Table 3, we find: (i) Using token-level gradients to identify conflicting tokens performs significantly better than using sample-level embedding clusters. For example, our method outperforms the cluster-based scheme by 4.8% on GQA. (ii) When $\tau = 0$, the performance on most datasets is the best. This is consistent with the common belief that gradients are conflicting when their cosine similarity is less than zero.

Table 4: Ablation study about Conflict Elimination Loss. Settings for results in Table 1 are highlighted in blue. The best results are indicated by boldface.

| Layer | GQA | VizWiz | VQA$^{\text{T}}$ | MMB | MM-Vet |
|---|---|---|---|---|---|
| 0-24 | 60.9 | 37.7 | 50.7 | 60.7 | 28.2 |
| 0-12 | 60.7 | 35.2 | 50.9 | 61.9 | 25.5 |
| 12-24 | 60.5 | 34.9 | 50.7 | 61.6 | 26.7 |

(c)

| $\beta$ | GQA | VizWiz | VQA$^{\text{T}}$ | MMB | MM-Vet |
|---|---|---|---|---|---|
| 0.5 | 60.5 | 35.6 | 50.5 | 61.4 | 26.9 |
| 1.0 | 60.9 | 37.7 | 50.7 | 60.7 | 28.2 |
| 2.0 | 60.6 | 35.9 | 50.9 | 60.6 | 27.2 |

(d)

Study about Conflict Elimination Loss: We propose the Conflict Elimination Loss to reduce interference among diverse data types. We explore the impact of applying the loss at different layers, with a total of 24 layers: (i) all layers (0-23); (ii) the second half of the layers (12-23); (iii) the first half of the layers (0-11). We also consider different loss weightings $\beta \in \{0.5, 1.0, 2.0\}$.

As shown in Table 4, we find: (i) Applying the proposed loss to all layers yields the best performance on most datasets, verifying its importance across all layers. (ii) The proposed loss is not sensitive to the loss weighting $\beta$, with the highest performance on most datasets when $\beta$ is set to 1.0.

Table 5: Robustness of our proposed method under different MoE configurations. In Table 1 and Table 2, we have discussed Top-2. We now discuss Top-1, i.e., selecting one expert from four experts. The best results are indicated by boldface.

Image Question Answering: VQA$^{\text{v2}}$, GQA, VizWiz, SQA$^{\text{I}}$, VQA$^{\text{T}}$. Benchmark Toolkit: POPE, MME, MMB, MM-Vet.

| Method | LLM | Act. | VQA$^{\text{v2}}$ | GQA | VizWiz | SQA$^{\text{I}}$ | VQA$^{\text{T}}$ | POPE | MME | MMB | MM-Vet |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MoE-LLaVA | S-1.6B | 2.0B | 74.5 | 58.6 | 25.7 | 55.8 | 45.0 | 85.2 | 1245.3 | 56.2 | 27.2 |
| Our Method | S-1.6B | 2.0B | 74.9 | 59.4 | 27.4 | 57.5 | 46.5 | 85.8 | 1276.8 | 56.8 | 28.5 |

Robustness Verification under Various MoE Configurations: The MoE needs to select the Top-$k$ experts to handle tokens, and the choice of $k$ is especially important for MoE performance. Our method serves as a plug-in module and is robust to the hyper-parameter $k$. To show this, we present experimental results for the MoE configuration Top-1, in which only the top-scoring expert out of four is chosen to process each token.

As shown in Table 5, our method offers a stable increase in performance under Top-1, i.e., selecting one expert from four experts. This verifies that our method is applicable to different MoE expert activation configurations.
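For readers unfamiliar with the Top-$k$ setting, the following Python sketch shows a generic Top-$k$ expert selection from router logits. It is a simplified illustration under our own assumptions (softmax routing scores, no re-normalization of the selected scores) and is not the MoE-LLaVA implementation; the names `softmax` and `top_k_experts` and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Routing scores from router logits."""
    shift = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_experts(logits, k):
    """Return (expert_id, routing_score) pairs for the k highest-scoring experts."""
    scores = softmax(logits)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]


# Hypothetical router logits for one token over four experts.
example_logits = [0.1, 2.0, -1.0, 1.5]
top1 = [i for i, _ in top_k_experts(example_logits, 1)]  # MoE-4-Top-1 style selection
top2 = [i for i, _ in top_k_experts(example_logits, 2)]  # MoE-4-Top-2 style selection
print(top1, top2)
```

With Top-1 each token belongs to exactly one expert, whereas with Top-2 the same token contributes a gradient to two experts and can be flagged as conflicting in either of them.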

Figure 3: We randomly sample 3000 instances from the training dataset and deploy both the well-trained baseline model and our method. (a) We examine the mean routing score for all conflicting tokens on their corresponding experts. (b) We calculate the percentage of conflicting tokens among different layers, determined by the ratio of conflicting token count to total token count.

Statistical Verification. In this section, we randomly sample 3000 instances from the training dataset and deploy both the well-trained baseline model and our method. As shown in Figure 3, we find: (i) Our method significantly reduces the mean routing score from 0.3866 to 0.3349, thus encouraging conflicting tokens to be assigned to other experts instead of the current ones. (ii) Each layer contains approximately 20% conflicting tokens.
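The two statistics reported in Figure 3 can be sketched in Python as follows. The functions assume that, for each conflicting token, the routing scores and the current expert ID have been recorded; the names `mean_routing_score` and `conflict_percentage` and the toy records are hypothetical. Only the two mean routing scores in the last line are taken from the text.

```python
def mean_routing_score(records):
    """Mean routing score of conflicting tokens on their current experts.

    records: list of (routing_scores, current_expert_id) pairs.
    """
    return sum(scores[expert_id] for scores, expert_id in records) / len(records)


def conflict_percentage(num_conflicting, num_total):
    """Percentage of conflicting tokens in a layer."""
    return 100.0 * num_conflicting / num_total


# Toy records (hypothetical values) for two conflicting tokens.
toy_records = [([0.5, 0.3, 0.1, 0.1], 0), ([0.2, 0.4, 0.3, 0.1], 1)]
toy_mean = mean_routing_score(toy_records)
toy_pct = conflict_percentage(200, 1000)

# Relative reduction implied by the reported scores (0.3866 -> 0.3349).
relative_drop = (0.3866 - 0.3349) / 0.3866
print(toy_mean, toy_pct, round(relative_drop, 3))
```

The reported change from 0.3866 to 0.3349 corresponds to a relative reduction of roughly 13% in the mean routing score of conflicting tokens on their current experts.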

5 Conclusion and Limitations

Our study reveals that there are still severe token optimization conflicts within an expert for the MoE, leading to sub-optimal learning for the experts. To reduce optimization conflicts among tokens, we propose employing token-level gradients to identify conflicting tokens, and then adding a novel conflict elimination loss based on the routing scores. Our method acts as a plug-in, which can be easily integrated into existing Large Vision-Language Models. Extensive experiments demonstrate the superior performance of our approach across diverse datasets.

A main limitation of this work is that the performance increase brought by the proposed strategy is currently modest. A possible reason is that our method assumes severe conflicts in the training data, whereas the 665k-sample dataset we use is not diverse enough. We have observed a more significant performance increase on a large private dataset within the company. Due to computing resource constraints, we plan to expand the size of the public dataset we use for further experiments in the next few weeks.

References

  • Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
  • Bai et al. (2023a) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023a.
  • Bai et al. (2023b) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023b.
  • Bao et al. (2022) Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Songhao Piao, and Furu Wei. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems, 35:32897–32912, 2022.
  • Cha et al. (2023) Junbum Cha, Wooyoung Kang, Jonghwan Mun, and Byungseok Roh. Honeybee: Locality-enhanced projector for multimodal llm. arXiv preprint arXiv:2312.06742, 2023.
  • Chen et al. (2023a) Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023a.
  • Chen et al. (2023b) Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023b.
  • Chen et al. (2024) Shaoxiang Chen, Zequn Jie, and Lin Ma. Llava-mole: Sparse mixture of lora experts for mitigating data conflicts in instruction finetuning mllms. arXiv preprint arXiv:2401.16160, 2024.
  • Chen et al. (2023c) Wuyang Chen, Yanqi Zhou, Nan Du, Yanping Huang, James Laudon, Zhifeng Chen, and Claire Cui. Lifelong language pretraining with distribution-specialized experts. In International Conference on Machine Learning, pp.  5383–5395. PMLR, 2023c.
  • Chen et al. (2023d) Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023d.
  • Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
  • Chu et al. (2023) Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.
  • Dai et al. (2024) Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024.
  • Dai et al. (2023) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
  • Du et al. (2021) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360, 2021.
  • Fedus et al. (2022) William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1):5232–5270, 2022.
  • Fu et al. (2023) Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
  • Gou et al. (2023) Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang Xu, Aoxue Li, Dit-Yan Yeung, James T Kwok, and Yu Zhang. Mixture of cluster-conditional lora experts for vision-language instruction tuning. arXiv preprint arXiv:2312.12379, 2023.
  • Goyal et al. (2017a) Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  6904–6913, 2017a.
  • Goyal et al. (2017b) Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  6904–6913, 2017b.
  • Gurari et al. (2018) Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  3608–3617, 2018.
  • Gururangan et al. (2021) Suchin Gururangan, Mike Lewis, Ari Holtzman, Noah A Smith, and Luke Zettlemoyer. Demix layers: Disentangling domains for modular language modeling. arXiv preprint arXiv:2108.05036, 2021.
  • Hudson & Manning (2019) Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  6700–6709, 2019.
  • Komatsuzaki et al. (2022) Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints. arXiv preprint arXiv:2212.05055, 2022.
  • Kudugunta et al. (2021) Sneha Kudugunta, Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry Lepikhin, Minh-Thang Luong, and Orhan Firat. Beyond distillation: Task-level mixture-of-experts for efficient inference. arXiv preprint arXiv:2110.03742, 2021.
  • Lepikhin et al. (2020) Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.
  • Li et al. (2022) Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pp.  12888–12900. PMLR, 2022.
  • Li et al. (2023a) Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023a.
  • Li et al. (2023b) Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023b.
  • Li et al. (2023c) Yunshui Li, Binyuan Hui, ZhiChao Yin, Min Yang, Fei Huang, and Yongbin Li. Pace: Unified multi-modal dialogue pre-training with progressive and compositional experts. arXiv preprint arXiv:2305.14839, 2023c.
  • Liang et al. (2022) Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022.
  • Lin et al. (2023) Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
  • Lin et al. (2024) Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. Moe-llava: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947, 2024.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
  • Liu et al. (2023a) Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023a.
  • Liu et al. (2023b) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023b.
  • Liu et al. (2023c) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023c.
  • Liu et al. (2023d) Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023d.
  • Long et al. (2023) Zijun Long, George Killick, Richard McCreadie, and Gerardo Aragon Camarasa. Multiway-adapater: Adapting large-scale multi-modal models for scalable image-text retrieval. arXiv preprint arXiv:2309.01516, 2023.
  • Lu et al. (2022) Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
  • Ma et al. (2023) Guangyuan Ma, Xing Wu, Peng Wang, and Songlin Hu. Cot-mote: Exploring contextual masked auto-encoder pre-training with mixture-of-textual-experts for passage retrieval. arXiv preprint arXiv:2304.10195, 2023.
  • Microsoft (2023) Microsoft. Phi-2: The surprising power of small language models. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models, 2023.
  • OpenAI (2023) OpenAI. Gpt-4 technical report, 2023.
  • Penedo et al. (2023) Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
  • Satar et al. (2022) Burak Satar, Hongyuan Zhu, Hanwang Zhang, and Joo Hwee Lim. Rome: Role-aware mixture-of-expert transformer for text-to-video retrieval. arXiv preprint arXiv:2206.12845, 2022.
  • Shazeer et al. (2017) Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
  • Shen et al. (2023) Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, and Yuxiong He. Scaling vision-language models with sparse mixture of experts. arXiv preprint arXiv:2303.07226, 2023.
  • Singh et al. (2019a) Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8317–8326, 2019a.
  • Singh et al. (2019b) Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8317–8326, 2019b.
  • Sun et al. (2023) Tianxiang Sun, Xiaotian Zhang, Zhengfu He, Peng Li, Qinyuan Cheng, Hang Yan, Xiangyang Liu, Yunfan Shao, Qiong Tang, Xingjian Zhao, et al. Moss: Training conversational language models from synthetic data. arXiv preprint arXiv:2307.15020, 2023.
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7, 2023.
  • Team (2023) InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities, 2023.
  • (53) Stability AI Language Team. Stable lm 2 1.6b. URL https://huggingface.co/stabilityai/stablelm-2-1.6b.
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
  • Wang et al. (2022) Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  • Yang et al. (2023) Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023.
  • Ye et al. (2023) Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
  • Yu et al. (2023a) Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023a.
  • Yu et al. (2023b) Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023b.
  • Zhang et al. (2023a) Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Hang Yan, et al. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023a.
  • Zhang et al. (2023b) Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023b.
  • Zhao et al. (2023) Bo Zhao, Boya Wu, and Tiejun Huang. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087, 2023.
  • Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.
  • Zhu et al. (2022) Jinguo Zhu, Xizhou Zhu, Wenhai Wang, Xiaohua Wang, Hongsheng Li, Xiaogang Wang, and Jifeng Dai. Uni-perceiver-moe: Learning sparse generalist models with conditional moes. Advances in Neural Information Processing Systems, 35:2664–2678, 2022.
  • Zoph et al. (2022) Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. St-moe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022.