∗Work done during an internship at Kuaishou Technology. †Correspondence to Xi Li.

Solving Token Gradient Conflict in Mixture-of-Experts for Large Vision-Language Model

Longrong Yang¹,∗, Dong Sheng³, Chaoxiang Cai², Fan Yang³, Size Li³, Di Zhang³, Xi Li¹,†
¹College of Computer Science and Technology, Zhejiang University
²School of Software Technology, Zhejiang University
³Kuaishou Technology

Abstract

The Mixture-of-Experts (MoE) has gained increasing attention in the study of Large Vision-Language Models (LVLMs). It uses a sparse model to replace the dense model, achieving comparable performance while activating fewer parameters during inference, thus significantly reducing the inference cost. Existing MoE methods in LVLMs encourage different experts to handle different tokens, and thus they employ a router to predict the routing for each token. However, the predictions are based solely on sample features and do not truly reveal the optimization direction of tokens. This can lead to severe optimization conflicts between different tokens within an expert. To address this problem, this paper proposes a novel method based on token-level gradient analysis. Specifically, we first use token-level gradients to identify conflicting tokens in experts. Then, we add a specialized loss tailored to eliminate conflicts among tokens within each expert. Our method can serve as a plug-in for diverse Large Vision-Language Models, and extensive experimental results demonstrate the effectiveness of our method. The code will be publicly available at https://github.com/longrongyang/STGC.

1 Introduction

Large Vision-Language Models (LVLMs) have recently demonstrated significant advancements by integrating visual processing modules into Large Language Models (LLMs), bringing LLMs closer to Artificial General Intelligence. Many recent works (Zhang et al., 2023a; Bai et al., 2023b; Zhang et al., 2023b; Zhao et al., 2023; Chen et al., 2023b) show that large model size and large dataset size are especially important to enhance intelligence, i.e., the scaling law. When the size is big enough, models even exhibit “emergent abilities,” which are present only in large models and not in small ones. Thus, a series of studies (Li et al., 2022; Dai et al., 2023; Liu et al., 2023b) have expanded the capacity of LVLMs to 13 billion parameters, leading to state-of-the-art performance on various tasks.

In realistic applications, deploying such large models requires considerable computational resources, making it extremely expensive. A popular solution for reducing the inference cost is the Mixture-of-Experts (MoE) architecture. The MoE, a form of sparsely activated model, has been verified by many works (Fedus et al., 2022; Zoph et al., 2022; Komatsuzaki et al., 2022) to achieve comparable performance with dense models while activating fewer parameters during inference. This characteristic has recently made the MoE gain traction. In the MoE, a fundamental problem is the routing of tokens. To route tokens to different experts, existing methods (Lin et al., 2024; Dai et al., 2024) typically use a router, such as a linear layer, to predict the probability of each token belonging to different experts. Each token is then dispatched to the experts with the Top-$k$ predicted probabilities. Additionally, to prevent load imbalance, existing methods usually incorporate a load-balancing loss, which aims to equalize the distribution of tokens among various experts.

Figure 1: (a) In this work, our goal is to reduce gradient conflicts among tokens within an expert. (b) Our method achieves this goal by reducing the routing scores of the identified conflicting tokens on their corresponding experts, thus encouraging these tokens to be assigned to other experts rather than their current ones. This strategy promotes further specialization of experts and leads to an increase in model performance.

Beyond balancing the load, another important goal of routing tokens to different experts is to reduce interference among diverse datasets. To achieve this, some recent works have performed clustering on instruction embeddings (Gou et al., 2023), grouping similar samples and sending them to the same expert for preliminary sample-level division. However, because routing during training is at the token level, existing methods struggle with conflicts between tokens within samples. Meanwhile, samples with similar features can have distinct optimization objectives, thus leading to conflicts. The gradient directly indicates the optimization direction, so in this work, we explore data interference in MoE through the lens of token-level gradients. As shown in Figure 1 (a), our basic idea is to reduce gradient conflicts among tokens within an expert, to address severe data interference during the learning of LVLMs under complex and real-world scenarios.

To address the token conflict problem within the MoE, we propose a novel regularization loss based on token-level gradients. Our method consists of two steps: $(i)$ Conflicting Token Identification. After processing a batch of data, we perform a backward pass to obtain the token-level gradients for each expert, without updating any model parameters. Within an expert, we take the mean of the gradients of all its tokens as the average gradient, which represents the holistic optimization direction of the expert. Tokens whose gradients have negative cosine similarity to the average gradient are identified as conflicting tokens. $(ii)$ Conflict Elimination Loss. For the conflicting tokens, we record their routing scores predicted by the router. We then reduce the routing scores of these tokens on their current experts. As shown in Figure 1 (b), this strategy encourages routing conflicting tokens to other experts, reducing interference among diverse data.

In conclusion, our contribution can be summarized as:

• Beyond relying on sample-level cues, we propose using token-level gradients to identify conflicts among tokens within an expert.

• We propose a novel conflict elimination loss to resolve conflicts among tokens within an expert, promoting the further specialization of experts.

• Designed as a plug-in, our method can be seamlessly integrated into existing Large Vision-Language Models (LVLMs). Extensive experiments confirm its effectiveness.

2 Related Works

2.1 Large Vision-language Model

Large Language Models (LLMs) have demonstrated strong instruction following and generalization capabilities. To maintain these capabilities while incorporating visual information, Large Vision-Language Models (LVLMs) such as GPT-4 and LLaVA utilize frozen visual encoders and trainable visual projectors to integrate visual data into LLMs. They typically encode visual information into visual tokens and use these tokens to condition the adaptation of language tokens within LLMs (OpenAI, 2023; Touvron et al., 2023a; Wei et al., 2022; Touvron et al., 2023b; Zheng et al., 2023; Team, 2023; Sun et al., 2023; Du et al., 2021; Bai et al., 2023a; Yang et al., 2023; Penedo et al., 2023; Taori et al., 2023). Recent works have focused on improving performance through two types of methods. The first type optimizes training strategies, e.g., (Bai et al., 2023b; Chen et al., 2023a). Most works belong to the second type, focusing on enhancing visual components, including expanding visual instruction-tuning datasets (Liu et al., 2023a; Zhang et al., 2023b), improving image encoders (Chen et al., 2023d; Bai et al., 2023b), and aligning the input and projection layers (Lin et al., 2023; Cha et al., 2023; Alayrac et al., 2022; Dai et al., 2023; Ye et al., 2023; Zhao et al., 2023). These efforts, particularly the expansion of visual instruction-tuning datasets and the increase in model scales, have significantly enhanced the visual understanding abilities of LVLMs.

2.2 Mixture-of-Experts (MoE)

The Mixture-of-Experts (MoE) is a hybrid model, consisting of multiple sub-models known as experts, and has shown potential in scaling up models (Shazeer et al., 2017). The key concept of MoE lies in the use of a router to determine the token set that each expert handles, aiming to reduce interference among tokens from different types of samples. Early MoE works have utilized the hard routing mode, where each expert is typically assigned a specific role. For example, a series of works (Bao et al., 2022; Long et al., 2023; Satar et al., 2022; Wang et al., 2022; Shen et al., 2023) consider language and vision gaps in multi-modal data (Liang et al., 2022), decoupling experts by modal category and assigning a specific role to each expert. The key feature of hard routers is that they eliminate the need to learn routing assignments. Hard routing has also been widely applied in task-specific MoEs (Li et al., 2023c; Zhu et al., 2022; Ma et al., 2023; Kudugunta et al., 2021).

In contrast, soft routers enable a dynamic allocation of tokens among different experts, allowing each expert to focus on its expertise and achieving model sparsity. Recent LLM (Shazeer et al., 2017; Lepikhin et al., 2020; Fedus et al., 2022; Zoph et al., 2022; Komatsuzaki et al., 2022) and LVLM works have mainly focused on soft routers. For instance, GShard (Lepikhin et al., 2020) incorporates MoE into transformers and achieves excellent performance. Lifelong-MoE (Chen et al., 2023c) uses MoE to mitigate the challenge of catastrophic forgetting in lifelong learning. MoE-LLaVA (Lin et al., 2024) and LLaVA-MoLE (Chen et al., 2024) utilize MoE and its variants to empower LVLMs. The approach of expert segmentation increases the number of experts to achieve greater specialization. DeepSeekMoE (Dai et al., 2024) and QwenMoE (Bai et al., 2023a) segment experts by splitting the FFN intermediate hidden dimension. Cluster-based methods, such as MoCLE (Gou et al., 2023), cluster samples and then route those in the same cluster to the same expert. DEMIX (Gururangan et al., 2021) clusters samples according to their task type, ensuring that samples in the same cluster are routed to the same expert. Existing methods mainly operate at the sample level and rely solely on either features or labels, making it challenging to address conflicts between tokens within the same sample. This work aims to use token-level gradients for identifying and solving token optimization conflicts within an expert in the MoE.

3 Methodology

3.1 Overview

Large Vision-Language Model: A Large Vision-Language Model (LVLM) aims to effectively integrate the capabilities of the pre-trained LLM and a visual model. Specifically, given an RGB image $\mathbf{v}\in\mathbb{R}^{H\times W\times 3}$, where $H$ and $W$ are its height and width, the vision encoder processes the input image to obtain a visual token sequence $\mathcal{Z}=[z_{1},z_{2},\cdots,z_{P}]\in\mathbb{R}^{P\times C}$, where $P$ is the sequence length of visual tokens, calculated as $P=\frac{H\times W}{14^{2}}$. A visual projection layer is then used to map $\mathcal{Z}\in\mathbb{R}^{P\times C}$ to $\mathcal{V}\in\mathbb{R}^{P\times D}$, where $D$ represents the hidden size of the Large Language Model (LLM). Similarly, the text undergoes word embedding by layer $g$ and is projected to obtain the sequence tokens $\mathcal{T}=[t_{1},t_{2},\cdots,t_{N}]\in\mathbb{R}^{N\times D}$, where $N$ represents the sequence length of text tokens. Subsequently, the visual and text tokens are concatenated together and fed into a large language model. This model consists of stacked multi-head self-attention (MSA) and feed-forward neural networks (FFN), with layer normalization (LN) and residual connections typically used within each block:

$$\mathbf{x}_{0}=[v_{1},v_{2},\cdots,v_{P},\cdots,t_{1},t_{2},\cdots,t_{N}], \tag{1}$$

$$\mathbf{x}_{\ell}^{\prime}=\mathrm{MSA}(\mathrm{LN}(\mathbf{x}_{\ell-1}))+\mathbf{x}_{\ell-1},\quad \ell\in\{1,\ldots,L\}, \tag{2}$$

$$\mathbf{x}_{\ell}=\mathrm{FFN}(\mathrm{LN}(\mathbf{x}_{\ell}^{\prime}))+\mathbf{x}_{\ell}^{\prime},\quad \ell\in\{1,\ldots,L\}, \tag{3}$$

$$\mathcal{Y}=\mathrm{LN}(\mathbf{x}_{L}), \tag{4}$$

where $L$ is the number of layers of the LLM. The LVLM generates an output text sequence $\mathcal{Y}=[y_{1},y_{2},\cdots,y_{K}]\in\mathbb{R}^{K\times D}$ by progressively generating each element, where $K=P+N$ represents the total length of the output sequence. Then, the outputs are optimized through a generative loss in an auto-regressive manner. The loss is formulated as:

$$\mathcal{L}_{\text{main}}=-\sum_{i=1}^{N}\log p_{\theta}\left(\mathcal{Y}^{[P+i]}\mid\mathcal{V},\mathcal{T}^{[:i-1]}\right), \tag{5}$$

where $\theta$ denotes the trainable parameters. The auto-regressive loss for the token $t_{n}$ is abbreviated as $\mathcal{L}_{n}(\theta)$.
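As a rough illustration of the two quantities above, the following sketch computes the visual sequence length $P$ for a hypothetical 336×336 input and evaluates an auto-regressive loss as a sum of negative log-probabilities over the text positions. The function names, the image size, and the probabilities are invented for illustration and do not come from the paper.

```python
import math

def num_visual_tokens(height, width, patch=14):
    """P = (H * W) / patch^2, assuming H and W are multiples of the patch size."""
    assert height % patch == 0 and width % patch == 0
    return (height // patch) * (width // patch)

def autoregressive_loss(target_probs):
    """Sum of -log p over the probabilities assigned to each ground-truth token."""
    return -sum(math.log(p) for p in target_probs)

# Hypothetical 336x336 image: 24 patches per side.
P = num_visual_tokens(336, 336)

# Hypothetical probabilities the model assigns to three ground-truth text tokens.
loss = autoregressive_loss([0.5, 0.25, 0.8])
```

Because the loss is additive over positions, the contribution of a single token $t_{n}$ can be read off as one term of the sum, which is what the per-token loss $\mathcal{L}_{n}(\theta)$ refers to.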

MoE: The Mixture-of-Experts (MoE) layer is used to replace the FFN layer, e.g., (Dai et al., 2024). An MoE layer consists of multiple FFNs, each representing an expert, i.e., $\mathcal{E}=[e_{1},e_{2},\cdots,e_{E}]$, where $E$ is the number of experts. The router is typically a linear layer that predicts the probability of each token being assigned to each expert, and we formulate this process as:

$$p_{\text{moe}}(\mathbf{x})_{i}=\frac{e^{z_{\text{moe}}(\mathbf{x})_{i}}}{\sum_{j=1}^{E}e^{z_{\text{moe}}(\mathbf{x})_{j}}}, \tag{6}$$

where $z_{\text{moe}}(\mathbf{x})=\mathbf{W}\cdot\mathbf{x}$ and $p_{\text{moe}}(\mathbf{x})_{i}$ is the routing score of $\mathbf{x}$ for the $i$-th expert. The matrix $\mathbf{W}\in\mathbb{R}^{D\times E}$ contains the lightweight trainable parameters of the router. We calculate a weighted sum of the outputs from the Top-$k$ experts with the highest softmax probabilities, where the weighting of each expert is related to the routing score:

$$\begin{aligned}
w_{\text{moe}}(\mathbf{x})_{i}&=\frac{e^{z_{\text{moe}}(\mathbf{x})_{i}}}{\sum_{j=1}^{k}e^{z_{\text{moe}}(\mathbf{x})_{j}}},\\
\mathrm{MoE}(\mathbf{x})&=\sum_{i=1}^{k}w_{\text{moe}}(\mathbf{x})_{i}\cdot e_{i}(\mathbf{x}),
\end{aligned} \tag{7}$$

where $w_{\text{moe}}(\mathbf{x})_{i}$ represents the weight of the $i$-th expert for $\mathbf{x}$, and $e_{i}(\mathbf{x})$ is the output of the $i$-th expert. We express $\mathcal{L}_{n}(\theta)$ as $\mathcal{L}_{n}(\theta_{e_{i}},\theta^{\prime})$, where $\theta_{e_{i}}$ denotes the parameters of the $i$-th expert, and $\theta^{\prime}$ represents all other parameters except for $\theta_{e_{i}}$.
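The routing in Eqs. (6) and (7) can be sketched in a few lines of plain Python. This is a toy sketch under stated assumptions: the experts are hypothetical scalar functions rather than FFNs, the routing logits are given directly instead of being produced by $\mathbf{W}$, and the helper names `softmax`, `top_k_route`, and `moe_output` are invented for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(logits, k):
    """Return (expert_ids, weights, scores).

    scores  : Eq. (6), softmax over all E experts.
    weights : Eq. (7), softmax renormalized over the k selected experts only.
    """
    scores = softmax(logits)
    ids = sorted(range(len(logits)), key=lambda i: scores[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in ids])
    return ids, weights, scores

def moe_output(x, experts, logits, k):
    """Weighted sum of the outputs of the Top-k experts."""
    ids, weights, _ = top_k_route(logits, k)
    return sum(w * experts[i](x) for i, w in zip(ids, weights))

# Four toy experts that simply scale their input (hypothetical stand-ins for FFNs).
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
router_logits = [2.0, 0.0, 1.0, -1.0]

ids, weights, scores = top_k_route(router_logits, k=2)
y = moe_output(1.0, experts, router_logits, k=2)
```

With these logits, experts 0 and 2 are selected, and their two weights sum to one even though their routing scores over all four experts do not.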

Our Method: In this work, our goal is to propose a novel learning strategy for the Mixture-of-Experts (MoE) to reduce interference among diverse data. Specifically, as illustrated in Figure 2, we model the interference among tokens within an expert using token-level gradients, and then design a novel loss function that requires tokens with conflicting gradients to be handled by different experts. The details of these modules will be introduced in the subsequent sections.

3.2 Conflicting Token Identification

For the MoE, the key to reducing interference among diverse data is preventing optimization conflicts between tokens within an expert. One approach is to cluster samples based on their features and use the cluster results to decide which expert the samples should be assigned to. Alternatively, decisions are made based on the specific task associated with each sample. However, these methods have two main limitations: $(i)$ They operate at the sample level, whereas the routing is at the token level; routing all tokens within a sample to the same expert does not effectively address conflicts between tokens within the sample. $(ii)$ The optimization direction is jointly influenced by features and labels, but these methods rely on only one of these factors. To address these issues, we propose using token-level gradients, which can accurately depict optimization directions at the token level, to identify optimization conflicts between tokens within an expert.

First, we introduce the negative impact brought by gradient conflicts. Without loss of generality, we discuss two distinct text tokens, $t_{n}$ and $t_{n^{\prime}}$, as shown in Figure 2 (a). Assume that both $t_{n}$ and $t_{n^{\prime}}$ are processed by the expert $e_{i}$. Let $\mathbf{g}_{n}=\nabla_{\theta_{e_{i}}}\mathcal{L}_{n}(\theta_{e_{i}},\theta^{\prime})$ denote the gradient of the loss of token $t_{n}$ with respect to the expert parameters $\theta_{e_{i}}$. A small change in $\theta_{e_{i}}$ in the direction of $-\mathbf{g}_{n}$ is given by $\theta_{e_{i}}\leftarrow\theta_{e_{i}}-\delta\mathbf{g}_{n}$, with a step size $\delta$. The effect of this change on the performance of another token $t_{n^{\prime}}$ is measured by:

$$\Delta\mathcal{L}_{n^{\prime}}=\mathcal{L}_{n^{\prime}}(\theta_{e_{i}}-\delta\mathbf{g}_{n},\theta^{\prime})-\mathcal{L}_{n^{\prime}}(\theta_{e_{i}},\theta^{\prime})=-\delta\mathbf{g}_{n}\cdot\mathbf{g}_{n^{\prime}}+o(\delta), \tag{8}$$

where the second equality is obtained by first-order Taylor approximation. Likewise, the effect of an update of $\theta_{e_{i}}$ in the direction of the negative gradient of token $n^{\prime}$ (i.e., $-\mathbf{g}_{n^{\prime}}$) on the performance of token $n$ is $\Delta\mathcal{L}_{n}=-\delta\mathbf{g}_{n}\cdot\mathbf{g}_{n^{\prime}}+o(\delta)$. Thus, the model update for token $n$ is considered to negatively affect token $n^{\prime}$ when $\mathbf{g}_{n}\cdot\mathbf{g}_{n^{\prime}}<0$, since it increases the loss of token $n^{\prime}$, and vice versa. We define $\mathbf{g}_{n}$ and $\mathbf{g}_{n^{\prime}}$ as conflicting gradients when their cosine similarity $\cos{\phi_{nn^{\prime}}}<\tau$, where $\tau$ is a threshold and $\phi_{nn^{\prime}}$ is the angle between $\mathbf{g}_{n}$ and $\mathbf{g}_{n^{\prime}}$. Gradient conflicts can cause the optimizer to struggle to converge to a desirable solution, especially when there is a large difference in gradient magnitudes.
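The first-order argument can be checked numerically on a toy problem. The sketch below assumes two hypothetical quadratic losses sharing a two-dimensional parameter vector (these losses, targets, and the step size are illustrative choices, not the paper's setup) and compares the actual change in the second loss with the prediction $-\delta\,\mathbf{g}_{n}\cdot\mathbf{g}_{n^{\prime}}$.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def quad_loss(theta, target):
    """Toy per-token loss: 0.5 * ||theta - target||^2."""
    return 0.5 * sum((t - c) ** 2 for t, c in zip(theta, target))

def quad_grad(theta, target):
    """Gradient of quad_loss with respect to theta."""
    return [t - c for t, c in zip(theta, target)]

theta = [0.0, 0.0]              # shared "expert" parameters
target_n = [1.0, 0.0]           # token n pulls theta one way
target_n2 = [-1.0, 0.2]         # token n' pulls it roughly the opposite way

g_n = quad_grad(theta, target_n)
g_n2 = quad_grad(theta, target_n2)

delta = 0.01
theta_new = [t - delta * g for t, g in zip(theta, g_n)]   # step along -g_n

actual = quad_loss(theta_new, target_n2) - quad_loss(theta, target_n2)
predicted = -delta * dot(g_n, g_n2)                       # first-order term
```

Here the two gradients have a negative dot product, so the step that helps token $n$ raises the loss of token $n^{\prime}$, and the measured increase agrees with the first-order prediction up to a term of order $\delta^{2}$.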

We then define the conflicting token. The parameters of the expert $e_{i}$ are updated using the average gradient. Let the tokens processed by the expert $e_{i}$ be denoted as $\{t_{1},\cdots,t_{N_{e_{i}}}\}$; the average gradient on the expert $e_{i}$ is then:

$$\mathbf{g}_{mean}=\frac{\sum_{n=1}^{N_{e_{i}}}\mathbf{g}_{n}}{N_{e_{i}}}. \tag{9}$$

The average gradient indicates the direction of parameter updates for the expert at each iteration. When the gradient of a token and the average gradient are conflicting gradients, it suggests that the token is detrimental to the learning of the expert $e_{i}$, so this token should be considered for assignment to another expert. A formal definition of a conflicting token is provided as follows:

Definition 1 (Conflicting Token)

The token $t_{n}$ is said to be a conflicting token if $\mathbf{g}_{n}$ and $\mathbf{g}_{mean}$ are conflicting gradients, where $\mathbf{g}_{mean}$ is the average gradient of all tokens in the expert of $t_{n}$.

Lastly, we detail our method for identifying conflicting tokens, as illustrated in Figure 2(a). Initially, we unfreeze only the expert layer in the MoE and compute the main loss. We then perform back-propagation to calculate the token-level gradients for each token within the expert layer. Subsequently, we calculate the average gradient, as well as the cosine similarity between the gradient of each token and the average gradient. Finally, any token whose cosine similarity is less than $\tau$ is marked as a conflicting token. Identifying these tokens allows us to use the precise optimization direction represented by the gradients to reduce interference among diverse data in the next section.
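The identification step can be sketched as follows, assuming the per-token gradients of one expert have already been obtained and flattened into plain vectors (in practice they would come from an automatic differentiation framework; that part is not shown). The function name `find_conflicting_tokens` and the example gradients are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    nu, nv = math.sqrt(dot(u, u)), math.sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu > 0 and nv > 0 else 0.0

def find_conflicting_tokens(token_grads, tau=0.0):
    """token_grads: per-token gradient vectors for a single expert.

    Returns the indices of tokens whose cosine similarity with the
    average gradient (Eq. 9) is below the threshold tau.
    """
    n = len(token_grads)
    dim = len(token_grads[0])
    g_mean = [sum(g[d] for g in token_grads) / n for d in range(dim)]
    return [i for i, g in enumerate(token_grads) if cosine(g, g_mean) < tau]

# Three tokens agree on the update direction; the last one opposes it.
grads = [[1.0, 0.0], [0.8, 0.2], [0.9, -0.1], [-1.0, 0.1]]
conflicting = find_conflicting_tokens(grads, tau=0.0)
```

In this toy batch only the last token is flagged. Raising `tau` above zero makes the criterion stricter and flags more tokens, while lowering it flags fewer.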

3.3 Conflict Elimination Loss

The learning of a conflicting token tends to increase the loss of most other tokens within its corresponding expert. Thus, once a conflicting token is identified, it should be reassigned to a different expert for processing. To achieve this goal, we propose a simple yet effective regularization loss by constraining the routing scores predicted by the router, as shown in Figure 2(b).

Specifically, for each expert within every layer, we first identify the conflicting tokens using token-level gradients. Then, the router predicts the probability of each token being assigned to different experts. For a conflicting token $t_{n}$ , we record the routing logits $z_{\text{moe}}(t_{n})$ , the routing scores $p_{\text{moe}}(t_{n})$ , and the expert ID $id_{\text{moe}}$ it is currently assigned to. Using the expert ID $id_{\text{moe}}$ , we calculate the loss:

$$\begin{aligned}
z^{\prime}_{\text{moe}}(t_{n})&=-z_{\text{moe}}(t_{n}),\\
p^{\prime}_{\text{moe}}(t_{n})_{i}&=\frac{e^{z^{\prime}_{\text{moe}}(t_{n})_{i}}}{\sum_{j=1}^{E}e^{z^{\prime}_{\text{moe}}(t_{n})_{j}}},\\
\mathcal{L}_{\text{token}}&=-\frac{1}{N\cdot E}\sum_{n=1}^{N}\sum_{i=1}^{E}\log\left(p^{\prime}_{\text{moe}}(t_{n})_{i}\right)\cdot q_{\text{moe}}(t_{n})_{i},
\end{aligned} \tag{10}$$

where $N$ is the number of conflicting tokens, $E$ is the number of experts, and $p^{\prime}_{\text{moe}}(t_{n})$ represents the inverted routing score for the token $t_{n}$. Each $q_{\text{moe}}(t_{n})$ is a one-hot vector with $q_{\text{moe}}(t_{n})_{id_{\text{moe}}}=1$. This loss is designed to encourage the reassignment of conflicting tokens to different experts. When $k$ in Top-$k$ exceeds 1, a token may be assigned to multiple experts. Our method considers all tokens within each expert, regardless of whether a token has also been assigned to other experts.
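A minimal sketch of this loss in plain Python is given below. It is written as a cross-entropy between the inverted routing distribution and the one-hot vector of the current expert, i.e., with a leading minus sign, so that minimizing it lowers the routing score on the current expert; that sign convention, the function names, and the example logits are assumptions of the sketch rather than details confirmed by the paper.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def conflict_elimination_loss(records, num_experts):
    """records: list of (routing_logits, current_expert_id), one per conflicting token.

    Negates the logits, applies a softmax, and penalizes the negative
    log-probability of the current expert, averaged over N * E.
    """
    if not records:
        return 0.0
    total = 0.0
    for logits, expert_id in records:
        inverted = log_softmax([-z for z in logits])   # log p'_moe(t_n)
        total += -inverted[expert_id]                  # only the one-hot entry contributes
    return total / (len(records) * num_experts)

# Same token, assigned to expert 0, with a high versus a low logit on that expert.
high = conflict_elimination_loss([([2.0, 0.0, 0.0, 0.0], 0)], num_experts=4)
low = conflict_elimination_loss([([-2.0, 0.0, 0.0, 0.0], 0)], num_experts=4)
```

The loss is smaller when the logit on the current expert is lower, so gradient descent on it pushes the router away from the expert in which the token was found to be conflicting.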

3.4 Total Loss

To encourage experts to handle tokens in a balanced manner, the differentiable load balancing loss, as introduced in (Fedus et al., 2022), is typically defined for each MoE layer as follows:

$$\mathcal{L}_{\text{aux}}=E\cdot\sum_{i=1}^{E}\mathcal{F}_{i}\cdot\mathcal{P}_{i}, \tag{11}$$

where $\mathcal{F}_{i}$ represents the fraction of tokens processed by expert $e_{i}$, and $\mathcal{P}_{i}$ represents the average routing probability assigned to expert $e_{i}$.

In conclusion, the total loss is given by:

$$\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{main}}+\alpha\cdot\mathcal{L}_{\text{aux}}+\beta\cdot\mathcal{L}_{\text{token}}, \tag{12}$$

where $\alpha$ and $\beta$ are hyper-parameters.
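To make Eqs. (11) and (12) concrete, the sketch below computes the load-balancing loss for a single MoE layer under Top-1 assignment and combines the three terms. The helper names, the toy assignments, and the value used for $\alpha$ are placeholders chosen for illustration; they are not settings reported in the paper.

```python
def load_balancing_loss(assignments, routing_probs, num_experts):
    """E * sum_i F_i * P_i for one MoE layer.

    assignments   : expert id chosen for each token (Top-1, for simplicity).
    routing_probs : per-token routing scores over all experts.
    """
    n = len(assignments)
    frac = [assignments.count(i) / n for i in range(num_experts)]            # F_i
    mean_prob = [sum(p[i] for p in routing_probs) / n
                 for i in range(num_experts)]                                # P_i
    return num_experts * sum(f * p for f, p in zip(frac, mean_prob))

def total_loss(l_main, l_aux, l_token, alpha, beta):
    """Weighted sum of the main, load-balancing, and conflict elimination losses."""
    return l_main + alpha * l_aux + beta * l_token

# Perfectly balanced routing over four experts.
uniform = load_balancing_loss([0, 1, 2, 3], [[0.25] * 4] * 4, num_experts=4)

# Every token sent to expert 0 with full confidence.
collapsed = load_balancing_loss([0, 0, 0, 0], [[1.0, 0.0, 0.0, 0.0]] * 4, num_experts=4)

# Placeholder loss values and a placeholder alpha.
combined = total_loss(l_main=2.0, l_aux=uniform, l_token=0.5, alpha=0.01, beta=1.0)
```

Balanced routing gives the minimum value of 1, while collapsing all tokens onto one expert gives $E$, which is why the term discourages load imbalance.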

4 Experiments

Table 1: Comparison among different LVLMs on image understanding benchmarks. “Act.”, “V”, “S”, “Q”, “P”, and “M” represent activated parameters, Vicuna (Chiang et al., 2023), StableLM (Team, ), Qwen (Bai et al., 2023a), Phi-2 (Microsoft, 2023), and MobileLLaMA (Chu et al., 2023), respectively. Evaluation benchmarks include VQA$^{\text{v2}}$ (Goyal et al., 2017a); GQA (Hudson & Manning, 2019); VizWiz (Gurari et al., 2018); SQA$^{\text{I}}$: ScienceQA-IMG (Lu et al., 2022); VQA$^{\text{T}}$: TextVQA (Singh et al., 2019b); POPE (Li et al., 2023a); MME (Fu et al., 2023); MMB: MMBench (Liu et al., 2023d); MM-Vet (Yu et al., 2023a). ^∗ denotes that there is some overlap in the training data. The best results are indicated by boldface.

Method	LLM	Act.	Image Question Answering					Benchmark Toolkit
Method	LLM	Act.	VQA$^{\text{v2}}$	GQA	VizWiz	SQA$^{\text{I}}$	VQA$^{\text{T}}$	POPE	MME	MMB	MM-Vet
Dense Model
LLaVA-1.5	V-13B	13B	80.0^∗	63.3^∗	53.6	71.6	61.3	85.9	1531.3	67.7	35.4
Qwen-VL	Q-7B	6.7B	78.8^∗	59.3^∗	35.2	67.1	63.8	-	-	38.2	-
LLaVA-1.5	V-7B	6.7B	78.5^∗	62.0^∗	50.0	66.8	58.2	85.9	1510.7	63.4	30.5
TinyGPT-V	P-2.7B	2.7B	-	33.6^∗	33.4	-	-	-	-	-	-
MobileVLM	M-2.7B	2.7B	-	59.0^∗	-	61.0	47.5	84.9	1288.9	59.6	-
LLaVA-Phi	P-2.7B	2.7B	71.4^∗	-	35.9	68.4	48.6	85.0	1335.1	59.8	28.9
Sparse Model
MoE-LLaVA	S-1.6B	2.0B	76.7^∗	60.3^∗	36.2	62.6	50.1	85.7	1318.2	60.2	26.9
Our Method	S-1.6B	2.0B	76.9^∗	60.9^∗	37.7	62.6	50.7	85.9	1355.1	60.7	28.2
MoE-LLaVA	P-2.7B	3.6B	77.6^∗	61.4^∗	43.9	68.5	51.4	86.3	1423.0	65.2	34.3
Our Method	P-2.7B	3.6B	78.0^∗	62.1^∗	47.2	68.1	52.3	86.9	1429.2	66.7	33.3

Table 2: Zero-shot object hallucination evaluation results. “Yes” means the proportion of positive responses to the given question.

Method	LLM	Act.	Adversarial			Popular			Random
Method	LLM	Act.	Acc	F1-Score	Yes	Acc	F1-Score	Yes	Acc	F1-Score	Yes
Dense Model
mPLUG-Owl	L-7B	6.7B	82.4	81.6	45.2	85.5	84.3	42.1	86.3	85.3	42.3
MM-GPT	L-7B	6.7B	50.0	66.7	100.0	50.0	66.7	100.0	50.0	66.7	100.0
LLaVA-1.5	V-13B	13B	85.5	84.4	43.3	87.4	86.2	41.3	88.0	87.1	41.7
Sparse Model
MoE-LLaVA	S-1.6B	2.0B	86.9	85.7	41.7	85.3	84.2	43.5	88.0	87.1	41.6
Our Method	S-1.6B	2.0B	85.0	84.1	44.4	87.2	86.1	42.2	88.2	87.4	42.1
MoE-LLaVA	P-2.7B	3.6B	85.9	84.9	43.2	87.5	86.4	41.8	88.5	87.7	41.8
Our Method	P-2.7B	3.6B	86.5	85.5	43.4	88.0	86.9	41.9	89.0	88.2	41.8

4.1 Experimental Setup

Benchmark: In this work, we follow existing works (Liu et al., 2023c; Lin et al., 2024) to evaluate our method. Our method is only used in the instruction tuning stage, which uses the LLaVA 1.5-mix-665k dataset (Liu et al., 2023c). Evaluation covers a collection of academic-task-oriented benchmarks and recent benchmarks specifically designed for instruction-following Large Multimodal Models (LMMs). For academic-task-oriented benchmarks, VQA-v2 (Goyal et al., 2017b) and GQA (Hudson & Manning, 2019) assess the model's visual perception capabilities through open-ended short answers. The VizWiz dataset (Gurari et al., 2018), containing 8,000 images, evaluates the model's zero-shot generalization on visual questions asked by visually impaired people. ScienceQA (Lu et al., 2022), a multiple-choice benchmark, evaluates the model's zero-shot generalization on scientific question answering. TextVQA (Singh et al., 2019a) focuses on text-rich visual question answering tasks.

For recent benchmarks proposed for instruction-following LMMs, POPE (Li et al., 2023b) evaluates the degree of hallucination in model responses on three sampled subsets of COCO (Lin et al., 2014): Random, Popular, and Adversarial. MME (Fu et al., 2023) assesses the model's visual perception with yes/no questions. MMBench (Liu et al., 2023d) evaluates the robustness of model answers with all-round shuffling on multiple choice answers. MM-Vet (Yu et al., 2023b) evaluates the model's capabilities in engaging in visual conversations on a diverse range of tasks, and assesses the correctness and helpfulness of the responses using the GPT-4 evaluation framework.

Baseline: Our main baseline is MoE-LLaVA (Lin et al., 2024) in this work. MoE-LLaVA incorporates a Mixture-of-Experts (MoE) into Large Vision-Language Models and has proposed a three-stage training scheme for the MoE. It trains only the MoE in the third stage, i.e., the instruction tuning stage. MoE-LLaVA has 4 experts and selects the Top-2 experts to handle tokens, and we refer to this configuration as MoE-4-Top-2. Building on MoE-LLaVA, we add a novel regularization loss $\mathcal{L}_{\text{token}}$ in this work during the instruction tuning stage to enhance the MoE. For the language model backbone, we use StableLM-1.6B and Phi2-2.7B, following MoE-LLaVA (Lin et al., 2024).

4.2 Image Understanding Evaluation

Image Question Answering: We evaluate the performance of our method on five image question-answering benchmarks, as shown in Table 1, and report the number of activated parameters as a measure of efficiency. Compared to MoE-LLaVA (Lin et al., 2024), our method demonstrates superior image understanding capabilities, increasing performance by 0.2%, 0.6%, 1.5%, and 0.6% on VQA$^{\text{v2}}$, GQA, VizWiz, and VQA$^{\text{T}}$, respectively, when using StableLM-1.6B as the language model backbone. When the language model backbone is set to Phi2-2.7B, we also observe a performance increase on most datasets.

Benchmark Toolkit: To comprehensively evaluate the multi-modal understanding capabilities of our method, we assess its performance across four benchmark toolkits. These toolkits typically involve open-ended answers and serve as tools to verify the model's ability to answer natural language questions. As shown in Table 1, our method surpasses the baseline MoE-LLaVA (Lin et al., 2024) by 0.2%, 0.5%, and 1.3% on POPE, MMB, and MM-Vet, respectively, when using StableLM-1.6B as the language model backbone. These experimental results further support the advantage of our method over the baseline MoE system.
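The gains quoted above can be reproduced directly from the StableLM-1.6B rows of Table 1. The short check below copies those two rows into dictionaries (the dictionary keys are shorthand labels chosen here) and computes the per-benchmark differences.

```python
# StableLM-1.6B rows of Table 1 (MoE-LLaVA baseline vs. our method).
baseline = {"VQAv2": 76.7, "GQA": 60.3, "VizWiz": 36.2, "SQA-I": 62.6, "VQA-T": 50.1,
            "POPE": 85.7, "MME": 1318.2, "MMB": 60.2, "MM-Vet": 26.9}
ours = {"VQAv2": 76.9, "GQA": 60.9, "VizWiz": 37.7, "SQA-I": 62.6, "VQA-T": 50.7,
        "POPE": 85.9, "MME": 1355.1, "MMB": 60.7, "MM-Vet": 28.2}

# Absolute gain per benchmark, rounded to one decimal place.
gains = {name: round(ours[name] - baseline[name], 1) for name in baseline}

# Benchmarks on which the method does not improve over the baseline.
not_improved = [name for name, g in gains.items() if g <= 0]
```

With this backbone the only benchmark without a gain is SQA$^{\text{I}}$, where both models score 62.6.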

4.3 Object Hallucination Evaluation

We adopt the POPE evaluation pipeline (Li et al., 2023a), a polling-based query method, to assess object hallucination. As presented in Table 2, with Phi2-2.7B as the backbone (3.6B activated parameters), our method surpasses MoE-LLaVA (Lin et al., 2024) in accuracy by 0.6%, 0.5%, and 0.5% under adversarial, popular, and random sampling, respectively. With StableLM-1.6B (2.0B activated parameters), accuracy improves by 1.9% and 0.2% under popular and random sampling but is 1.9% lower under adversarial sampling. Overall, this suggests that our method can provide more accurate feedback relevant to the given questions.

4.4 Ablation Study

In this section, we complete all experiments using StableLM-1.6B as the language model backbone.

Table 3: Ablation study about Conflicting Token Identification. Settings for results in Table 1 are highlighted in blue. The best results are indicated by boldface.

Strategy	GQA	VizWiz	VQA$^{\text{T}}$	MMB	MM-Vet
cluster-based	56.1	35.4	48.7	60.6	25.2
gradient-based	60.9	37.7	50.7	60.7	28.2

(a)

Threshold	GQA	VizWiz	VQA$^{\text{T}}$	MMB	MM-Vet
0.1	60.6	35.1	50.9	61.3	25.6
0.0	60.9	37.7	50.7	60.7	28.2
-0.1	60.6	34.9	50.5	61.4	25.9

(b)

Study about Conflicting Token Identification: Our method uses token-level gradients as the cue to identify conflicting tokens. To verify the superiority of token-level gradients as the cue over sample-level features, we conduct experiments on the following approaches: $(i)$ Sample-level expert labels, based on clustering of instruction embeddings (similar to Chen et al., 2024). $(ii)$ Token-level expert labels, based on the token-level gradients in each expert (our method). Besides, when the gradient $\mathbf{g}_{n}$ of the token $t_{n}$ and the average gradient $\mathbf{g}_{mean}$ satisfy the condition $\cos{\phi_{nmean}}<\tau$, we flag the token as a conflicting token. We discuss different thresholds for identifying conflicting tokens: $\tau\in\{0.1,0.0,-0.1\}$.

As shown in Table 3, we find: $(i)$ Using token-level gradients to identify conflicting tokens performs significantly better than using sample-level embedding clusters. For example, our method outperforms the cluster-based scheme by 4.8% on GQA. $(ii)$ When $\tau=0$, the performance on most datasets is the best. This is consistent with the common convention that gradients are considered conflicting when their cosine similarity is less than zero.

Table 4: Ablation study about Conflict Elimination Loss. Settings for results in Table 1 are highlighted in blue. The best results are indicated by boldface.

Layer	GQA	VizWiz	VQA$^{\text{T}}$	MMB	MM-Vet
0-23	60.9	37.7	50.7	60.7	28.2
0-11	60.7	35.2	50.9	61.9	25.5
12-23	60.5	34.9	50.7	61.6	26.7

(c)

$\beta$	GQA	VizWiz	VQA$^{\text{T}}$	MMB	MM-Vet
0.5	60.5	35.6	50.5	61.4	26.9
1.0	60.9	37.7	50.7	60.7	28.2
2.0	60.6	35.9	50.9	60.6	27.2

(d)

Study about Conflict Elimination Loss: We propose the Conflict Elimination Loss to reduce interference among diverse data types. We explore the impact of applying the loss at different layers, with a total of 24 layers: $(i)$ All layers (0-23). $(ii)$ The first half of the layers (0-11). $(iii)$ The second half of the layers (12-23). We also consider different loss weightings $\beta\in\{0.5,1.0,2.0\}$.

As shown in Table 4, we find: $(i)$ Applying the proposed loss to all layers yields the best performance on most datasets, verifying its importance across all layers. $(ii)$ The proposed loss is not sensitive to different loss weightings $\beta$ , with the highest performance on most datasets when $\beta$ is set as 1.0.

Table 5: Robustness of our proposed method under different MoE configurations. In Table 1 and Table 2, we have discussed Top-2. We now discuss Top-1, i.e., selecting one expert from four experts. The best results are indicated by boldface.

Method	LLM	Act.	Image Question Answering					Benchmark Toolkit
Method	LLM	Act.	VQA ${}^{\text{v2}}$	GQA	VisWiz	SQA ${}^{\text{I}}$	VQA ${}^{\text{T}}$	POPE	MME	MMB	MM-Vet
MoE-LLaVA	S-1.6B	2.0B	74.5	58.6	25.7	55.8	45.0	85.2	1245.3	56.2	27.2
Our Method	S-1.6B	2.0B	74.9	59.4	27.4	57.5	46.5	85.8	1276.8	56.8	28.5

Robustness Verification under Various MoE Configurations: The MoE needs to select the Top-$k$ experts to handle tokens, and the selection of $k$ is especially important for MoE performance. Our method serves as a plug-in module and should be robust to the hyper-parameter $k$ . Thus, in this section, we present experimental results for the MoE configuration Top-1, in which only the top-scoring expert out of four is chosen to process tokens.

As shown in Table 5, our method offers a stable increase in performance under the Top-1 setting, i.e., selecting one expert from four experts, improving over MoE-LLaVA on all nine benchmarks. This verifies that our method is applicable to different MoE expert activation configurations.
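For readers less familiar with the Top-$k$ setting, the sketch below shows generic Top-$k$ expert selection from router logits. It assumes a softmax router over four experts; the function names and logit values are hypothetical, and whether the selected scores are renormalized afterwards depends on the implementation and is not shown here.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_experts(router_logits, k=1):
    """Return (expert_index, routing_score) pairs for the k highest-scoring experts."""
    scores = softmax(router_logits)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(i, scores[i]) for i in ranked[:k]]

# Made-up router logits for one token over four experts.
logits = [0.2, 1.5, -0.3, 0.9]
top1 = top_k_experts(logits, k=1)  # Top-1: a single expert processes the token
top2 = top_k_experts(logits, k=2)  # Top-2: two experts process the token
```

With these logits, Top-1 selects expert 1 only, while Top-2 selects experts 1 and 3.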

Statistical Verification. In this section, we randomly sample 3000 instances from the training dataset and run both the well-trained baseline model and our method on them. As shown in Figure 3, we find: $(i)$ Our method significantly reduces the mean routing score from 0.3866 to 0.3349, thus encouraging conflicting tokens to be assigned to other experts instead of the current ones. $(ii)$ Each layer contains approximately 20% conflicting tokens.
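The two statistics above can be computed along the following lines. This is a sketch under the assumption that the reported mean routing score is averaged over conflicting tokens on their current expert; the function name and the toy inputs are hypothetical.

```python
def conflict_statistics(routing_scores, conflict_flags):
    """Return (fraction of conflicting tokens, their mean routing score).

    routing_scores: routing score of each token on its current expert.
    conflict_flags: True where the token was flagged as conflicting.
    """
    if len(routing_scores) != len(conflict_flags):
        raise ValueError("inputs must have the same length")
    conflicting = [s for s, f in zip(routing_scores, conflict_flags) if f]
    fraction = len(conflicting) / len(conflict_flags)
    # The mean is undefined when no token is flagged.
    mean_score = sum(conflicting) / len(conflicting) if conflicting else float("nan")
    return fraction, mean_score

# Made-up values for five tokens in one layer.
scores = [0.5, 0.25, 0.75, 0.5, 0.25]
flags = [False, True, False, False, False]
fraction, mean_score = conflict_statistics(scores, flags)
```

A lower mean score for the flagged tokens indicates that the router assigns them less weight on the expert where they conflict.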

5 Conclusion and Limitations

Our study reveals that there are still severe token optimization conflicts within an expert for the MoE, leading to sub-optimal learning for the experts. To reduce optimization conflicts among tokens, we propose employing token-level gradients to identify conflicting tokens, and then adding a novel conflict elimination loss based on the routing scores. Our method acts as a plug-in, which can be easily integrated into existing Large Vision-Language Models. Extensive experiments demonstrate the superior performance of our approach across diverse datasets.

A main limitation of this work is that the performance increase brought by the proposed strategy is still modest. A possible reason is that our method assumes severe conflicts in the training data, whereas the 665k training samples we use are not diverse enough. We have observed a more significant performance increase on a large private dataset within the company. Due to computational resource constraints, we plan to expand the public dataset we use for further experiments in the next few weeks.

References

Alayrac et al. (2022) Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
Bai et al. (2023a) Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023a.
Bai et al. (2023b) Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023b.
Bao et al. (2022) Hangbo Bao, Wenhui Wang, Li Dong, Qiang Liu, Owais Khan Mohammed, Kriti Aggarwal, Subhojit Som, Songhao Piao, and Furu Wei. Vlmo: Unified vision-language pre-training with mixture-of-modality-experts. Advances in Neural Information Processing Systems, 35:32897–32912, 2022.
Cha et al. (2023) Junbum Cha, Wooyoung Kang, Jonghwan Mun, and Byungseok Roh. Honeybee: Locality-enhanced projector for multimodal llm. arXiv preprint arXiv:2312.06742, 2023.
Chen et al. (2023a) Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023a.
Chen et al. (2023b) Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. Sharegpt4v: Improving large multi-modal models with better captions. arXiv preprint arXiv:2311.12793, 2023b.
Chen et al. (2024) Shaoxiang Chen, Zequn Jie, and Lin Ma. Llava-mole: Sparse mixture of lora experts for mitigating data conflicts in instruction finetuning mllms. arXiv preprint arXiv:2401.16160, 2024.
Chen et al. (2023c) Wuyang Chen, Yanqi Zhou, Nan Du, Yanping Huang, James Laudon, Zhifeng Chen, and Claire Cui. Lifelong language pretraining with distribution-specialized experts. In International Conference on Machine Learning, pp. 5383–5395. PMLR, 2023c.
Chen et al. (2023d) Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Zhong Muyan, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. arXiv preprint arXiv:2312.14238, 2023d.
Chiang et al. (2023) Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
Chu et al. (2023) Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886, 2023.
Dai et al. (2024) Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. arXiv preprint arXiv:2401.06066, 2024.
Dai et al. (2023) Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
Du et al. (2021) Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. Glm: General language model pretraining with autoregressive blank infilling. arXiv preprint arXiv:2103.10360, 2021.
Fedus et al. (2022) William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. The Journal of Machine Learning Research, 23(1):5232–5270, 2022.
Fu et al. (2023) Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394, 2023.
Gou et al. (2023) Yunhao Gou, Zhili Liu, Kai Chen, Lanqing Hong, Hang Xu, Aoxue Li, Dit-Yan Yeung, James T Kwok, and Yu Zhang. Mixture of cluster-conditional lora experts for vision-language instruction tuning. arXiv preprint arXiv:2312.12379, 2023.
Goyal et al. (2017a) Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6904–6913, 2017a.
Goyal et al. (2017b) Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6904–6913, 2017b.
Gurari et al. (2018) Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3608–3617, 2018.
Gururangan et al. (2021) Suchin Gururangan, Mike Lewis, Ari Holtzman, Noah A Smith, and Luke Zettlemoyer. Demix layers: Disentangling domains for modular language modeling. arXiv preprint arXiv:2108.05036, 2021.
Hudson & Manning (2019) Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 6700–6709, 2019.
Komatsuzaki et al. (2022) Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. Sparse upcycling: Training mixture-of-experts from dense checkpoints. arXiv preprint arXiv:2212.05055, 2022.
Kudugunta et al. (2021) Sneha Kudugunta, Yanping Huang, Ankur Bapna, Maxim Krikun, Dmitry Lepikhin, Minh-Thang Luong, and Orhan Firat. Beyond distillation: Task-level mixture-of-experts for efficient inference. arXiv preprint arXiv:2110.03742, 2021.
Lepikhin et al. (2020) Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.
Li et al. (2022) Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pp. 12888–12900. PMLR, 2022.
Li et al. (2023a) Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023a.
Li et al. (2023b) Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023b.
Li et al. (2023c) Yunshui Li, Binyuan Hui, ZhiChao Yin, Min Yang, Fei Huang, and Yongbin Li. Pace: Unified multi-modal dialogue pre-training with progressive and compositional experts. arXiv preprint arXiv:2305.14839, 2023c.
Liang et al. (2022) Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y Zou. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022.
Lin et al. (2023) Bin Lin, Bin Zhu, Yang Ye, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. arXiv preprint arXiv:2311.10122, 2023.
Lin et al. (2024) Bin Lin, Zhenyu Tang, Yang Ye, Jiaxi Cui, Bin Zhu, Peng Jin, Junwu Zhang, Munan Ning, and Li Yuan. Moe-llava: Mixture of experts for large vision-language models. arXiv preprint arXiv:2401.15947, 2024.
Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
Liu et al. (2023a) Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, and Lijuan Wang. Aligning large multi-modal model with robust instruction tuning. arXiv preprint arXiv:2306.14565, 2023a.
Liu et al. (2023b) Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023b.
Liu et al. (2023c) Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023c.
Liu et al. (2023d) Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281, 2023d.
Long et al. (2023) Zijun Long, George Killick, Richard McCreadie, and Gerardo Aragon Camarasa. Multiway-adapater: Adapting large-scale multi-modal models for scalable image-text retrieval. arXiv preprint arXiv:2309.01516, 2023.
Lu et al. (2022) Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521, 2022.
Ma et al. (2023) Guangyuan Ma, Xing Wu, Peng Wang, and Songlin Hu. Cot-mote: Exploring contextual masked auto-encoder pre-training with mixture-of-textual-experts for passage retrieval. arXiv preprint arXiv:2304.10195, 2023.
Microsoft (2023) Microsoft. Phi-2: The surprising power of small language models. https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models, 2023.
OpenAI (2023) OpenAI. Gpt-4 technical report, 2023.
Penedo et al. (2023) Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. The refinedweb dataset for falcon llm: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116, 2023.
Satar et al. (2022) Burak Satar, Hongyuan Zhu, Hanwang Zhang, and Joo Hwee Lim. Rome: Role-aware mixture-of-expert transformer for text-to-video retrieval. arXiv preprint arXiv:2206.12845, 2022.
Shazeer et al. (2017) Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.
Shen et al. (2023) Sheng Shen, Zhewei Yao, Chunyuan Li, Trevor Darrell, Kurt Keutzer, and Yuxiong He. Scaling vision-language models with sparse mixture of experts. arXiv preprint arXiv:2303.07226, 2023.
Singh et al. (2019a) Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8317–8326, 2019a.
Singh et al. (2019b) Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 8317–8326, 2019b.
Sun et al. (2023) Tianxiang Sun, Xiaotian Zhang, Zhengfu He, Peng Li, Qinyuan Cheng, Hang Yan, Xiangyang Liu, Yunfan Shao, Qiong Tang, Xingjian Zhao, et al. Moss: Training conversational language models from synthetic data. arXiv preprint arXiv:2307.15020, 7, 2023.
Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6):7, 2023.
Team (2023) InternLM Team. Internlm: A multilingual language model with progressively enhanced capabilities, 2023.
(53) Stability AI Language Team. Stable lm 2 1.6b. URL https://huggingface.co/stabilityai/stablelm-2-1.6b.
Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a.
Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023b.
Wang et al. (2022) Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442, 2022.
Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
Yang et al. (2023) Aiyuan Yang, Bin Xiao, Bingning Wang, Borong Zhang, Ce Bian, Chao Yin, Chenxu Lv, Da Pan, Dian Wang, Dong Yan, et al. Baichuan 2: Open large-scale language models. arXiv preprint arXiv:2309.10305, 2023.
Ye et al. (2023) Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
Yu et al. (2023a) Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023a.
Yu et al. (2023b) Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490, 2023b.
Zhang et al. (2023a) Pan Zhang, Xiaoyi Dong, Bin Wang, Yuhang Cao, Chao Xu, Linke Ouyang, Zhiyuan Zhao, Shuangrui Ding, Songyang Zhang, Haodong Duan, Hang Yan, et al. Internlm-xcomposer: A vision-language large model for advanced text-image comprehension and composition. arXiv preprint arXiv:2309.15112, 2023a.
Zhang et al. (2023b) Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107, 2023b.
Zhao et al. (2023) Bo Zhao, Boya Wu, and Tiejun Huang. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087, 2023.
Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685, 2023.
Zhu et al. (2022) Jinguo Zhu, Xizhou Zhu, Wenhai Wang, Xiaohua Wang, Hongsheng Li, Xiaogang Wang, and Jifeng Dai. Uni-perceiver-moe: Learning sparse generalist models with conditional moes. Advances in Neural Information Processing Systems, 35:2664–2678, 2022.
Zoph et al. (2022) Barret Zoph, Irwan Bello, Sameer Kumar, Nan Du, Yanping Huang, Jeff Dean, Noam Shazeer, and William Fedus. St-moe: Designing stable and transferable sparse expert models. arXiv preprint arXiv:2202.08906, 2022.