Self-MoE: Towards Compositional Large Language Models
with Self-Specialized Experts

Junmo Kang1,2   Leonid Karlinsky2   Hongyin Luo3   Zhen Wang4,5
Jacob Hansen3   James Glass3   David Cox2   Rameswar Panda2
Rogerio Feris2   Alan Ritter1   
1Georgia Institute of Technology   2MIT-IBM Watson AI Lab
3Massachusetts Institute of Technology  4UC San Diego  5MBZUAI
[email protected]
Work done during internship at MIT-IBM Watson AI Lab.
Abstract

We present Self-MoE, an approach that transforms a monolithic LLM into a compositional, modular system of self-specialized experts, named MiXSE (MiXture of Self-specialized Experts). Our approach leverages self-specialization, which constructs expert modules using self-generated synthetic data, each equipped with a shared base LLM and incorporating self-optimized routing. This allows for dynamic and capability-specific handling of various target tasks, enhancing overall capabilities, without extensive human-labeled data and added parameters. Our empirical results reveal that specializing LLMs may exhibit potential trade-offs in performances on non-specialized tasks. On the other hand, our Self-MoE demonstrates substantial improvements over the base LLM across diverse benchmarks such as knowledge, reasoning, math, and coding. It also consistently outperforms other methods, including instance merging and weight merging, while offering better flexibility and interpretability by design with semantic experts and routing. Our findings highlight the critical role of modularity and the potential of self-improvement in achieving efficient, scalable, and adaptable systems. Our code will be released upon acceptance.



1 Introduction

The remarkable success of Large Language Models (LLMs) has been largely attributed to their generalist nature, allowing them to perform a wide variety of tasks (Brown et al., 2020; Touvron et al., 2023; Jiang et al., 2023; Team et al., 2024). Predominantly designed as monolithic architectures, these models rely extensively on large-scale data to embed generalized language capabilities across vast parameter spaces. While effective, this monolithic architecture, as illustrated in Figure 1, inherently suffers from significant drawbacks such as inefficiency in scaling (Zhang et al., 2024; Wan et al., 2024), susceptibility to forgetting previously learned information when adapted to specialized tasks (Kotha et al., 2024; Huang et al., 2024), and a lack of transparency which leads to the black-box nature (Zhao et al., 2023).

Figure 1: Concept of Self-MoE, illustrating the transformation from a monolithic LLM to a compositional system, MiXSE, without extensive resources and addition of significant parameters. The results showcase MiXSE’s improved capabilities over the base LLM (e.g., Gemma-7B) across all domains, unlike the knowledge-specialized LLM that compromises other capabilities.

Meanwhile, the increasing demand to handle domain-specific or expert-level tasks has highlighted the need for specialization of LLMs (Cheng et al., 2024; Ling et al., 2023; Feng et al., 2024). However, effective tuning often relies on high-quality, human-annotated data, which is costly and challenging to scale (Kang et al., 2023b), especially in specialized domains where expertise is scarce and valuable (Wu et al., 2023). Self-specialization (Kang et al., 2023a) offers a promising alternative, aligning models with self-generated synthetic data. While this technique has proven effective in cross-task generalization within a target expert domain, we posit that it may compromise performance in areas outside the target domain.

In this paper, we explore the following question: How can we build compositional LLMs that enjoy versatile expertise, while using minimal resources? We introduce Self-MoE (Figure 1), an approach that transforms a monolithic model into a compositional (Zaharia et al., 2024) system, called MiXSE (MiXture of Self-specialized Experts). This approach differs from prior MoE work using LoRA (Hu et al., 2022), which either relies on human-labeled data (Wu et al., 2024) or assumes the existence of trained modules (Huang et al., 2023; Muqeeth et al., 2024). Instead, our Self-MoE constructs individual lightweight expert modules from scratch using synthetic data, inspired by the concept of self-specialization. Each module is integrated with the base LLM, and the entire system is enhanced by a self-optimized routing mechanism. In contrast to monolithic models, which often suffer from forgetting issues when adapted or merged under fixed, static parameters, our modular design preserves the integrity and semantics of each expert. This allows for dynamic, precise handling of various target domain tasks, boosting the model’s overall capability, adaptability, and interpretability.

Through extensive empirical studies conducted across a variety of popular domains, including knowledge, reasoning, math, and coding, we find that specialization often comes with trade-offs, typically degrading performance in non-targeted domains. However, our Self-MoE demonstrates substantial overall improvements over a base LLM across all target domains without compromising performance on other tasks. Notably, the compositional nature of our MiXSE appears to exploit synergies among experts, even outperforming all individual specialized experts.

Moreover, MiXSE clearly surpasses other strong baselines such as instance merging and weight merging, under similar settings, while offering better flexibility and interpretability. Detailed analyses highlight the critical role of the routing mechanism and the contribution of semantic experts in achieving these results. Our interpretable visualizations of routing distributions further elucidate how tasks are dynamically allocated to the most relevant experts. Lastly, we further validate that there are no issues related to forgetting unlike monolithic baselines, and that our approach can be applied to various model families and sizes.

Figure 2: Overview of the Self-MoE approach to building a compound system of specialized experts and a router in a self-improving manner. In the Self-Specialization phase (left side), the base LLM is aligned with self-generated synthetic data for each target specialization, producing lightweight expert modules. The right side shows MiXSE where each self-specialized expert is dynamically engaged based on the decisions of the self-optimized router.

2 Problem Statement

The primary focus of this work is on self-improving LLMs’ target capabilities on the fly, specifically under settings constrained by minimal resources and without the addition of significant parameters. Traditional LLMs, which are generally monolithic, require expensive human-labeled data to be better specialized, thereby limiting their adaptability and scalability when resources are constrained. We hypothesize that a modular, compositional model utilizing self-generated synthetic data for self-improvement can dramatically improve specific target capability, adaptability, and interpretability while reducing dependency on expensive human-annotated datasets.

Specifically, given a base LLM $\Theta_0$ and a minimal set of seed data (e.g., 100) for each of the target capabilities $\{T_i\}_{i=1}^{n}$ (e.g., knowledge, math), our goal is to transform $\Theta_0$ into an enhanced compositional model $\Theta_{comp}$ where $n$ target expert modules $\{\Delta\Theta_i\}_{i=1}^{n}$ are effectively integrated. Formally, the Self-MoE transformation function is defined as:

$$f_{trans}: (\Theta_0,\ \{T_i\}_{i=1}^{n}) \rightarrow \Theta_{comp} = \Theta_0 \cup \{\Delta\Theta_i\}_{i=1}^{n}$$

Here, under our problem settings, the number of parameters of $\Theta_0$ and $\Theta_{comp}$ should not be significantly different, necessitating that the expert modules $\Delta\Theta_i$ be lightweight (i.e., LoRA (Hu et al., 2022)). The available seed data are limited but can be reasonably collected (e.g., 100). Importantly, we do not assume the availability of larger/teacher models; instead, we aim to develop a method that enables self-improvement and is designed to be universally applicable.

3 Method: Self-MoE

In this section, we describe Self-MoE, our proposed framework designed to build a compositional model in which specialized expert modules and a routing component are learned in a self-training manner to cooperate effectively. At a high level, Self-MoE decomposes the monolithic structure of a base LLM into a dynamic mixture of self-specialized units, each equipped with distinct target capabilities. This section outlines the overall pipeline and architecture of Self-MoE, illustrated in Figure 2, which details both the self-specialization of individual target expert modules and their integration to form a compositional system, MiXSE (MiXture of Self-specialized Experts).

3.1 Building Expert Modules through Self-Specialization

The first step of Self-MoE is creating specialized modules $\{\Delta\Theta_i\}_{i=1}^{n}$ for each target expertise, while adhering to the desiderata discussed in Section 2. That is, the modules should be lightweight and self-improving. We employ the concept of self-specialization (Kang et al., 2023a) where a base LLM is aligned with self-generated synthetic data for target specialization, resulting in lightweight LoRA (Hu et al., 2022) experts.

Targeted Generation.

Self-specialization involves generating synthetic instruction-response data $D_i = \{(\text{inst}_i^{(1)}, \text{resp}_i^{(1)}), (\text{inst}_i^{(2)}, \text{resp}_i^{(2)}), \dots\}$ tailored to each target domain $T_i$. We ensure the data is both diverse and highly relevant to the specialized tasks/domains each module will address. The generation includes the following steps:

(1) Seed Construction: First, given a target $T_i$ identified, we prepare a small number of seed examples (e.g., 100) that capture essential characteristics and scenarios relevant to each target domain $T_i$. While we exploit existing datasets for the purpose of demonstration, we posit manual annotation for such a small number should be reasonable in real-world applications. These seeds serve as the foundational dataset from which synthetic variations are generated.

(2) Instruction Brainstorming: Once the seed examples are established, the next step is to diversify the range of instructions (and corresponding input contexts) through a brainstorming process. Specifically, we prompt a base LLM $\Theta_0$ to create new instructions following sequences of seed instructions given in-context (the prompts can be found in Tables 9-12 in the Appendix).

(3) Response Generation: The final step involves generating corresponding responses for the newly created instructions. We use seed instruction-response pairs as in-context demonstrations to extract latent relevant knowledge from $\Theta_0$.
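To make the three steps above concrete, the following is a minimal sketch of the generation loop, not the implementation used in this work. The `generate` callable stands in for sampling from the base LLM $\Theta_0$, and the prompt templates, the number of in-context shots, and the exact-match de-duplication are all illustrative assumptions; the actual prompts are those given in the Appendix.

```python
import random
from typing import Callable, List, Optional, Tuple

Pair = Tuple[str, str]  # (instruction, response)


def brainstorm_instructions(generate: Callable[[str], str], seeds: List[Pair],
                            n_new: int, n_shots: int = 3,
                            rng: Optional[random.Random] = None) -> List[str]:
    """Step 2: show seed instructions in-context and ask for a new one."""
    rng = rng or random.Random(0)
    known = {inst for inst, _ in seeds}
    new_instructions: List[str] = []
    attempts = 0
    # Bound the number of attempts so a repetitive generator cannot loop forever.
    while len(new_instructions) < n_new and attempts < 10 * n_new:
        attempts += 1
        shots = rng.sample(seeds, k=min(n_shots, len(seeds)))
        prompt = "\n".join(f"Instruction: {inst}" for inst, _ in shots)
        prompt += "\nInstruction:"  # hypothetical template
        candidate = generate(prompt).strip()
        # Naive exact-match de-duplication (an assumption of this sketch).
        if candidate and candidate not in known:
            known.add(candidate)
            new_instructions.append(candidate)
    return new_instructions


def generate_responses(generate: Callable[[str], str], seeds: List[Pair],
                       instructions: List[str], n_shots: int = 3,
                       rng: Optional[random.Random] = None) -> List[Pair]:
    """Step 3: answer each new instruction with seed pairs as demonstrations."""
    rng = rng or random.Random(0)
    data: List[Pair] = []
    for inst in instructions:
        shots = rng.sample(seeds, k=min(n_shots, len(seeds)))
        demos = "\n\n".join(f"Instruction: {i}\nResponse: {r}" for i, r in shots)
        prompt = f"{demos}\n\nInstruction: {inst}\nResponse:"  # hypothetical template
        data.append((inst, generate(prompt).strip()))
    return data


def build_synthetic_data(generate: Callable[[str], str], seeds: List[Pair],
                         n_new: int) -> List[Pair]:
    """Steps 2 and 3 chained; `seeds` is the output of step 1."""
    instructions = brainstorm_instructions(generate, seeds, n_new)
    return generate_responses(generate, seeds, instructions)
```

Calling `build_synthetic_data` once per target domain $T_i$, each with its own seed set, would yield the per-domain datasets $D_i$.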

Self-Align with LoRA.

With each specialized synthetic dataset $D_i$ in place, we now proceed with the self-alignment of $\Theta_0$ to induce specialization, separately producing lightweight expert components $\Delta\Theta_i$. Note that $D_i$ is self-generated by $\Theta_0$ and used to specialize the same $\Theta_0$ using an adapter module $\Delta\Theta_i$, resulting in a specialized model $\Theta_{spec} = \Theta_0 + \Delta\Theta_i$. Specifically, we utilize Low-Rank Adaptation (LoRA) (Hu et al., 2022), which integrates additional trainable parameters that are specific to each domain $T_i$ while keeping $\Theta_0$ intact. Within the corresponding $\Theta$, we define $\theta$ as the weights at a certain layer where LoRA is attached. Let $\theta_{spec} \in \mathbb{R}^{d \times k}$ be the updated weights at a specific LoRA layer, which can be decomposed as:

$$\begin{aligned} \theta_{spec} &= \theta_0 + \Delta\theta_i \\ &= \theta_0 + \theta_{B_i}\theta_{A_i} \end{aligned}$$

where $\theta_{B_i} \in \mathbb{R}^{d \times rank}$ and $\theta_{A_i} \in \mathbb{R}^{rank \times k}$, with $rank \ll \min(d, k)$. The forward pass becomes:

$$h = \theta_{spec}x = \theta_0 x + \theta_{B_i}\theta_{A_i}x$$
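As an illustration of this forward pass, the sketch below uses plain Python lists in place of tensors. It computes $\theta_0 x + \theta_{B_i}(\theta_{A_i} x)$ without forming the $d \times k$ update, and counts the parameters a rank-$rank$ adapter adds to one layer. The helper names and the toy dimensions are assumptions made for this example only.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(theta_0: Matrix, theta_B: Matrix, theta_A: Matrix,
                 x: Vector) -> Vector:
    """h = theta_0 x + theta_B (theta_A x); theta_0 itself is left untouched."""
    base = matvec(theta_0, x)                       # frozen path, shape d
    low_rank = matvec(theta_B, matvec(theta_A, x))  # k -> rank -> d
    return [b + u for b, u in zip(base, low_rank)]


def lora_extra_params(d: int, k: int, rank: int) -> int:
    """Parameters added to one d x k layer: theta_B (d x rank) + theta_A (rank x k)."""
    return rank * (d + k)


# Toy example with d=3, k=2, rank=1 (values chosen arbitrarily).
theta_0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
theta_B = [[1.0], [2.0], [0.0]]
theta_A = [[0.5, -1.0]]
h = lora_forward(theta_0, theta_B, theta_A, [2.0, 3.0])
```

For a hypothetical square layer with $d = k = 4096$ and $rank = 8$, `lora_extra_params` gives 65,536 added parameters against 16,777,216 in the frozen weight matrix, which is why such modules stay lightweight.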

This applies to all LoRA layers, and only $\Delta\Theta_i = \{\Delta\theta_i^{(1)}, \Delta\theta_i^{(2)}, \dots\}$ is updated during training using $D_i$. As a whole, this process of self-specialization can be defined as producing an expert module $\Delta\Theta_i$ for the $i$-th target along with the corresponding synthetic data $D_i$ (left in Figure 2):

$$f_{ss}: (\Theta_0, T_i) \rightarrow (\Delta\Theta_i, D_i)$$

We iterate this process for each target domain, focusing on knowledge, reasoning, math, and coding.

3.2 Mixture of Self-Specialized Experts

After each expert module is individually specialized through the self-specialization process, they are integrated into a compound system $\Theta_{comp}$, MiXSE (MiXture of Self-specialized Experts). MiXSE is designed to leverage the distinct capabilities of each module, orchestrating their cooperation to handle diverse tasks dynamically and efficiently. To achieve this benefit, a router module $\theta_r$ is also incorporated, which analyzes each input token to dynamically route to the most appropriate expert module based on the task at hand.

Specifically, within each layer, the output $h$ for each input $x$ is calculated by combining the contributions of the selected expert modules $\Delta\theta_i$, weighted by their relevance as determined by the router:

$$\begin{aligned} h &= \theta_0 x + \sum\nolimits_{i=1}^{n} \alpha_i \Delta\theta_i x \\ &= \theta_0 x + \sum\nolimits_{i=1}^{n} \alpha_i \theta_{B_i}\theta_{A_i} x \end{aligned}$$

where $\alpha$ represents a set of weights computed by the router (i.e., a linear layer) $\theta_r \in \mathbb{R}^{n \times k}$:

$$\alpha = \text{top-k}(\text{softmax}(\theta_r x))$$
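The routed forward pass for a single token at a single layer can be sketched as follows, again with plain Python lists. This is an illustrative reading of the two equations above rather than the released implementation: `num_selected` plays the role of the k in top-k (distinct from the input dimension $k$), and the surviving probabilities are used as-is without renormalization, which is an assumption of this sketch.

```python
import math
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def matvec(m: Matrix, v: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def softmax(logits: Vector) -> Vector:
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_mask(probs: Vector, num_selected: int) -> Vector:
    """Keep the largest `num_selected` probabilities and zero out the rest."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep = set(order[:num_selected])
    return [p if i in keep else 0.0 for i, p in enumerate(probs)]


def mixse_layer(theta_0: Matrix, experts: List[Tuple[Matrix, Matrix]],
                theta_r: Matrix, x: Vector,
                num_selected: int = 1) -> Tuple[Vector, Vector]:
    """h = theta_0 x + sum_i alpha_i * theta_B_i (theta_A_i x)."""
    alpha = top_k_mask(softmax(matvec(theta_r, x)), num_selected)
    h = matvec(theta_0, x)  # the base weights always contribute
    for a_i, (theta_B, theta_A) in zip(alpha, experts):
        if a_i == 0.0:
            continue  # masked experts are skipped entirely
        update = matvec(theta_B, matvec(theta_A, x))
        h = [h_j + a_i * u_j for h_j, u_j in zip(h, update)]
    return h, alpha
```

Setting `num_selected` to the number of experts recovers a dense mixture, which corresponds to the "All Experts" configuration examined in the ablation study.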

Note that we keep only the top-k probabilities and mask out the others to reduce computation. In essence, this also allows the pre-trained base weights $\theta_0$ to contribute sufficiently, mitigating potential issues of over-specialization such as forgetting or diminished generalizability. The router $\theta_r$ is a linear layer, shared across all LoRA layers, and is trained using the aggregated self-generated data $D = \{D_i\}_{i=1}^{n}$ to learn how to optimally select modules for a given task:

$$L(\theta_r) = -\mathbb{E}_{(\text{inst},\ \text{resp}) \sim D}\left[\log P_{\Theta_0}(\text{resp} \mid \text{inst};\ \theta_r, \{\Delta\Theta_i\}_{i=1}^{n})\right]$$

Here, we solely optimize the router to preserve the explicit semantic distinction of expert modules.

| Method | Knowledge (MMLU) | Reasoning (BBH) | Math (GSM8K) | Coding (HumanEval) | Avg. |
|---|---|---|---|---|---|
| Base LLM | 58.4 | 56.1 | 42.5 | 34.1 | 47.8 |
| Specialized LLM for Each Capability | | | | | |
| Knowledge Self-Spec. | 64.0 | 41.7 | 40.5 | 28.0 | 43.6 |
| Reasoning Self-Spec. | 60.1 | 60.2 | 41.0 | 28.7 | 47.5 |
| Math Self-Spec. | 59.3 | 58.9 | 50.0 | 36.0 | 51.1 |
| Coding Self-Spec. | 57.2 | 57.2 | 46.0 | 37.2 | 49.4 |
| Merging Methods | | | | | |
| Instance Merging | 62.6 | 57.6 | 53.5 | 36.0 | 52.4 |
| TIES Merging | 63.7 | 56.3 | 38.5 | 32.9 | 47.9 |
| DARE Merging | 37.7 | 59.6 | 45.0 | 34.8 | 44.3 |
| MiXSE (Ours) | 65.6 (↑ 7.2) | 61.1 (↑ 5.0) | 52.5 (↑ 10.0) | 37.8 (↑ 3.7) | 54.3 (↑ 6.5) |

Table 1: Main results. All models are built upon the same base LLM, Gemma-7B, and take self-improving approaches. Corresponding aligned performances of self-specialization are underscored. Each column’s best performance is highlighted in bold, while the gains achieved by our MiXSE over the base LLM are indicated.

4 Experiments and Results

Datasets.

We evaluate Self-MoE across diverse domains that can be categorized as knowledge, reasoning, math, and coding. We use the following benchmark datasets:

MMLU (Massive Multitask Language Understanding, 0-shot unless otherwise stated) (Hendrycks et al., 2021a): A collection of 57 academic knowledge tasks.

BBH (BIG-Bench Hard, 3-shot) (Suzgun et al., 2022): A set of 27 challenging reasoning tasks.

GSM8K (Grade School Math 8K, 8-shot) (Cobbe et al., 2021): A diverse set of grade school math word problems.

HumanEval (0-shot) (Chen et al., 2021): A hand-written evaluation set of Python programming problems.

To test generalization (Section 4.4), we additionally evaluate on MATH (4-shot) (Hendrycks et al., 2021b), MBPP (3-shot) (Austin et al., 2021), NaturalQuestions (5-shot) (Kwiatkowski et al., 2019), TriviaQA (5-shot) (Joshi et al., 2017), Hellaswag (0-shot) (Zellers et al., 2019), PIQA (0-shot) (Bisk et al., 2020), and TruthfulQA (0-shot) (Lin et al., 2022).

Baselines.

To assess the effectiveness of Self-MoE, we compare performance against several baselines:

Four Self-Specialized Models (Kang et al., 2023a): Trained on self-generated synthetic data for individual domains.

Instance Merging (Multi-Task Tuning): Leverages the aggregated synthetic data generated by self-specialization to train a model capable of handling multiple tasks simultaneously.

TIES (Yadav et al., 2023), DARE (Yu et al., 2024): Advanced weight merging methods integrating multiple expert strengths into a unified model.

We also contextualize these results with computationally intensive methods reported in the literature, although the comparisons are indirect: BTM (Li et al., 2022), Sparse Upcycling (Komatsuzaki et al., 2023), BTX (Sukhbaatar et al., 2024), GLAN (Li et al., 2024a), Orca (Mitra et al., 2023), and Merlinite (Sudalairaj et al., 2024) in Appendix C.1.

Implementation Details.

We adopt Gemma-7B (Team et al., 2024) as a base LLM for our main experiments, and additionally apply Self-MoE to various models, such as LLaMA-2 7B & 13B (Touvron et al., 2023), Mistral 7B (Jiang et al., 2023), and LLaMA-3 8B (AI@Meta, 2024) in Section 4.5. We use 100 seeds to generate 5K synthetic examples for each domain, resulting in 20K examples in total. Each LoRA module adds less than 0.3% of the base model's parameters, and the router's parameters are negligible, so the parameters added by MiXSE amount to only about 1%.

4.1 Main Results

In Table 1, we showcase comparative benchmark results of various approaches across four specialized domains: knowledge, reasoning, math, and coding. All baselines use self-generated synthetic data based on the same Base LLM, Gemma-7B, to ensure fair comparisons.

First, we confirm that self-specialization markedly enhances target-specific expertise compared to the base LLM. For instance, we can see substantial gains from the corresponding specialized models (e.g., Knowledge Self-Spec. in the knowledge domain): 58.4 to 64.0 in knowledge, 56.1 to 60.2 in reasoning, and so on. However, this focused improvement sometimes comes at the cost of reduced performance in non-targeted areas, as evidenced by the drop in scores for the Knowledge Self-Spec. model in reasoning, math, and coding. This trade-off highlights the inherent limitation of over-specialization. In contrast, our MiXSE demonstrates consistent improvements across all domains, due to its modular, compositional architecture that makes use of dynamic routing to leverage optimal experts. Surprisingly, it even outperforms all corresponding specialized models, indicating that it effectively synergizes the strengths of each specialization.

In comparison with other static merging methods like Instance Merging, TIES, and DARE, MiXSE stands out for its superior adaptability. While they attempt to combine the strengths of different specialization areas into a single model, they lack the dynamic flexibility that MiXSE offers. Notably, simple instance merging (i.e., multi-task learning), though effective in enhancing the base LLM across domains, still falls short of achieving the superior average performance of 54.3 seen with MiXSE. This validates the advantages of dynamic expert integration in a compositional system.
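The averages and gains quoted in this section follow directly from the per-benchmark scores in Table 1. The short script below recomputes them as a sanity check; the scores are copied from Table 1, while the variable names are introduced here for illustration only.

```python
# Scores from Table 1, in the order: MMLU, BBH, GSM8K, HumanEval.
scores = {
    "Base LLM": [58.4, 56.1, 42.5, 34.1],
    "Instance Merging": [62.6, 57.6, 53.5, 36.0],
    "TIES Merging": [63.7, 56.3, 38.5, 32.9],
    "DARE Merging": [37.7, 59.6, 45.0, 34.8],
    "MiXSE": [65.6, 61.1, 52.5, 37.8],
}


def average(values):
    return sum(values) / len(values)


# Unrounded means; Table 1 reports them to one decimal place.
averages = {name: average(vals) for name, vals in scores.items()}

# Per-benchmark gain of MiXSE over the base LLM.
gains = [round(m - b, 1) for m, b in zip(scores["MiXSE"], scores["Base LLM"])]

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}")
print("MiXSE gains over Base LLM:", gains)
```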

4.2 Ablation Study

Now that we have verified the effectiveness of MiXSE as a whole, we evaluate the impact of different configurations and components of the system, presented in Table 2. The configurations vary in terms of routing strategies and integration of experts, offering insights into the contributions of each element to the system’s overall effectiveness.

We start by examining the Top-k routing strategy, which plays a crucial role in our model. Our findings show that both the Top-1 and Top-2 expert configurations deliver the best performance. This suggests that identifying and leveraging the most relevant expert for a given task is typically sufficient and most effective. On a side note, the similar performances of the different configurations may highlight the robustness of our method. Given the similar performances, we prefer the Top-1 expert setup for better efficiency.

Interestingly, the results also indicate a drop in performance when using All Experts. This can be attributed to the fact that involving all experts regardless of their relevance can introduce noise and dilute the specific contributions of the most pertinent experts. Additionally, involving more experts than necessary increases computational overhead.

Furthermore, employing Random Routing serves as a useful setup to highlight the effectiveness of strategic expert selection of our Self-Optimized Router, which is a key component of our MiXSE. We observe that the performance significantly decreases under this configuration, highlighting the router’s role in dynamically tailoring the selection of experts according to the specific requirements of each task. The router’s ability to discern and activate the most suitable experts based on the context is critical for optimizing performance. Notably, this ability is learned by relying on a very small amount of seed data.

Another interesting finding comes from the configuration where experts and the router are jointly trained, which means that the semantic distinctions among experts may be diluted. This setup substantially decreases performance relative to scenarios where the router and experts are optimized independently. This decline validates that semantic experts play a crucial role in enhancing the system’s capability to handle tasks requiring specific expertise, while offering better interpretability (Section 4.3).

| Configuration | Knowledge (MMLU) | Reasoning (BBH) | Math (GSM8K) | Coding (HumanEval) | Avg. |
|---|---|---|---|---|---|
| Base LLM | 58.4 | 56.1 | 42.5 | 34.1 | 47.8 |
| Top-k Routing | | | | | |
| w/ Top-1 Expert | 65.6 | 61.1 | 52.5 | 37.8 | 54.3 |
| w/ Top-2 Experts | 65.5 | 60.9 | 52.5 | 38.4 | 54.3 |
| w/ All Experts | 65.4 | 58.9 | 54.0 | 33.5 | 53.0 |
| Random Routing | | | | | |
| w/o Self-Optimized Router | 59.9 | 58.5 | 48.0 | 36.6 | 50.8 |
| Experts & Router Joint Training | | | | | |
| w/o Semantic Experts | 64.5 | 58.1 | 46.0 | 33.5 | 50.5 |

Table 2: Analysis and ablation of the router in our MiXSE. Configurations vary to investigate the optimal number of experts used, to verify the possibility of self-learning for the router, and to see the importance of semantic distinctions among experts within the compositional system.
Figure 3: Routing analysis that shows routing distributions over four domains for each benchmark, averaging the weights across tokens within individual tasks.

4.3 Routing Analysis

Understanding how MiXSE allocates tasks to its various experts is crucial for gauging both its efficiency and interpretability. By analyzing the routing distributions across four distinct domains, we examine whether the system matches queries to the most suitable experts. Figure 3 presents the routing distributions used to solve each benchmark, where the weights are averaged across tokens and layers within individual tasks.
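
The averaging behind such a routing distribution can be sketched as follows. This is a minimal illustration, assuming that routing weights have been recorded for every example as an array of shape (layers, tokens, experts); the function name `routing_distribution` and the choice to average within each example before averaging across examples are assumptions of this sketch, not details confirmed by the paper.

```python
import numpy as np

EXPERTS = ["knowledge", "reasoning", "math", "coding"]

def routing_distribution(weights_per_example):
    """Average routing weights for one benchmark.

    weights_per_example: list of arrays, each of shape
    (n_layers, n_tokens, n_experts), one array per example.
    """
    # Average over layers and tokens within each example first, so that
    # long examples do not dominate the benchmark-level distribution.
    per_example = [w.mean(axis=(0, 1)) for w in weights_per_example]
    # Then average across the examples of the benchmark.
    return np.mean(per_example, axis=0)
```

Calling this once per benchmark yields one distribution over the four experts per benchmark, which is the kind of quantity a figure such as Figure 3 visualizes.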

We first observe that MiXSE’s router effectively selects the correct expert for each corresponding target. This is evident from the strong alignment between tasks and the experts chosen by the router; for example, the knowledge expert predominantly handles knowledge tasks, while coding tasks are routed to the coding expert. This highlights the router’s ability to learn and apply this routing automatically and consistently, making the system’s decisions interpretable and trustworthy.

Beyond the direct matching of tasks to domain-specific experts, the router also demonstrates its ability to exploit synergies between different areas of expertise. For instance, the reasoning expert is frequently involved in knowledge, math, and coding tasks, reflecting the system’s compositional use of expertise. This may explain why MiXSE outperforms even the individual specialized modules across all domains in Table 1.

| Category | Benchmark | Base LLM | Instance Merging | MiXSE |
|---|---|---|---|---|
| Target | | | | |
| Academic Knowledge | MMLU | 58.4 | 62.6 | 65.6 |
| Reasoning | BBH | 56.1 | 57.6 | 61.1 |
| Math | GSM8K | 42.5 | 53.5 | 52.5 |
| Coding | HumanEval | 34.1 | 36.0 | 37.8 |
| Target Average | | 47.8 | 52.4 | 54.3 |
| Non-Target (In-Expertise) | | | | |
| Math | MATH | 20.7 | 15.3 | 21.4 |
| Coding | MBPP | 37.8 | 37.6 | 39.6 |
| Non-Target (Out-of-Expertise) | | | | |
| World Knowledge | Natural Questions | 24.2 | 22.3 | 24.5 |
| World Knowledge | TriviaQA | 63.9 | 58.6 | 62.5 |
| Commonsense | Hellaswag | 80.6 | 78.0 | 80.7 |
| Commonsense | PIQA | 81.1 | 80.1 | 81.2 |
| Safety | TruthfulQA | 44.7 | 42.2 | 44.3 |
| Non-Target Average | | 50.4 | 47.7 | 50.6 |

Table 3: Investigation of generalization and forgetting in Self-MoE. Non-Target (In-Expertise) denotes benchmarks that are relevant to the targets but for which MiXSE is not directly specialized using seed data. Non-Target (Out-of-Expertise) denotes benchmarks unrelated to the targets.

4.4 Generalizability Test

While Self-MoE has shown clear benefits in target benchmarks such as MMLU, BBH, GSM8K, and HumanEval, one may be curious about its generalizability to non-targets, or concerned with the potential issues of specialization such as forgetting. In Table 3, we conduct an investigation using non-targeted benchmarks that were not utilized in building MiXSE.

On the MATH and MBPP benchmarks, which are closely related to the target benchmarks GSM8K and HumanEval, respectively, we find that Self-MoE still improves over the base LLM even though they were not directly targeted in our training regime. This finding supports the generalizability of the Self-MoE approach.

Concerning the potential side effect of forgetting, we extend our testing to domains such as world knowledge, common sense, and safety, which are rarely associated with the targets directly. Our experiments show that, overall, there are rarely meaningful drops in performance when applying Self-MoE. The most noticeable drop with MiXSE is on TriviaQA (63.9 to 62.5), which is still substantially smaller than the drop under instance merging (63.9 to 58.6). This suggests that our approach largely preserves existing knowledge for non-targets while significantly boosting target performance, unlike monolithic baselines.
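
To make the forgetting comparison concrete, the short sketch below recomputes the per-benchmark changes and the non-target averages from the numbers in Table 3. The scores are copied from the table; the helper names (`deltas`, `column_mean`) are hypothetical.

```python
# Non-target scores copied from Table 3: (Base LLM, Instance Merging, MiXSE).
NON_TARGET = {
    "MATH": (20.7, 15.3, 21.4),
    "MBPP": (37.8, 37.6, 39.6),
    "Natural Questions": (24.2, 22.3, 24.5),
    "TriviaQA": (63.9, 58.6, 62.5),
    "Hellaswag": (80.6, 78.0, 80.7),
    "PIQA": (81.1, 80.1, 81.2),
    "TruthfulQA": (44.7, 42.2, 44.3),
}

def deltas(scores):
    """Change relative to the base LLM: (instance merging, MiXSE)."""
    return {
        name: (round(merged - base, 1), round(mixse - base, 1))
        for name, (base, merged, mixse) in scores.items()
    }

def column_mean(scores, column):
    """Mean of one column (0: base, 1: instance merging, 2: MiXSE)."""
    values = [row[column] for row in scores.values()]
    return round(sum(values) / len(values), 1)

if __name__ == "__main__":
    for name, (d_merge, d_mixse) in deltas(NON_TARGET).items():
        print(f"{name:18s} merging {d_merge:+.1f}   MiXSE {d_mixse:+.1f}")
    print("averages:", [column_mean(NON_TARGET, c) for c in range(3)])
```

The recomputed averages match the Non-Target Average row of Table 3 (50.4, 47.7, 50.6), and the per-benchmark changes show that instance merging falls below the base LLM on every non-target benchmark listed.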

Figure 4: Results of Self-MoE applied to other LLMs.

4.5 Applicability to Other Base LLMs

Following the successful demonstration of our Self-MoE approach based on Gemma-7B, Figure 4 presents the results of applying Self-MoE to other base LLMs. We use diverse model variants including LLaMA-2 7B & 13B, Mistral 7B, and LLaMA-3 8B. Our findings suggest that our approach improves all of these models regardless of model family, size, and level of base performance. This is significant, as it might imply that one can take any monolithic model and enjoy a free upgrade to a compositional system that offers better effectiveness, flexibility, and interpretability.

4.6 Impact of the Amount of Synthetic Data

Figure 5 illustrates the impact of scaling self-generated synthetic data for Self-MoE. As the data scales from 0 to 20K, our MiXSE model demonstrates substantial and consistent improvements over the base LLM in average performance across domains, suggesting the scalable potential of Self-MoE. Instance Merging, serving as a strong baseline, also benefits from increased data, but its gains progress at a slower rate, as evidenced by the linear trendlines. This reflects the inefficiency of the static merging scheme, which, being monolithic, suffers from trade-offs between knowledge gains and forgetting.

Figure 5: Analysis with the varied sizes of self-generated synthetic data for Self-MoE.

| # Experts | Knowledge (MMLU) | Reasoning (BBH) | Math (GSM8K) | Coding (HumanEval) | Avg. |
|---|---|---|---|---|---|
| 0 (Base LLM) | 58.4 | 56.1 | 42.5 | 34.1 | 47.8 |
| 1 (K) | 64.0 | 41.7 | 40.5 | 28.0 | 43.6 |
| 2 (K+R) | 65.8 | 58.0 | 43.0 | 32.3 | 49.8 |
| 3 (K+R+M) | 62.7 | 61.5 | 54.5 | 32.9 | 52.9 |
| 4 (K+R+M+C) | 65.6 | 61.1 | 52.5 | 37.8 | 54.3 |

Table 4: Scaling the number of experts. K: Knowledge expert. R: Reasoning expert. M: Math expert. C: Coding expert.

4.7 Scaling the Number of Experts

In Table 4, we present the results of MiXSE composed of varying numbers of experts, with experts added progressively one at a time in the order of knowledge, reasoning, math, and coding. The results indicate that starting with the knowledge expert, which initially exhibits a performance trade-off, subsequent additions of reasoning, math, and coding experts consistently enhance overall performance. This highlights the compositional MiXSE’s advantage of adaptability and modularity.
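
The effect of each added expert can be read off Table 4 by differencing consecutive rows. The sketch below does this with the per-domain scores copied from the table; the function name `stepwise_changes` is hypothetical.

```python
DOMAINS = ["Knowledge", "Reasoning", "Math", "Coding"]

# Per-domain scores copied from Table 4, in the order experts are added.
TABLE4 = [
    ("Base LLM", [58.4, 56.1, 42.5, 34.1]),
    ("K", [64.0, 41.7, 40.5, 28.0]),
    ("K+R", [65.8, 58.0, 43.0, 32.3]),
    ("K+R+M", [62.7, 61.5, 54.5, 32.9]),
    ("K+R+M+C", [65.6, 61.1, 52.5, 37.8]),
]

def stepwise_changes(rows):
    """Per-domain change caused by each newly added expert."""
    changes = []
    for (_, prev), (name, cur) in zip(rows, rows[1:]):
        changes.append((name, [round(c - p, 1) for p, c in zip(prev, cur)]))
    return changes

if __name__ == "__main__":
    for name, diff in stepwise_changes(TABLE4):
        summary = ", ".join(f"{d} {v:+.1f}" for d, v in zip(DOMAINS, diff))
        print(f"{name:8s} {summary}")
```

Read this way, each expert gives the largest lift in its own domain at the step where it is added (for example, Math rises by 11.5 points when the math expert joins), while the knowledge-only configuration shows the trade-off noted above, with reasoning falling by 14.4 points.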

| Method | Compositional | Semantic Experts | Lightweight | Data & Resource-Efficient | w/o Teacher & Labels |
|---|---|---|---|---|---|
| Base LLM | | | | | |
| Gemma 7B | ✗ | - | - | - | - |
| LLaMA-2 70B | ✗ | - | - | - | - |
| Mixtral 8x7B | ✓ | ✗ | ✗ | - | - |
| Pre-training Methods | | | | | |
| Branch-Train-Merge (4x7B) | ✓ | ✓ | ✗ | ✗ | ✓ |
| Sparse Upcycling (4x7B) | ✓ | ✓ | ✗ | ✗ | ✓ |
| Branch-Train-Mix (4x7B) | ✓ | ✓ | ✗ | ✗ | ✓ |
| MoE w/ LoRA | | | | | |
| PHATGOOSE | ✓ | ✓ | ✓ | ✗ | ✗ |
| MOLE | ✓ | ✓ | ✓ | ✗ | ✗ |
| Distillation from Larger Models | | | | | |
| GLAN 7B (w/ GPT-4) | ✗ | - | - | ✗ | ✗ |
| Orca-2 7B (w/ GPT-4) | ✗ | - | - | ✗ | ✗ |
| Merlinite 7B (w/ Mixtral 8x7B) | ✗ | - | - | ✗ | ✗ |
| Self-Improving | | | | | |
| MiXSE (Gemma 7B) | ✓ | ✓ | ✓ | ✓ | ✓ |

Table 5: Comprehensive summary of relevant models for reference. Detailed discussions in Appendix C.1.

5 Related Work

To offer a broader perspective, Table 5 presents a comprehensive summary of various models that, while relevant, are not directly comparable. For further discussions and a more detailed comparison, please refer to Appendix C.1.

Combination of Experts.

There have been numerous efforts to combine the strengths of multiple models or modules. Mixture of Experts (MoE) models such as Switch Transformer (Fedus et al., 2022), GLaM (Du et al., 2022), and Mixtral (Jiang et al., 2024) exemplify this, dynamically allocating tasks based on the expertise of each component for better efficiency and scalability. These models contrast with ours in that they do not prioritize lightweight experts, resulting in larger models with more parameters. Unlike their experts, which are learned implicitly during pre-training, Self-MoE explicitly creates semantic experts for targeted improvements.

Another relevant area is model merging, which involves the weighted averaging of multiple models to form a single, aggregated model (Wortsman et al., 2022; Matena and Raffel, 2022; Ilharco et al., 2023; Jin et al., 2023). One of the leading methods, TIES (Yadav et al., 2023), tackles conflicts and parameter inconsistencies among models. DARE (Yu et al., 2024) further reduces the redundancy of parameters. However, these methods are fundamentally static in that they operate with fixed parameters once merged, which may lead to interference, and they lack the dynamic flexibility that MiXSE offers.
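
As a point of reference for what "static" means here, the sketch below shows plain weighted averaging of parameters, the simplest form of model merging. It is not TIES or DARE, which add conflict resolution and parameter pruning on top of this idea. Models are represented as dictionaries of NumPy arrays, and the function name `merge_weights` is hypothetical.

```python
import numpy as np

def merge_weights(state_dicts, coeffs=None):
    """Weighted average of several models with identical parameter names."""
    n = len(state_dicts)
    if coeffs is None:
        coeffs = [1.0 / n] * n                # uniform average by default
    if len(coeffs) != n or abs(sum(coeffs) - 1.0) > 1e-8:
        raise ValueError("need one coefficient per model, summing to 1")
    merged = {}
    for key in state_dicts[0]:
        # The merged parameter is fixed once computed: every input is
        # processed with the same weights, regardless of its domain.
        merged[key] = sum(c * sd[key] for c, sd in zip(coeffs, state_dicts))
    return merged
```

Because the coefficients are chosen once and baked into a single set of parameters, there is no per-input decision about which source model should dominate, which is the contrast with a routed system that the paragraph above draws.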

There exist notable recent MoE models that similarly explore the utilization of semantic experts, albeit in distinct contexts: MOLE (Wu et al., 2024), PHATGOOSE (Muqeeth et al., 2024), and BTX (Sukhbaatar et al., 2024). MOLE relies on human-labeled data, and PHATGOOSE assumes the availability of existing expert models trained by external creators and necessitates additional router training on the creators’ side. BTX relies on extensive pre-training, demanding significant resources; as a pre-trained model, however, it holds the potential to complement our self-training approach. Unlike MOLE and PHATGOOSE, our Self-MoE framework creates experts and a router from scratch through self-improvement, and, in contrast to BTX, it does so with minimal resources.

Self-Improvement and Specialization of LLMs.

The pursuit of enhancing the capabilities of LLMs often revolves around an instruction-tuning scheme, which can significantly boost cross-task generalizability (Ouyang et al., 2022; Su et al., 2022; Mishra et al., 2022; Wei et al., 2022). Because expensive annotation limits scalability, the self-training concept (Luo, 2022) has gained attention from the community, where LLMs are aligned with automatically self-generated synthetic instructions (Wang et al., 2023; Sun et al., 2023; Li et al., 2024b). These are distinguished from distillation techniques (Hinton et al., 2015; Kang et al., 2023b), which assume a stronger teacher model (Mitra et al., 2023; Li et al., 2024a; Sudalairaj et al., 2024), limiting their applicability.

With the growing necessity to adapt generalist models to specific domains, Kang et al. (2023a) adopt the self-training paradigm for specialization, addressing the problem that general instruction tuning is rarely effective in expert domains. While this work serves as a foundation for enhancing specialized expertise with minimal resources, we recognize the inherent trade-offs of its monolithic structure, such as potential compromises in performance outside of the specialized domains. In contrast, our Self-MoE achieves uncompromised expertise in multiple domains by taking a modular approach, without requiring extensive resources or many added parameters.

6 Conclusion

In this study, we proposed Self-MoE to build compositional LLMs with self-specialized experts, MiXSE, to enhance targeted capabilities, adaptability, and interpretability without the reliance on extensive human-labeled data. Empirical evaluations across diverse domains demonstrated that MiXSE significantly enhances base LLM performance and overcomes specialization trade-offs. This work marks a significant step towards modular, self-improving paradigms which can address the inherent limitations of monolithic models, offering a promising direction for future LLM research.

Limitations

While our study demonstrates promising results for Self-MoE, we recognize areas requiring further investigation in future work. Employing self-specialization (Kang et al., 2023a) to generate synthetic data within our framework may raise concerns about potential data contamination and noise. Nonetheless, Kang et al. (2023a) conducted an n-gram overlap analysis between the self-specialization data and test data and found no significant overlap, which alleviates the concerns about contamination. Despite this, continuous monitoring of potential biases from pre-training and the development of enhanced data validation and noise filtering strategies remain critical. Moreover, due to computational constraints, we did not scale our model and data to their full potential. Future work should therefore concentrate on overcoming these limitations, which will enable better data quality and more extensive training to unveil the full potential of the Self-MoE framework.
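
An n-gram overlap check of the kind mentioned above can be sketched as follows. The exact procedure of Kang et al. (2023a) is not described here, so the n-gram size, lowercasing, whitespace tokenization, and the function names (`ngrams`, `overlap_ratio`) are all assumptions made for illustration.

```python
def ngrams(text, n):
    """Set of word n-grams in a text (lowercased, whitespace-tokenized)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def overlap_ratio(synthetic_texts, test_texts, n=3):
    """Fraction of distinct test-set n-grams that also occur in synthetic data."""
    synthetic = set().union(*(ngrams(t, n) for t in synthetic_texts))
    test = set().union(*(ngrams(t, n) for t in test_texts))
    if not test:
        return 0.0
    return len(test & synthetic) / len(test)

if __name__ == "__main__":
    synthetic = ["the cat sat on the mat"]
    test = ["a cat sat on a chair"]
    print(overlap_ratio(synthetic, test, n=3))
```

In this toy example one of the four test trigrams ("cat sat on") also appears in the synthetic text, giving a ratio of 0.25. In practice a larger n is often preferred so that common short phrases do not count as contamination.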

References

  • AI@Meta (2024) AI@Meta. 2024. Llama 3 model card.
  • Austin et al. (2021) Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, and Charles Sutton. 2021. Program synthesis with large language models. Preprint, arXiv:2108.07732.
  • Ben Allal et al. (2022) Loubna Ben Allal, Niklas Muennighoff, Logesh Kumar Umapathi, Ben Lipkin, and Leandro von Werra. 2022. A framework for the evaluation of code generation models. https://github.com/bigcode-project/bigcode-evaluation-harness.
  • Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Ronan Le bras, Jianfeng Gao, and Yejin Choi. 2020. Piqa: Reasoning about physical commonsense in natural language. Proceedings of the AAAI Conference on Artificial Intelligence, 34(05):7432–7439.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
  • Buehler and Buehler (2024) Eric L. Buehler and Markus J. Buehler. 2024. X-lora: Mixture of low-rank adapter experts, a flexible framework for large language models with applications in protein mechanics and molecular design. Preprint, arXiv:2402.07148.
  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. Preprint, arXiv:2107.03374.
  • Cheng et al. (2024) Daixuan Cheng, Shaohan Huang, and Furu Wei. 2024. Adapting large language models via reading comprehension. In The Twelfth International Conference on Learning Representations.
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training verifiers to solve math word problems. Preprint, arXiv:2110.14168.
  • Du et al. (2022) Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, Barret Zoph, Liam Fedus, Maarten P Bosma, Zongwei Zhou, Tao Wang, Emma Wang, Kellie Webster, Marie Pellat, Kevin Robinson, Kathleen Meier-Hellstern, Toju Duke, Lucas Dixon, Kun Zhang, Quoc Le, Yonghui Wu, Zhifeng Chen, and Claire Cui. 2022. GLaM: Efficient scaling of language models with mixture-of-experts. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 5547–5569. PMLR.
  • Fedus et al. (2022) William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39.
  • Feng et al. (2024) Shangbin Feng, Weijia Shi, Yuyang Bai, Vidhisha Balachandran, Tianxing He, and Yulia Tsvetkov. 2024. Knowledge card: Filling LLMs’ knowledge gaps with plug-in specialized language models. In The Twelfth International Conference on Learning Representations.
  • Gao et al. (2023) Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac’h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2023. A framework for few-shot language model evaluation.
  • Hendrycks et al. (2021a) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021a. Measuring massive multitask language understanding. In International Conference on Learning Representations.
  • Hendrycks et al. (2021b) Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021b. Measuring mathematical problem solving with the math dataset. NeurIPS.
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the knowledge in a neural network. In NIPS Deep Learning and Representation Learning Workshop.
  • Hu et al. (2022) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.
  • Huang et al. (2023) Chengsong Huang, Qian Liu, Bill Yuchen Lin, Tianyu Pang, Chao Du, and Min Lin. 2023. Lorahub: Efficient cross-task generalization via dynamic lora composition. Preprint, arXiv:2307.13269.
  • Huang et al. (2024) Jianheng Huang, Leyang Cui, Ante Wang, Chengyi Yang, Xinting Liao, Linfeng Song, Junfeng Yao, and Jinsong Su. 2024. Mitigating catastrophic forgetting in large language models with self-synthesized rehearsal. Preprint, arXiv:2403.01244.
  • Ilharco et al. (2023) Gabriel Ilharco, Marco Tulio Ribeiro, Mitchell Wortsman, Ludwig Schmidt, Hannaneh Hajishirzi, and Ali Farhadi. 2023. Editing models with task arithmetic. In The Eleventh International Conference on Learning Representations.
  • Jiang et al. (2023) Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. Mistral 7b. Preprint, arXiv:2310.06825.
  • Jiang et al. (2024) Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven Le Scao, Théophile Gervet, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2024. Mixtral of experts. Preprint, arXiv:2401.04088.
  • Jin et al. (2023) Xisen Jin, Xiang Ren, Daniel Preotiuc-Pietro, and Pengxiang Cheng. 2023. Dataless knowledge fusion by merging weights of language models. In The Eleventh International Conference on Learning Representations.
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada. Association for Computational Linguistics.
  • Kang et al. (2023a) Junmo Kang, Hongyin Luo, Yada Zhu, James Glass, David Cox, Alan Ritter, Rogerio Feris, and Leonid Karlinsky. 2023a. Self-specialization: Uncovering latent expertise within large language models. arXiv preprint arXiv:2310.00160.
  • Kang et al. (2023b) Junmo Kang, Wei Xu, and Alan Ritter. 2023b. Distill or annotate? cost-efficient fine-tuning of compact models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11100–11119, Toronto, Canada. Association for Computational Linguistics.
  • Komatsuzaki et al. (2023) Aran Komatsuzaki, Joan Puigcerver, James Lee-Thorp, Carlos Riquelme Ruiz, Basil Mustafa, Joshua Ainslie, Yi Tay, Mostafa Dehghani, and Neil Houlsby. 2023. Sparse upcycling: Training mixture-of-experts from dense checkpoints. In The Eleventh International Conference on Learning Representations.
  • Kotha et al. (2024) Suhas Kotha, Jacob Mitchell Springer, and Aditi Raghunathan. 2024. Understanding catastrophic forgetting in language models via implicit inference. In The Twelfth International Conference on Learning Representations.
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural questions: A benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:452–466.
  • Li et al. (2024a) Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, Yuxian Gu, Xin Cheng, Xun Wang, Si-Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, and Furu Wei. 2024a. Synthetic data (almost) from scratch: Generalized instruction tuning for language models. Preprint, arXiv:2402.13064.
  • Li et al. (2022) Margaret Li, Suchin Gururangan, Tim Dettmers, Mike Lewis, Tim Althoff, Noah A. Smith, and Luke Zettlemoyer. 2022. Branch-train-merge: Embarrassingly parallel training of expert language models. Preprint, arXiv:2208.03306.
  • Li et al. (2024b) Xian Li, Ping Yu, Chunting Zhou, Timo Schick, Omer Levy, Luke Zettlemoyer, Jason E Weston, and Mike Lewis. 2024b. Self-alignment with instruction backtranslation. In The Twelfth International Conference on Learning Representations.
  • Liang et al. (2023) Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, Benjamin Newman, Binhang Yuan, Bobby Yan, Ce Zhang, Christian Alexander Cosgrove, Christopher D Manning, Christopher Re, Diana Acosta-Navas, Drew Arad Hudson, Eric Zelikman, Esin Durmus, Faisal Ladhak, Frieda Rong, Hongyu Ren, Huaxiu Yao, Jue WANG, Keshav Santhanam, Laurel Orr, Lucia Zheng, Mert Yuksekgonul, Mirac Suzgun, Nathan Kim, Neel Guha, Niladri S. Chatterji, Omar Khattab, Peter Henderson, Qian Huang, Ryan Andrew Chi, Sang Michael Xie, Shibani Santurkar, Surya Ganguli, Tatsunori Hashimoto, Thomas Icard, Tianyi Zhang, Vishrav Chaudhary, William Wang, Xuechen Li, Yifan Mai, Yuhui Zhang, and Yuta Koreeda. 2023. Holistic evaluation of language models. Transactions on Machine Learning Research. Featured Certification, Expert Certification.
  • Lin et al. (2022) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3214–3252, Dublin, Ireland. Association for Computational Linguistics.
  • Ling et al. (2023) Chen Ling, Xujiang Zhao, Jiaying Lu, Chengyuan Deng, Can Zheng, Junxiang Wang, Tanmoy Chowdhury, Yun Li, Hejie Cui, Xuchao Zhang, Tianjiao Zhao, Amit Panalkar, Dhagash Mehta, Stefano Pasquali, Wei Cheng, Haoyu Wang, Yanchi Liu, Zhengzhang Chen, Haifeng Chen, Chris White, Quanquan Gu, Jian Pei, Carl Yang, and Liang Zhao. 2023. Domain specialization as the key to make large language models disruptive: A comprehensive survey. Preprint, arXiv:2305.18703.
  • Luo (2022) Hongyin Luo. 2022. Self-training for natural language processing. Ph.D. thesis, Massachusetts Institute of Technology.
  • Mangrulkar et al. (2022) Sourab Mangrulkar, Sylvain Gugger, Lysandre Debut, Younes Belkada, Sayak Paul, and Benjamin Bossan. 2022. Peft: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft.
  • Matena and Raffel (2022) Michael S Matena and Colin A Raffel. 2022. Merging models with fisher-weighted averaging. In Advances in Neural Information Processing Systems, volume 35, pages 17703–17716. Curran Associates, Inc.
  • Min et al. (2022) Sewon Min, Mike Lewis, Luke Zettlemoyer, and Hannaneh Hajishirzi. 2022. MetaICL: Learning to learn in context. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2791–2809, Seattle, United States. Association for Computational Linguistics.
  • Mishra et al. (2022) Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. 2022. Cross-task generalization via natural language crowdsourcing instructions. In ACL.
  • Mitra et al. (2023) Arindam Mitra, Luciano Del Corro, Shweti Mahajan, Andres Codas, Clarisse Simoes, Sahaj Agarwal, Xuxi Chen, Anastasia Razdaibiedina, Erik Jones, Kriti Aggarwal, Hamid Palangi, Guoqing Zheng, Corby Rosset, Hamed Khanpour, and Ahmed Awadallah. 2023. Orca 2: Teaching small language models how to reason. Preprint, arXiv:2311.11045.
  • Muqeeth et al. (2024) Mohammed Muqeeth, Haokun Liu, Yufan Liu, and Colin Raffel. 2024. Learning to route among specialized experts for zero-shot generalization. Preprint, arXiv:2402.05859.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  • Su et al. (2022) Hongjin Su, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A Smith, Luke Zettlemoyer, Tao Yu, et al. 2022. One embedder, any task: Instruction-finetuned text embeddings. arXiv preprint arXiv:2212.09741.
  • Sudalairaj et al. (2024) Shivchander Sudalairaj, Abhishek Bhandwaldar, Aldo Pareja, Kai Xu, David D. Cox, and Akash Srivastava. 2024. Lab: Large-scale alignment for chatbots. Preprint, arXiv:2403.01081.
  • Sukhbaatar et al. (2024) Sainbayar Sukhbaatar, Olga Golovneva, Vasu Sharma, Hu Xu, Xi Victoria Lin, Baptiste Rozière, Jacob Kahn, Daniel Li, Wen tau Yih, Jason Weston, and Xian Li. 2024. Branch-train-mix: Mixing expert llms into a mixture-of-experts llm. Preprint, arXiv:2403.07816.
  • Sun et al. (2023) Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, and Chuang Gan. 2023. Principle-driven self-alignment of language models from scratch with minimal human supervision. In Advances in Neural Information Processing Systems.
  • Suzgun et al. (2022) Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, and Jason Wei. 2022. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261.
  • Taori et al. (2023) Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Stanford alpaca: An instruction-following llama model. https://github.com/tatsu-lab/stanford_alpaca.
  • Team et al. (2024) Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, Pouya Tafti, Léonard Hussenot, Pier Giuseppe Sessa, Aakanksha Chowdhery, Adam Roberts, Aditya Barua, Alex Botev, Alex Castro-Ros, Ambrose Slone, Amélie Héliou, Andrea Tacchetti, Anna Bulanova, Antonia Paterson, Beth Tsai, Bobak Shahriari, Charline Le Lan, Christopher A. Choquette-Choo, Clément Crepy, Daniel Cer, Daphne Ippolito, David Reid, Elena Buchatskaya, Eric Ni, Eric Noland, Geng Yan, George Tucker, George-Christian Muraru, Grigory Rozhdestvenskiy, Henryk Michalewski, Ian Tenney, Ivan Grishchenko, Jacob Austin, James Keeling, Jane Labanowski, Jean-Baptiste Lespiau, Jeff Stanway, Jenny Brennan, Jeremy Chen, Johan Ferret, Justin Chiu, Justin Mao-Jones, Katherine Lee, Kathy Yu, Katie Millican, Lars Lowe Sjoesund, Lisa Lee, Lucas Dixon, Machel Reid, Maciej Mikuła, Mateo Wirth, Michael Sharman, Nikolai Chinaev, Nithum Thain, Olivier Bachem, Oscar Chang, Oscar Wahltinez, Paige Bailey, Paul Michel, Petko Yotov, Rahma Chaabouni, Ramona Comanescu, Reena Jana, Rohan Anil, Ross McIlroy, Ruibo Liu, Ryan Mullins, Samuel L Smith, Sebastian Borgeaud, Sertan Girgin, Sholto Douglas, Shree Pandya, Siamak Shakeri, Soham De, Ted Klimenko, Tom Hennigan, Vlad Feinberg, Wojciech Stokowiec, Yu hui Chen, Zafarali Ahmed, Zhitao Gong, Tris Warkentin, Ludovic Peran, Minh Giang, Clément Farabet, Oriol Vinyals, Jeff Dean, Koray Kavukcuoglu, Demis Hassabis, Zoubin Ghahramani, Douglas Eck, Joelle Barral, Fernando Pereira, Eli Collins, Armand Joulin, Noah Fiedel, Evan Senter, Alek Andreev, and Kathleen Kenealy. 2024. Gemma: Open models based on gemini research and technology. Preprint, arXiv:2403.08295.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. Preprint, arXiv:2307.09288.
  • Wan et al. (2024) Zhongwei Wan, Xin Wang, Che Liu, Samiul Alam, Yu Zheng, Jiachen Liu, Zhongnan Qu, Shen Yan, Yi Zhu, Quanlu Zhang, Mosharaf Chowdhury, and Mi Zhang. 2024. Efficient large language models: A survey. Transactions on Machine Learning Research. Survey Certification.
  • Wang et al. (2023) Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada. Association for Computational Linguistics.
  • Wei et al. (2022) Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V Le. 2022. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
  • Wortsman et al. (2022) Mitchell Wortsman, Gabriel Ilharco, Samir Ya Gadre, Rebecca Roelofs, Raphael Gontijo-Lopes, Ari S Morcos, Hongseok Namkoong, Ali Farhadi, Yair Carmon, Simon Kornblith, and Ludwig Schmidt. 2022. Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 23965–23998. PMLR.
  • Wu et al. (2023) Hongqiu Wu, Linfeng Liu, Hai Zhao, and Min Zhang. 2023. Empower nested Boolean logic via self-supervised curriculum learning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13731–13742, Singapore. Association for Computational Linguistics.
  • Wu et al. (2024) Xun Wu, Shaohan Huang, and Furu Wei. 2024. Mixture of LoRA experts. In The Twelfth International Conference on Learning Representations.
  • Yadav et al. (2023) Prateek Yadav, Derek Tam, Leshem Choshen, Colin Raffel, and Mohit Bansal. 2023. TIES-merging: Resolving interference when merging models. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Yu et al. (2024) Le Yu, Bowen Yu, Haiyang Yu, Fei Huang, and Yongbin Li. 2024. Language models are Super Mario: Absorbing abilities from homologous models as a free lunch. In International Conference on Machine Learning. PMLR.
  • Zaharia et al. (2024) Matei Zaharia, Omar Khattab, Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi. 2024. The shift from models to compound AI systems. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/.
  • Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, Florence, Italy. Association for Computational Linguistics.
  • Zhang et al. (2024) Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. 2024. When scaling meets LLM finetuning: The effect of data, model and finetuning method. In The Twelfth International Conference on Learning Representations.
  • Zhao et al. (2023) Haiyan Zhao, Hanjie Chen, Fan Yang, Ninghao Liu, Huiqi Deng, Hengyi Cai, Shuaiqiang Wang, Dawei Yin, and Mengnan Du. 2023. Explainability for large language models: A survey. arXiv preprint arXiv:2309.01029.

Category                      | Benchmark          | # Examples
Target
  Academic Knowledge          | MMLU (57 Tasks)    | 14,079
  Reasoning                   | BBH (27 Tasks)     | 6,511
  Math                        | GSM8K              | 8,790
  Coding                      | HumanEval          | 164
Non-Target (In-Expertise)
  Math                        | MATH               | 12,500
  Coding                      | MBPP               | 257
Non-Target (Out-of-Expertise)
  World Knowledge             | Natural Questions  | 3,610
  World Knowledge             | TriviaQA           | 17,200
  Commonsense                 | HellaSwag          | 10,000
  Commonsense                 | PIQA               | 3,000
  Safety                      | TruthfulQA         | 817

Table 6: Dataset statistics. Non-Target (In-Expertise) denotes benchmarks that are relevant to the target domains but for which MiXSE is not directly specialized with seed data. Non-Target (Out-of-Expertise) denotes benchmarks unrelated to the target domains.

Method | Compositional | Semantic Experts | Lightweight | Data & Resource-Efficient | w/o Teacher & Labels | Knowledge (MMLU 5-shot) | Reasoning (BBH) | Math (GSM8K) | Coding (HumanEval)
Base LLM
  Gemma 7B (Team et al., 2024) | ✗ | - | - | - | - | 65.7 | 56.1 | 42.5 | 34.1
  LLaMA-2 70B (Touvron et al., 2023) | ✗ | - | - | - | - | 68.9 | 51.2 | 35.2 | 29.9
  Mixtral 8x7B (Jiang et al., 2024) | ✓ | ✗ | ✗ | - | - | 70.6 | 67.1 | 65.7 | 32.3
Pre-training Methods
  Branch-Train-Merge (4x7B) (Li et al., 2022) | ✓ | ✓ | ✗ | ✗ | ✓ | 44.3 | - | 27.7 | 30.6
  Sparse Upcycling (4x7B) (Komatsuzaki et al., 2023) | ✓ | ✓ | ✗ | ✗ | ✓ | 52.1 | - | 40.1 | 26.2
  Branch-Train-Mix (4x7B) (Sukhbaatar et al., 2024) | ✓ | ✓ | ✗ | ✗ | ✓ | 52.5 | - | 37.1 | 28.7
MoE w/ LoRA
  PHATGOOSE (Muqeeth et al., 2024) | ✓ | ✓ | ✓ | ✗ | ✗ | - | 35.6 | - | -
  MOLE (Wu et al., 2024) | ✓ | ✓ | ✓ | ✗ | ✗ | - | 42.2 | - | -
Distillation/Synthetic Data from Larger Models
  GLAN 7B (w/ GPT-4) (Li et al., 2024a) | ✗ | - | - | ✗ | ✗ | 62.9 | 60.7 | 80.8 | 48.8
  Orca-2 7B (w/ GPT-4) (Mitra et al., 2023) | ✗ | - | - | ✗ | ✗ | 53.9 | 42.8 | 55.7 | 17.1
  Merlinite 7B (w/ Mixtral 8x7B) (Sudalairaj et al., 2024) | ✗ | - | - | ✗ | ✗ | 64.9 | - | 44.6 | -
Self-Improving
  MiXSE (Gemma 7B) | ✓ | ✓ | ✓ | ✓ | ✓ | 66.2 | 61.1 | 52.5 | 37.8

Table 7: Additional comparisons with other models for reference. Results are taken from the corresponding papers, except for the pre-training methods, whose numbers are all from BTX (Sukhbaatar et al., 2024).

Appendix A Experiment Details

We provide our self-specialization prompts for the knowledge, reasoning, math, and coding experts in Tables 9, 10, 11, and 12. We largely follow the prompt structure of Kang et al. (2023a) to ensure quality, adding the domain-specific instructions needed to convey task-related information.

For our evaluation, we employ popular evaluation frameworks to follow standard evaluation setups and protocols: HELM (Liang et al., 2023), LM Evaluation Harness (Gao et al., 2023), and BigCode Evaluation Harness (Ben Allal et al., 2022). We use Huggingface PEFT (Mangrulkar et al., 2022) and XLoRA (Buehler and Buehler, 2024) to implement an MoE compatible with LoRA.

For seed instructions, we sampled 100 training instances from each of the MMLU, BBH, and GSM8K datasets for the knowledge, reasoning, and math domains, respectively. For coding, because HumanEval is very small and provides no training set, we took 100 samples from the MBPP training set and converted them to the HumanEval task format.
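The text does not describe how MBPP samples were converted to the HumanEval format, so the following is only an illustrative sketch of one possible conversion. It assumes MBPP-style records with `text`, `code`, and `test_list` fields and produces HumanEval-style `prompt`, `canonical_solution`, and `entry_point` fields; the function name `mbpp_to_humaneval_style` and the example record are hypothetical.

```python
import re


def mbpp_to_humaneval_style(record):
    """Convert one MBPP-style record into a HumanEval-style record.

    Simplifying assumption: record["code"] holds a single top-level function
    with a one-line signature; imports or helpers before it are dropped.
    """
    code = record["code"]
    match = re.search(r"^def\s+(\w+)\s*\((.*?)\)\s*:", code, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no top-level function definition found")
    name, params = match.group(1), match.group(2)

    # Show the MBPP assertions as docstring examples inside the prompt.
    examples = []
    for test in record["test_list"]:
        expr = test[len("assert "):] if test.startswith("assert ") else test
        examples.append(f"    >>> {expr.strip()}")

    docstring_lines = [f'    """{record["text"]}'] + examples + ['    """']
    prompt = f"def {name}({params}):\n" + "\n".join(docstring_lines) + "\n"
    # Everything after the signature becomes the reference solution body.
    solution = code[match.end():].lstrip("\n")
    return {"prompt": prompt, "canonical_solution": solution, "entry_point": name}


# Hypothetical record, used only to exercise the sketch.
example = {
    "text": "Write a function to add two numbers.",
    "code": "def add(a, b):\n    return a + b",
    "test_list": ["assert add(1, 2) == 3"],
}
converted = mbpp_to_humaneval_style(example)
print(converted["prompt"] + converted["canonical_solution"])
```

The prompt and solution concatenate back into a complete function, which is the property a completion-style benchmark such as HumanEval relies on.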

During instruction generation, we use three randomly sampled seed examples as in-context demonstrations, with a temperature of 1 and top-p of 0.98; for response generation, we use five seed examples in context with greedy decoding. For specialization, we apply LoRA to all modules with a rank of 8 and an alpha of 16, and train with a learning rate of 3e-4 for 3 epochs with a batch size of 32. We train each module and MiXSE using the standard Alpaca (Taori et al., 2023) template on a single A100-80GB GPU, which takes only a few hours.
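As a compact summary of these settings, the sketch below collects the reported values in plain Python and formats one training example with the Alpaca template. It is not the authors' training code: the names `LoraTrainingConfig`, `sample_in_context_seeds`, and the placeholder seed pool are hypothetical, and the `alpha / rank` scaling is the standard LoRA convention rather than something stated in the text.

```python
import random
from dataclasses import dataclass

# Decoding settings reported in the text.
INSTRUCTION_GENERATION = {"num_seed_examples": 3, "temperature": 1.0, "top_p": 0.98}
RESPONSE_GENERATION = {"num_seed_examples": 5, "greedy": True}


@dataclass(frozen=True)
class LoraTrainingConfig:
    """Hypothetical container for the reported specialization settings."""
    rank: int = 8
    alpha: int = 16
    learning_rate: float = 3e-4
    num_epochs: int = 3
    batch_size: int = 32

    @property
    def scaling(self) -> float:
        # Standard LoRA scales the low-rank update by alpha / rank.
        return self.alpha / self.rank


# Standard Alpaca template (the variant with an input field).
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that appropriately completes "
    "the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)


def sample_in_context_seeds(seed_pool, k, rng):
    """Randomly pick k distinct seed examples for the generation prompt."""
    return rng.sample(seed_pool, k)


config = LoraTrainingConfig()
rng = random.Random(0)  # fixed seed, only to make the sketch reproducible
seed_pool = [f"seed instruction {i}" for i in range(100)]  # placeholder seeds
demos = sample_in_context_seeds(
    seed_pool, INSTRUCTION_GENERATION["num_seed_examples"], rng
)
prompt = ALPACA_TEMPLATE.format(
    instruction="Solve the problem.", input="What is 12 * 7?"
)
print(config.scaling, len(demos))
```

With a rank of 8 and an alpha of 16, the low-rank update is scaled by a factor of 2 under this convention.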

Appendix B Dataset Descriptions

The statistics for each dataset are provided in Table 6. All datasets are primarily in English, and the license information is as follows:

  • MMLU, GSM8K, HumanEval, MATH, HellaSwag: MIT license

  • BBH, Natural Questions, TriviaQA, TruthfulQA: Apache-2.0 license

  • MBPP: CC BY 4.0

  • PIQA: Academic Free License v3.0

We confirm that our usage of these datasets adheres to their intended purposes as defined by their licensing conditions. The datasets were only used for academic research and development.

Appendix C Additional Results

C.1 Additional Comparison and Discussion

In Table 7, we present additional comparisons with various other models and methods to provide a broader perspective, although these comparisons are not direct because the systems differ in parameter count, training resources, and other factors. We discuss some noteworthy points.

Notably, although MiXSE significantly improves upon its base model, Gemma 7B, it does not yet reach the performance of the more powerful Mixtral 8x7B. Mixtral also uses an MoE (Mixture of Experts) architecture, but unlike MiXSE, it does not prioritize lightweight experts, resulting in a much larger model with significantly more parameters. Moreover, while Mixtral's experts are implicitly built during pre-training, MiXSE explicitly creates semantic experts, allowing for targeted improvements and clearer interpretability. Importantly, our self-improving method could in principle be applied on top of any pre-trained model, including Mixtral.

Similarly, BTX (Branch-Train-MiX) uses a pre-training MoE strategy that employs parameter-heavy semantic experts, yielding substantial enhancements over the base LLM. This approach highlights the effectiveness of using semantically rich experts to refine the model's capabilities. In terms of efficiency, our model uses fewer parameters (7B) than BTX (12B active parameters, and considerably more in total) and requires only about 1 GPU day for training, compared to 900 GPU days for BTX. Because BTX is a pre-training method, albeit a specialized one, we expect it to be complementary to our Self-MoE, as suggested by previous work (Kang et al., 2023a).

In a shared spirit, MOLE and PHATGOOSE build an MoE from LoRA experts that are semantic and lightweight. However, their foundational assumptions differ significantly from ours: MOLE depends on human-labeled data, while PHATGOOSE requires access to pre-trained expert models developed externally. In contrast, our Self-MoE framework constructs both the experts and the router entirely from scratch, focusing on self-improvement without such dependencies. While their settings are reasonable in certain contexts, we aim for broader applicability by minimizing such assumptions.

Lastly, GLAN demonstrates outstanding performance across various domains. This is attributed to its reliance on distillation from a larger and stronger model, GPT-4, using a huge amount of data (e.g., 10 million). As outlined in our problem statement (Section 2), we deliberately avoid assuming the availability of such advanced models to ensure the broader applicability of our method, which self-improves from scratch. Consequently, while we acknowledge the value of each approach, direct comparisons may not be entirely appropriate given the fundamental differences in resource assumptions and initial conditions.

Benchmark            | Base LLM | Seed Only | MiXSE
Knowledge (MMLU)     | 58.3     | 57.4      | 65.6
Reasoning (BBH)      | 56.1     | 57.0      | 61.1
Math (GSM8K)         | 42.5     | 45.0      | 52.5
Coding (HumanEval)   | 34.1     | 34.1      | 37.8
Avg.                 | 47.8     | 48.4      | 54.3

Table 8: Results of MiXSE using only seed data.

C.2 MiXSE using Only Seed Data

Table 8 shows the results of MiXSE when only seed data is used for training, separating the benefit of our method from the mere inclusion of seed data in training. While Seed Only shows slight improvements over the Base LLM on some benchmarks, the substantial gains of our MiXSE across all benchmarks confirm that the enhanced capabilities of Self-MoE are not merely due to the use of seed data. This further supports that our method achieves self-improvement.
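The averages and per-benchmark gains behind this comparison can be recomputed directly from the Table 8 scores. The short sketch below does so; the helper names `average` and `gain_over_base` are illustrative and not part of the paper.

```python
# Per-benchmark scores copied from Table 8.
scores = {
    "Knowledge (MMLU)": {"Base LLM": 58.3, "Seed Only": 57.4, "MiXSE": 65.6},
    "Reasoning (BBH)": {"Base LLM": 56.1, "Seed Only": 57.0, "MiXSE": 61.1},
    "Math (GSM8K)": {"Base LLM": 42.5, "Seed Only": 45.0, "MiXSE": 52.5},
    "Coding (HumanEval)": {"Base LLM": 34.1, "Seed Only": 34.1, "MiXSE": 37.8},
}


def average(system):
    """Unweighted mean over the four target benchmarks."""
    return sum(row[system] for row in scores.values()) / len(scores)


def gain_over_base(system):
    """Absolute point difference from the Base LLM on each benchmark."""
    return {
        name: round(row[system] - row["Base LLM"], 1)
        for name, row in scores.items()
    }


for system in ("Base LLM", "Seed Only", "MiXSE"):
    print(f"{system}: average = {average(system):.3f}")
print("Seed Only gains:", gain_over_base("Seed Only"))
print("MiXSE gains:", gain_over_base("MiXSE"))
```

The unrounded averages (47.75, 48.375, and 54.25) agree with the reported 47.8, 48.4, and 54.3 up to rounding. Seed Only changes individual benchmarks by between -0.9 and +2.5 points, whereas MiXSE gains between 3.7 and 10.0 points on every benchmark.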

C.3 Validity of Comparative Results

To address concerns about the sensitivity of in-context learning (Min et al., 2022), we perform three runs with different sets of few-shot samples where applicable. The mean of the base LLM (Gemma-7B)'s average performance across domains is 47.9 with a standard deviation (SD) of 0.56, that of our MiXSE is 53.6 with an SD of 0.60, and that of instance merging is 51.6 with an SD of 0.87. A statistical analysis comparing MiXSE and instance merging yields a p-value of 0.03, indicating a statistically significant difference.
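The text does not say which statistical test was used. As a rough sanity check, the sketch below assumes a two-sample Student's t-test with equal variances and three runs per method, computed only from the rounded means and SDs reported above. The function names are illustrative, and the p-value is obtained by numerically integrating the t density with Simpson's rule so that only the standard library is needed.

```python
import math


def pooled_t_statistic(mean_a, sd_a, mean_b, sd_b, n):
    """Two-sample Student's t statistic for two groups of equal size n."""
    pooled_var = (sd_a ** 2 + sd_b ** 2) / 2
    standard_error = math.sqrt(pooled_var * 2 / n)
    return (mean_a - mean_b) / standard_error, 2 * n - 2


def t_pdf(x, dof):
    """Density of Student's t distribution with dof degrees of freedom."""
    coeff = math.gamma((dof + 1) / 2) / (
        math.sqrt(dof * math.pi) * math.gamma(dof / 2)
    )
    return coeff * (1 + x * x / dof) ** (-(dof + 1) / 2)


def two_sided_p_value(t_stat, dof, intervals=2000):
    """Two-sided p-value via Simpson's rule on the density over [0, |t|]."""
    upper = abs(t_stat)
    h = upper / intervals  # intervals must be even for Simpson's rule
    total = t_pdf(0.0, dof) + t_pdf(upper, dof)
    for i in range(1, intervals):
        total += (4 if i % 2 else 2) * t_pdf(i * h, dof)
    central_mass = 2 * (h / 3) * total  # P(-|t| < T < |t|)
    return 1 - central_mass


# Reported summary statistics: MiXSE vs. instance merging, three runs each.
t_stat, dof = pooled_t_statistic(53.6, 0.60, 51.6, 0.87, n=3)
p_value = two_sided_p_value(t_stat, dof)
print(f"t = {t_stat:.2f}, dof = {dof}, p = {p_value:.3f}")
```

Under these assumptions the test statistic is about 3.28 with 4 degrees of freedom, and the two-sided p-value is about 0.03, consistent with the reported value. A different test, or the unrounded per-run scores, could give a slightly different number.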

Instruction Brainstorming Prompt

You are asked to come up with a set of task instructions about diverse domains across STEM, humanities, social sciences, and others. These task instructions will be given to a language model and we will evaluate the model for completing the instructions.

Here are the requirements:
1. The type of task should be multiple-choice question answering. That is, a question along with multiple options (A, B, C, D) should be provided.
2. The language used for the instruction/question also should be diverse.
3. A language model should be able to complete the instruction. For example, do not ask the assistant to create any visual or audio output. For another example, do not ask the assistant to wake you up at 5pm or set a reminder because it cannot perform any action.
4. The instructions should be in English.
5. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
6. You should generate an appropriate input to the instruction. The input field should contain a specific example provided for the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging.
7. Ensure diverse domains are covered for extensive expert-level knowledge. The subjects may include Abstract Algebra, Anatomy, Astronomy, Business Ethics, Clinical Knowledge, College-level Biology, Chemistry, Computer Science, Mathematics, Medicine, Physics, Computer Security, Conceptual Physics, Econometrics, Electrical Engineering, Elementary Mathematics, Formal Logic, Global Facts, High School-level Biology, Chemistry, Computer Science, European History, Geography, Gov’t and Politics, Macroeconomics, Mathematics, Microeconomics, Physics, Psychology, Statistics, US History, World History, Human Aging, Human Sexuality, International Law, Jurisprudence, Logical Fallacies, Machine Learning, Management, Marketing, Medical Genetics, Miscellaneous, Moral Disputes, Moral Scenarios, Nutrition, Philosophy, Prehistory, Professional-level (Accounting, Law, Medicine, Psychology), Public Relations, Security Studies, Sociology, US Foreign Policy, Virology, World Religions, etc.

List of tasks:

Response Generation

You are a knowledgeable domain expert. Given an instruction and a question, generate the best answer to solve the given task about STEM, humanities, social sciences, and others.

Table 9: Prompts for knowledge-related instruction and response generation.

Instruction Brainstorming Prompt

You are asked to come up with a set of task instructions focusing on challenging tasks that require multi-step reasoning. These task instructions will be given to a language model and we will evaluate the model for completing the instructions.

Here are the requirements:
1. The type of task should be question answering, requiring multi-step reasoning.
2. The language used for the instruction/question also should be diverse.
3. The generated problem should have a single correct answer.
4. The instructions should be in English.
5. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
6. You should generate an appropriate input question to the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging.
7. Ensure diverse topics and levels are covered for extensive expert-level reasoning. The tasks may be about boolean expression, causal judgement, date understanding, disambiguation of question, closing Dyck-n words, formal fallacies, geometric shapes, hyperbaton, logical deduction of objects, movie recommendation, multi-step arithmetic problem, navigation, object counting, table reasoning, reasoning about colored objects, selecting one that ruins the name in an input, salient translation error detection, sarcastic sentence classification, sports understanding, temporal sequences, tracking shuffled objects, web of lies, word sorting, etc.

List of tasks:

Response Generation

You are a multi-step reasoning expert. Given an instruction and a challenging question, generate step-by-step reasoning and the answer.

Table 10: Prompts for reasoning-related instruction and response generation.

Instruction Brainstorming Prompt

You are asked to come up with a set of task instructions focusing on mathematical problems. These task instructions will be given to a language model and we will evaluate the model for completing the instructions.

Here are the requirements:
1. The type of task should be question answering, requiring multi-step reasoning.
2. The language used for the instruction/question also should be diverse.
3. The generated mathematical problem should have a solution.
4. The instructions should be in English.
5. The instructions should be 1 to 2 sentences long. Either an imperative sentence or a question is permitted.
6. You should generate an appropriate input question to the instruction. It should involve realistic data and should not contain simple placeholders. The input should provide substantial content to make the instruction challenging.
7. Ensure diverse topics and levels are covered for extensive expert-level reasoning. The subjects may include Algebra, Counting, Probability, Calculus, Statistics, Geometry, Linear Algebra, Number Theory and grade school math, etc.

List of tasks:

Response Generation

You are a math expert. Given an instruction and a mathematical question, generate step-by-step reasoning and the answer.

Table 11: Prompts for math-related instruction and response generation.

Instruction Brainstorming Prompt

You are asked to come up with a set of task instructions focusing on coding problems. These task instructions will be given to a language model and we will evaluate the model for completing the instructions.

Here are the requirements:
1. The type of task should be about coding problems, such as writing a python function given a specific instruction and test examples.
2. The language used for the instruction should be diverse, but the programming language should be python.
3. The generated problem should have a solution.
4. The instructions should be in English.
5. You should generate appropriate and correct test examples for the given problem.
6. Ensure diverse functions and levels are covered for extensive expert-level coding.

List of tasks:

Response Generation

You are a coding expert. Given an instruction and test cases, write a python function that passes the test cases.

Table 12: Prompts for coding-related instruction and response generation.