Sparse Diffusion Policy: A Sparse, Reusable, and Flexible Policy for Robot Learning

Yixiao Wang¹∗†, Yifei Zhang¹∗, Mingxiao Huo²∗†, Ran Tian¹, Xiang Zhang¹, Yichen Xie¹,
Chenfeng Xu¹, Pengliang Ji², Wei Zhan¹, Mingyu Ding¹‡, Masayoshi Tomizuka¹
¹UC Berkeley ²Carnegie Mellon University

Abstract

The increasing complexity of tasks in robotics demands efficient strategies for multitask and continual learning. Traditional models typically rely on a universal policy for all tasks, facing challenges such as high computational costs and catastrophic forgetting when learning new tasks. To address these issues, we introduce a sparse, reusable, and flexible policy, Sparse Diffusion Policy (SDP). By adopting Mixture of Experts (MoE) within a transformer-based diffusion policy, SDP selectively activates experts and skills, enabling efficient and task-specific learning without retraining the entire model. SDP not only reduces the burden of active parameters but also facilitates the seamless integration and reuse of experts across various tasks. Extensive experiments on diverse tasks in both simulation and the real world show that SDP 1) excels in multitask scenarios with negligible increases in active parameters, 2) prevents forgetting in continual learning of new tasks, and 3) enables efficient task transfer, offering a promising solution for advanced robotic applications. Demos and code can be found at https://forrest-110.github.io/sparse_diffusion_policy/.

∗Equal Contribution. †Project Lead. ‡Corresponding Author.

Keywords: Robot learning, Multitask and continual learning, Mixture of experts

1 Introduction

Generalist robots, capable of performing a wide range of tasks and continually learning new ones without losing previously acquired skills, are gaining substantial attention in both academia and industry [1, 2, 3, 4, 5, 6]. Traditional approaches often rely on a universal and monolithic policy [1, 2] for all tasks, activating all the parameters of a large network even for simple tasks like pushing. Besides, given the diverse nature and lifelong requirements of robot learning tasks [7, 8], these approaches typically require costly fine-tuning [9] when encountering a new task, which carries the risk of catastrophic forgetting of previously acquired skills. Task-specific adapters such as LoRA [10], in turn, expand the active parameters during inference. An alternative approach is to train separate policies for different tasks, though this requires independent, from-scratch training for each task and prevents knowledge transfer across tasks.

Recent works on skill discovery [5, 11, 12] and chain of skills [13, 14] show promise in addressing the above challenges. These methods necessitate meticulous design with knowledge guidance such as visual features [5, 15, 16, 17, 18] and language prompts [14, 19], to learn different skills for different tasks, with the goal of reusing these skills in unseen scenarios. However, their skill abstraction modules are typically not scalable and the network structure is not designed to be sparse for efficient computing. As a result, the influence of network structure has not yet been thoroughly explored. Recently, Mixture of Experts (MoE) [20] has proven successful in large-scale applications across NLP, vision, and multimodal domains [21, 22, 23]. It selectively activates only a subset of expert networks selected by a router, allows experts to be utilized across various tasks and over time, and facilitates the integration of additional networks while preserving the functionality of existing ones. This observation raises a natural question: Can the mere employment of a sparse, reusable, and flexible MoE structure overcome the challenge without the extensive integration of human-derived knowledge?

Motivated by the above observation, we introduce Sparse Diffusion Policy (SDP), as depicted in Figure 1, a framework for multitask and continual learning by exploring the potential of integrating MoE architecture within a transformer-based diffusion policy [24]. SDP offers several advantages: 1) Sparsity. Only a select set of skills is activated at one time, significantly enhancing computational efficiency during inference. 2) Reusability. Skills are systematically reused across different tasks, for example, “pick and place” is a common skill frequently utilized in robotic tasks. 3) Flexibility. Skills for new tasks can be merged or added to the existing skill pool, enabling their flexible use in future tasks. We conceptualize the experts in MoE as specialized skills and the router as a skill planner (as illustrated in Figure 1). Furthermore, we explore specific training and application strategies of MoE for robotic learning.

Our extensive experiments in both simulations and real-world settings demonstrate the effectiveness of SDP in multitask, continual, and transfer learning for robotic tasks. It achieves superior multitask performance, with only a 1% increase in active parameters compared to the single-task model. For continual learning, SDP maintains a higher success rate on new tasks without forgetting previously learned tasks, whereas the baseline model [10] requires activating over 62% of its parameters. Furthermore, we investigate the potential for task transfer to complex, long-horizon tasks using a very small pretrained model, initially trained on only two half-length tasks. By training a highly lightweight router (less than 0.4% of the total parameters), SDP outperforms models trained from scratch. Through experiments, we observe that the SDP is capable of extracting a broad range of skills through the combination of experts, and the router functions effectively as a skill planner.

Figure 1: Overview of Sparse Diffusion Policy (SDP). 1) Multitask Learning: SDP can simultaneously acquire experts from different human demonstration datasets. Due to its sparsity, SDP can activate different experts for different tasks. Additionally, with its reusability, SDP can activate the same expert to share knowledge among tasks. 2) Continual Learning: With its flexibility, SDP can transfer to new tasks by adding only a few new experts to learn the new tasks. This approach mitigates catastrophic forgetting by retaining the old experts and routers. 3) Task Transfer: Leveraging its reusability, SDP can transfer to new tasks by tuning the old experts and routers for expert selection. This allows SDP to acquire new skills based on the previously learned knowledge.

2 Related Work

2.1 Multitask and Continual Learning in Robotics

In the field of robot learning, significant advancements have been made in multitask [25, 26, 27, 28, 29, 30, 13, 31, 32, 2, 33, 34, 35, 36, 37] and continual learning [38, 39, 40, 41, 5, 42], allowing robots to efficiently acquire and retain multiple skills over time. Multitask learning approaches, such as policy distillation [43, 44, 45] and hierarchical reinforcement learning [46, 47, 48, 49], enable robots to learn and perform multiple tasks simultaneously by leveraging shared representations and decomposing complex tasks into manageable subtasks. However, these methods do not induce sparsity during policy learning, even though sparsity can enhance the efficiency of the policy network in multitask learning. Continual learning techniques, including regularization-based methods like Elastic Weight Consolidation (EWC) [50], memory-based strategies like experience replay, and architectural innovations such as Progressive Neural Networks (PNNs) [51], have been developed to mitigate catastrophic forgetting, allowing robots to incrementally acquire new skills while retaining previously learned ones. In addition, work on meta-learning [52, 53, 54] and few-shot learning [55, 56, 57, 58] gives robots the ability to quickly adapt to new tasks with minimal data. In contrast, the MoE structure naturally supports continual learning without forgetting old tasks, because previously learned experts can be kept fixed while new ones are added. This approach requires fewer additional techniques and can seamlessly integrate with multitask learning to create a dynamic task pool.

2.2 MoE in Computer Vision and Large Language Model

The Mixture of Experts (MoE) approach has seen significant advancements in both computer vision and large language models, offering a promising strategy to enhance model performance by leveraging specialized sub-models as “experts”. In computer vision, MoE frameworks have been employed for multitask learning and transfer learning [59, 22, 60, 61, 62], demonstrating their efficiency in handling diverse and complex tasks such as segmentation and image classification. Moreover, many works [63, 64, 65] have integrated MoE into the Transformer architecture, showing substantial gains in natural language processing tasks. These advancements underscore the potential of MoE systems to address the growing demands for computational efficiency [66, 67] and model accuracy [68, 69, 70] in both computer vision and language processing domains. This work focuses on leveraging the sparsity of MoE for multitask and continual learning in robot learning. We also make full use of the MoE module to explore efficient fine-tuning for task transfer.

3 Method

Our approach integrates Mixture of Experts (MoE) layers into a transformer-based diffusion policy network [24], combined with a specifically designed training and application strategy for multitask and continual learning in robotics. Owing to the network’s structural sparsity, we refer to our method as Sparse Diffusion Policy (SDP). In the following subsections, we first outline the problem formulation for multitask and continual imitation learning. We then discuss the integration of the MoE structure and explore how its sparsity, flexibility, and reusability can be specifically utilized for robot learning. Finally, we present the training strategies we have developed to further unleash its potential in the domain of robot learning.

3.1 Problem Formulation

We consider a set of robot tasks $\mathcal{C}=\{\mathcal{T}_{j}\}_{j=1}^{J}$. For task $j$, there are $N$ expert demonstrations $\{\tau_{j,i}\}_{i=1}^{N}$. Each demonstration $\tau_{j,i}$ is a sequence of state-action pairs. We formulate robot imitation learning as an action sequence prediction problem [24, 36], training a model to minimize the error in future actions conditioned on historical states. Specifically, for task $j$, imitation learning minimizes the behavior cloning loss $\mathcal{L}_{bc}^{j}$, formulated as

$$\mathcal{L}_{bc}^{j}=\mathbb{E}_{s_{t-o:t+h},a_{t-o:t+h}\sim\mathcal{T}_{j}}\left[\sum_{t=0}^{T}\mathcal{L}\left(\pi(a_{t:t+h}|s_{t-o+1:t},\mathcal{T}_{j};\boldsymbol{\theta}),a_{t:t+h}\right)\right], \tag{1}$$

where $a$ is the action, $s$ is the state, $h$ is the prediction horizon, $o$ is the number of historical steps, $\mathcal{L}$ is a supervised action prediction loss such as mean squared error or negative log-likelihood, $T$ is the length of the demonstrations, and $\boldsymbol{\theta}$ represents the learnable parameters of the network. In a multitask setting (to learn $\{\mathcal{T}_{j}\}_{j=1}^{J-1}$), the behavior cloning loss is given by $\mathcal{L}_{bc}=\sum_{j=1}^{J-1}\mathcal{L}_{bc}^{j}$. In the case of continual learning, when only task $J$ is to be learned, we have access exclusively to the corresponding demonstrations $\mathcal{T}_{J}$ in this learning cycle, and the behavior cloning loss is $\mathcal{L}_{bc}=\mathcal{L}_{bc}^{J}$.
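To make the windowing in Eq. (1) concrete, the following is a minimal sketch that evaluates a mean-squared-error behavior cloning loss on a single toy demonstration. The function name `bc_loss`, the `policy` callable, the scalar states and actions, the restriction to timesteps with a full history and a full action chunk, and the averaging over timesteps are all assumptions made for illustration; they are not taken from the authors' implementation.

```python
from typing import Callable, List, Sequence


def bc_loss(states: Sequence[float], actions: Sequence[float],
            policy: Callable[[Sequence[float], int], List[float]],
            o: int, h: int) -> float:
    """MSE between predicted and demonstrated action chunks, averaged over valid t."""
    T = len(states)
    total, count = 0.0, 0
    # Only use timesteps with o past states and h + 1 future actions available.
    for t in range(o - 1, T - h):
        history = states[t - o + 1: t + 1]   # s_{t-o+1:t}, inclusive
        target = actions[t: t + h + 1]       # a_{t:t+h}, inclusive
        pred = policy(history, h + 1)
        total += sum((p - a) ** 2 for p, a in zip(pred, target)) / (h + 1)
        count += 1
    return total / count


# Toy demonstration in which the action equals the state.
states = [0.0, 1.0, 2.0, 3.0, 4.0]
actions = list(states)


def repeat_last(history: Sequence[float], n: int) -> List[float]:
    """Hypothetical policy: repeat the last observed state n times."""
    return [history[-1]] * n


loss = bc_loss(states, actions, repeat_last, o=2, h=1)
print(loss)
```

In a diffusion policy, the per-step loss would typically be a denoising objective rather than a direct MSE on actions; the sketch only illustrates how history and prediction windows are sliced.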

3.2 Sparse Diffusion Policy with MoE layers

We utilize the transformer-based diffusion policy [24] and replace the Feed-Forward Networks (FFN) with MoE layers. For the $n$-th MoE layer [20], $n\in\{1,2,\dots,N\}$ (here $N$ denotes the number of MoE layers), there are $L$ experts $\{\mathcal{E}^{n}_{l}\}_{l=1}^{L}$ and one router $\mathcal{R}^{n}$. Each expert network $\mathcal{E}^{n}_{l}$ is composed of multilayer perceptrons (MLP). The router $\mathcal{R}^{n}$ compares the input $x\in\mathbb{R}^{1\times M}$ with the expert embeddings $W^{n}\in\mathbb{R}^{M\times L}$ to obtain a score for each expert. Only the Top-K expert networks are activated for inference. Specifically, the MoE layer output $y$ is derived as

$$y=\sum_{l=1}^{L}\mathcal{R}^{n}(x,l)\,\mathcal{E}_{l}^{n}(x),\quad\mathcal{R}^{n}(x,l)=\operatorname{Top-K}(\operatorname{Softmax}(xW^{n}),l), \tag{2}$$

where $\operatorname{Top-K}(v,l)$ is the $l$-th element of vector $v$ if it is among the $K$ largest elements of $v$, and $0$ otherwise.
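The routing rule in Eq. (2) can be sketched in a few lines of plain Python. This is an illustrative toy under stated assumptions: the experts are hypothetical scaling functions standing in for MLPs, the helper names (`softmax`, `top_k_gate`, `moe_layer`, `make_scaling_expert`) are invented here, and the gate values are not renormalized after Top-K because Eq. (2) does not state a renormalization.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def softmax(z: Vector) -> Vector:
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_gate(scores: Vector, k: int) -> Vector:
    """Keep the k largest scores and zero the rest (no renormalization)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [s if i in keep else 0.0 for i, s in enumerate(scores)]


def moe_layer(x: Vector, W: List[Vector],
              experts: List[Callable[[Vector], Vector]],
              k: int) -> Tuple[Vector, Vector]:
    """Return (y, gates) with y = sum_l R(x, l) * E_l(x); W has shape M x L."""
    logits = [sum(x[m] * W[m][l] for m in range(len(x)))
              for l in range(len(experts))]
    gates = top_k_gate(softmax(logits), k)
    y = [0.0] * len(x)
    for l, g in enumerate(gates):
        if g > 0.0:  # experts outside the Top-K are never evaluated
            y = [yi + g * oi for yi, oi in zip(y, experts[l](x))]
    return y, gates


def make_scaling_expert(c: float) -> Callable[[Vector], Vector]:
    """Toy stand-in for an MLP expert: multiplies the input by a constant."""
    return lambda v: [c * vi for vi in v]


experts = [make_scaling_expert(c) for c in (1.0, 2.0, 3.0)]
W = [[2.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]  # M = 2 input features, L = 3 experts
y, gates = moe_layer([1.0, 0.0], W, experts, k=1)
print(y, gates)
```

With `k=1`, only the expert with the highest router score contributes to the output, so the cost of the layer depends on `k` rather than on the total number of experts.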

Multitask Learning. Since different tasks require distinct experts for completion, we assign a task-specific router $\mathcal{R}_{j}^{n}$ to enable different tasks to select different experts based on the same historical states. As depicted in Figure 2, the same experts can be reused at different times within the same task and across various tasks, facilitated by the task-specific router and the time-varying state. On the other hand, state-specific and task-specific experts can also be utilized and learned. More importantly, activating only a limited number of experts keeps computation efficient.
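As a small illustration of task-specific routing, the sketch below applies two hypothetical router matrices to the same state and reports which experts each task selects. The task names, matrix values, and the helper `select_experts` are assumptions chosen so that the two tasks share one expert and each also uses one expert of its own.

```python
from typing import Dict, List


def select_experts(x: List[float], W: List[List[float]], k: int) -> List[int]:
    """Indices of the Top-K experts, returned in ascending order.

    Softmax is monotonic, so ranking raw logits yields the same Top-K set.
    """
    num_experts = len(W[0])
    logits = [sum(xm * row[l] for xm, row in zip(x, W))
              for l in range(num_experts)]
    ranked = sorted(range(num_experts), key=lambda l: logits[l], reverse=True)
    return sorted(ranked[:k])


# One router (an M x L matrix) per task; all experts are shared between tasks.
routers: Dict[str, List[List[float]]] = {
    "coffee": [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    "mug": [[0.0, 1.0, 0.0], [0.0, 0.0, 4.0]],
}
state = [1.0, 0.5]
selected = {task: select_experts(state, W, k=2) for task, W in routers.items()}
shared = set(selected["coffee"]) & set(selected["mug"])
print(selected, shared)
```

Here the same state activates experts 0 and 1 for one task and experts 1 and 2 for the other, so expert 1 is reused across tasks while the remaining experts are task-specific.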

Continual Learning. MoE layers facilitate straightforward model expansion and support continual learning [61]. Specifically, for each new task, we freeze the previously learned experts and routers, integrate new trainable experts into each MoE layer and train the corresponding task-specific routers (See Figure 2). Catastrophic forgetting is mitigated by fixing previously learned parameters, thus enabling lifelong learning. Upon mastering the new task, the experts related to it are abstracted and become reusable in subsequent learning processes. Moreover, the computational cost remains constant despite the continuous integration of new tasks.
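The freeze-and-expand procedure described above can be sketched with simple bookkeeping objects. The class names, the parameter counts, and the number of experts added per task are hypothetical; router parameters are ignored, and all experts are assumed to have the same size so that the active expert parameters depend only on K.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Expert:
    num_params: int
    trainable: bool = True


@dataclass
class MoELayer:
    experts: List[Expert] = field(default_factory=list)
    router_trainable: Dict[str, bool] = field(default_factory=dict)

    def add_task(self, task: str, num_new_experts: int,
                 params_per_expert: int) -> None:
        """Freeze everything learned so far, then add new experts and a router."""
        for expert in self.experts:
            expert.trainable = False
        for name in self.router_trainable:
            self.router_trainable[name] = False
        self.experts.extend(Expert(params_per_expert)
                            for _ in range(num_new_experts))
        self.router_trainable[task] = True

    def trainable_expert_params(self) -> int:
        return sum(e.num_params for e in self.experts if e.trainable)

    def total_expert_params(self) -> int:
        return sum(e.num_params for e in self.experts)

    def active_expert_params(self, k: int) -> int:
        """Expert parameters used per forward pass with Top-K routing."""
        return k * self.experts[0].num_params


layer = MoELayer()
history = []
for task, n_new in [("can", 4), ("lift", 2), ("square", 2)]:
    layer.add_task(task, num_new_experts=n_new, params_per_expert=1000)
    history.append((task, layer.total_expert_params(),
                    layer.active_expert_params(k=2)))
print(history)
```

In this toy setting the total number of expert parameters grows with every task, while the number used per forward pass stays fixed, which mirrors the constant computational cost stated in the text.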

Intuitive Interpretation and Task Transfer. Intuitively, each unique combination of the Top-K experts within every layer of a Mixture of Experts (MoE) architecture (refer to Figure 2) represents a distinct “skill”. The routing mechanism acts as a planner for skill chains, selecting specific experts to assemble a skill. In this structure, the number of potential combinations of experts across $N$ layers is given by $(\frac{L!}{(L-K)!})^{N}$. This formula suggests the capacity to cover a broad spectrum of diverse skills using a finite set of experts and layers. It also benefits continual learning. For instance, with parameters $L=4,N=2,K=1$, adding one expert per layer (two in total) increases the number of combinations from $16$ to $25$, generating $9$ new combinations. More promisingly, when confronted with an unseen long-horizon task, the inherent generalizability of MoE allows for significant flexibility. Even with only a few previously trained tasks, the broad coverage of diverse skills by combinations of experts makes it possible to train a new, lightweight router (less than 0.5% of the parameters, as discussed in Section 4.3) and perform well in a novel and even long-horizon task (denoted as $\mathcal{T}_{J}$ for convenience, as in continual learning), enabling fast task transfer (see Figure 2).
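The counting argument can be checked directly. The short sketch below evaluates the expression $(\frac{L!}{(L-K)!})^{N}$ with Python's `math.perm`; the function name `num_skill_combinations` is an assumption introduced here.

```python
import math


def num_skill_combinations(L: int, K: int, N: int) -> int:
    """Evaluate (L! / (L - K)!) ** N, the count used in the text."""
    return math.perm(L, K) ** N


before = num_skill_combinations(L=4, K=1, N=2)  # 4 ** 2
after = num_skill_combinations(L=5, K=1, N=2)   # 5 ** 2
new_combinations = after - before
print(before, after, new_combinations)
```

The difference between the two counts reproduces the nine new combinations mentioned in the example with $L=4,N=2,K=1$.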

3.3 Training Objective

Directly minimizing the behavior cloning loss $\mathcal{L}_{bc}$ often leads the router to favor certain experts, a preference that is reinforced because the favored experts then receive more training [71, 20]. Previous studies [20, 63, 72, 73] have incorporated an auxiliary load-balancing loss to prevent the overloading of any specific expert. However, in robot learning, certain skills are consistently utilized across various tasks, suggesting that some experts should be frequently engaged. For instance, pick-and-place skills are commonly required in numerous robotic tasks, resulting in frequent activation of the combination of experts related to the pick-and-place skill. Consequently, expert overload is typical in multitask robot learning, whereas task overload is atypical due to the distinct nature of each task. Each task includes specific components that necessitate unique skills for successful completion. Inspired by this observation, we propose encouraging experts to specialize in specific tasks [22, 61] to prevent task overload. Specifically, we would like to maximize the mutual information between the task $\mathcal{T}$ and the expert $\mathcal{E}^{n}$ for each MoE layer:

$$I(\mathcal{T},\mathcal{E}^{n})=\sum_{j=1}^{J}\sum_{l=1}^{L}p(\mathcal{T}_{j},\mathcal{E}_{l}^{n})\log\frac{p(\mathcal{T}_{j},\mathcal{E}_{l}^{n})}{p(\mathcal{T}_{j})p(\mathcal{E}_{l}^{n})}, \tag{3}$$

where we assume that each task is equally important, i.e., $p(\mathcal{T}_{j})=\frac{1}{J}$ . Thus, the total training objective is

$$\mathcal{L}=\mathcal{L}_{bc}-\gamma\sum_{n=1}^{N}I(\mathcal{T},\mathcal{E}^{n}), \tag{4}$$

where $\mathcal{L}_{bc}=\sum_{j=1}^{J-1}\mathcal{L}_{bc}^{j}$ in multitask learning, and $\mathcal{L}_{bc}=\mathcal{L}_{bc}^{J}$ in continual learning and task transfer.
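The mutual information term in Eq. (3) and the combined objective in Eq. (4) can be evaluated numerically as sketched below. The sketch assumes that the conditional routing distribution $p(\mathcal{E}_{l}^{n}\mid\mathcal{T}_{j})$ is already available as a row-normalized table (for example, estimated from router outputs over a batch, which is an assumption rather than a detail stated in the text), and uses the uniform task prior $p(\mathcal{T}_{j})=\frac{1}{J}$. The function names and the toy values are hypothetical.

```python
import math
from typing import List


def mutual_information(p_expert_given_task: List[List[float]]) -> float:
    """I(T; E) for one MoE layer, with a uniform prior over the J tasks.

    p_expert_given_task[j][l] = p(E_l | T_j); each row must sum to 1.
    """
    J = len(p_expert_given_task)
    L = len(p_expert_given_task[0])
    p_task = 1.0 / J
    p_expert = [sum(row[l] for row in p_expert_given_task) * p_task
                for l in range(L)]
    mi = 0.0
    for row in p_expert_given_task:
        for l, p_cond in enumerate(row):
            joint = p_cond * p_task
            if joint > 0.0:  # terms with zero probability contribute nothing
                mi += joint * math.log(joint / (p_task * p_expert[l]))
    return mi


def total_objective(bc_loss: float, per_layer_mi: List[float],
                    gamma: float) -> float:
    """L = L_bc - gamma * sum_n I(T; E^n), as in Eq. (4)."""
    return bc_loss - gamma * sum(per_layer_mi)


specialized = [[1.0, 0.0], [0.0, 1.0]]  # each task uses its own expert
shared = [[0.5, 0.5], [0.5, 0.5]]       # routing ignores the task
mi_specialized = mutual_information(specialized)
mi_shared = mutual_information(shared)
loss = total_objective(1.0, [mi_specialized, mi_shared], gamma=0.1)
print(mi_specialized, mi_shared, loss)
```

The mutual information is largest when each task routes to its own experts and vanishes when routing is independent of the task, so subtracting it from the behavior cloning loss rewards task-specific expert usage.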

4 Experimental Results

4.1 Multitask Learning

In this section, we discuss the sparsity and reusability of our SDP, highlighting its ability to simultaneously acquire multiple experts efficiently while sharing experts across tasks. As shown in Figure 4, we conducted three multitask learning experiments across different settings: a 2D vision-based simulation, a 3D point cloud-based simulation, and real-robot experiments.

2D Simulation Results. We evaluate the performance of SDP in MimicGen [74]. MimicGen includes 1K-10K human demonstrations per task with broad initial state distributions, making it well suited to evaluating generalization in the multitask setting. To our knowledge, this is the first work to explore multitask training on the MimicGen benchmark. As baselines, we choose task-conditioned diffusion (TCD) [76, 19], fine-tuning the last action prediction head (TH), and fine-tuning the last three diffusion policy transformer layers (TT w/ 3 layers). All three baseline models require the activation of all network parameters, exhibiting a dense structure rather than a sparse one (Ours).

Each model is trained for 300 epochs using one A6000 GPU for 130-150 hours and evaluated every 50 epochs. We report the average success rate of the best three checkpoints in Table 1. Our SDP achieves the highest average success rate and the best result on five of the eight individual tasks, with a comparable number of active parameters in the policy network, thanks to the sparsity of our framework. Furthermore, we illustrate a task-expert relationship map in Figure 4. This visualization shows that each task activates only a subset of the experts within the Mixture of Experts (MoE) model, contributing to computational efficiency. More importantly, the map reveals that different tasks can activate the same expert, demonstrating the reusability of experts across various tasks.
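The reporting protocol (average success rate of the best three checkpoints) amounts to the small computation sketched below. The function name `best_k_average` and the six success rates, one per evaluation at epochs 50 through 300, are hypothetical values for illustration and are not results from the paper.

```python
from typing import Sequence


def best_k_average(success_rates: Sequence[float], k: int = 3) -> float:
    """Average of the k highest checkpoint success rates."""
    top = sorted(success_rates, reverse=True)[:k]
    return sum(top) / len(top)


# Hypothetical success rates at epochs 50, 100, 150, 200, 250, 300.
rates = [0.40, 0.62, 0.70, 0.74, 0.78, 0.76]
score = best_k_average(rates)
print(round(score, 2))
```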

3D Simulation Results. We evaluate our approach in a 3D point cloud-based robot learning simulation [75]. Specifically, we integrate the Mixture of Experts (MoE) into the Feed-Forward Network (FFN) blocks of the 3D diffusion policy network [75]. The number of parameters for the active experts is set to be equivalent to that of the original network. We select the toilet, faucet, and laptop tasks from the DexArt benchmark [77], with only 20-40 human demonstrations and point cloud inputs. Due to the limited number of human demonstrations, we extend the training process to 6000 epochs and evaluate the results every 200 epochs. As in the 2D simulation, we report the average success rate of the best three checkpoints. As shown in Table 2, our method achieves the highest average success rate among the compared methods, demonstrating the effectiveness of our SDP in multitask 3D perception settings.

Real Robot Experiment Results. To further evaluate the effectiveness of our SDP framework, we conduct real robot experiments on a FANUC LRMate 200iD/7L robotic arm outfitted with an SMC gripper. We choose three diverse and universal tasks for evaluation: pulling a circle, picking and placing a cup, and hanging a cup. For the first two tasks, we collect 20 demonstrations each, and for the last task, we collect 40 demonstrations with two distinct hanging sticks. More details can be found in Appendix A.2. The robot is manipulated using admittance control [78], which achieves compliant robot motion to ensure safety during manipulation. We use TCD [76, 19] as our baseline, and both models are trained for 2000 epochs. As shown in Table 3, our method greatly outperforms TCD. Additionally, we found that TCD cannot clearly distinguish the multimodal action distributions across different tasks due to the lack of sparsity, leading to model collapse. More details and visualizations can be found in Appendix A.3.

Table 1: Multitask evaluation in 2D simulation, MimicGen [74]. We report the average success rate of the best three checkpoints, as well as the active parameters during inference. We highlight our results in light-blue.

Method	Active Params	Square	Stack	Coffee	Hammer	Mug	Nut	Stack three	Thread	Avg.
TH	52.6 M	0.76	0.98	0.72	0.97	0.63	0.52	0.73	0.55	0.73
TT w/ 3 layers	52.6 M	0.73	0.95	0.76	0.99	0.65	0.49	0.68	0.59	0.73
TCD [76, 19]	52.7 M	0.63	0.95	0.77	0.92	0.53	0.44	0.62	0.56	0.68
SDP(Ours)	53.3 M	0.74	0.99	0.83	0.98	0.70	0.42	0.76	0.65	0.76

Table 2: Multitask evaluation on 3D simulation [75].

Method	Toilet	Faucet	Laptop	Avg.
TT w/ 1 layer	0.73	0.35	0.85	0.64
TCD [76, 19]	0.72	0.33	0.80	0.62
SDP(Ours)	0.75	0.43	0.82	0.67

Table 3: Multitask evaluation on real robot.

Method	Pull	Pick Place	Hang	Avg.
TCD [76, 19]	1.0	0.0	0.35	0.45
SDP(Ours)	1.0	1.0	0.95	0.98

4.2 Continual Learning

In this section, we evaluate the performance of the Sparse Diffusion Policy (SDP) in continual learning, specifically focusing on the flexibility of our framework. More specifically, when integrating a new task, we introduce a small, fixed number of new experts and a new router into the Mixture of Experts (MoE) structure, while freezing the existing experts, routers, and other modules (excluding the MoE). We then train only the newly added router and experts, which enhances the efficiency of learning new tasks and avoids catastrophic forgetting [79] of previous tasks. We choose three tasks from robomimic [80] for evaluation: Can (picking and placing a can), Lift (lifting a cube), and Square (inserting a square peg into the correct post). Initially, we train our model on Can (Stage 1), then sequentially transfer to Lift (Stage 2) and Square (Stage 3). At each stage, we evaluate the performance of the current task as well as the previously learned tasks.

Comparison with baselines. First, we compare our approach with LoRA [10] and a fully fine-tuned (FFT) version of our method. Each method is trained for 500 epochs and evaluated every 50 epochs. We report the average success rate of the best three checkpoints. The results, presented in Table 4, demonstrate that our method generally exhibits superior performance on both new and previous tasks. Conversely, the FFT method shows significant forgetting of previously learned tasks upon learning new ones, while the LoRA method struggles to improve performance on the new task, which is particularly evident in the Square task. Moreover, the active parameters of our model remain constant during continual learning, whereas those of LoRA grow as tasks are added. These findings underscore the flexibility of our method in supporting efficient and effective continual learning.
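The parameter comparison can be made explicit with a short calculation on the active-parameter (AP) values reported in Table 4. The variable names are assumptions, and because the table values are rounded to one decimal place, the resulting percentages are approximate.

```python
# Active parameters (in millions) per stage, copied from Table 4.
lora_active = {"Stage 1": 9.0, "Stage 2": 12.0, "Stage 3": 14.9}
sdp_active = {"Stage 1": 9.2, "Stage 2": 9.2, "Stage 3": 9.2}

# Extra active parameters of LoRA relative to SDP, in percent.
extra_pct = {
    stage: 100.0 * (lora_active[stage] - sdp_active[stage]) / sdp_active[stage]
    for stage in lora_active
}
for stage, pct in extra_pct.items():
    print(f"{stage}: {pct:+.1f}%")
```

By Stage 3, LoRA activates roughly 62% more parameters than SDP in this comparison, while SDP stays at 9.2 M throughout.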

Ablation study on continual learning. We further conducted ablation studies in the context of continual learning, presenting our results on the Square task (Stage 3). This task was chosen because it is both the most challenging and the final task, so it is the one most exposed to the potential negative influence of previously learned tasks. The results, presented in Figure 5, highlight the impact of various components of our method on the success rate. The baseline model (Ours) achieves the highest success rate. The Ours-MI_Loss variant, which excludes the Mutual Information loss, shows a drop in performance. We argue that the router tends to prefer the old experts, since newly added experts that have not yet learned anything output random actions, and this in turn reinforces the preference for the old frozen experts. In contrast, the Mutual Information loss compels the router to select task-specific experts, thereby avoiding the negative effects associated with the old experts. These results underscore the importance of using a comprehensive set of MoEs and the Mutual Information loss for achieving optimal performance in complex continual learning scenarios.

Table 4: Evaluation on continual learning. Comparison of different policy decoders. AP denotes Active Parameters of the policy network. Grey blocks indicate performance on new tasks; light-blue blocks indicate performance on previous tasks.

Policy Decoder	Stage 1		Stage 2			Stage 3
Policy Decoder	Can	AP	Can	Lift	AP	Can	Lift	Square	AP
FFT	0.97	9.0 M	0.00	1.00	9.0 M	0.00	0.00	0.89	9.0 M
LoRA	0.94	9.0 M	0.94	1.00	12.0 M	0.94	1.00	0.73	14.9 M
MoE (Ours)	0.96	9.2 M	0.94	1.00	9.2 M	0.94	1.00	0.75	9.2 M

4.3 Task Transfer

In this section, we test the reusability of the SDP, highlighting its ability to transfer to new tasks while retaining accumulated old skills. We first set up a novel experiment that rapidly transfers to a new task using only the old tasks. Then, we visualize the horizon-expert map to observe the role of old skills in transferring to new skills.

Novel task transfer experiment. To test the reusability of our model, we include a new task for transfer and use base tasks to learn the base policy. We first select the Coffee and Mug Cleanup tasks from the MimicGen dataset [74] as our base tasks. For the new task, we choose Coffee Preparation from the MimicGen dataset, which is a long-horizon and partially unseen task requiring skills from the base tasks. Notably, Coffee Preparation includes picking coffee from the drawer, whereas the base tasks involve picking coffee from the table and placing it in the drawer. This indicates that Coffee Preparation requires skills from the base tasks but not the exact same actions.

Table 5: Evaluation on the Coffee Preparation task. We pretrain our model on Coffee and Mug Cleanup, then quickly transfer it to the Coffee Preparation task. Trainable Params is the number of trainable parameters in the policy network.

Method	Trainable Params	Coffee Preparation
Scratch	25.9M	0.70
Router only	0.1M	0.80

We pretrain our model on the base tasks, and for the new policy, we only fine-tune the router in the MoE and unfreeze the vision encoder for only 100 epochs on the Coffee Preparation task. We compare our model to a model trained directly on the Coffee Preparation task from scratch for 100 epochs. The results are shown in Table 5. For the policy network, our approach trains less than 0.5% of the parameters used by the train-from-scratch method and achieves better results, indicating that the skills reused from the two base tasks are highly useful.
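The parameter fraction quoted above follows from the Trainable Params column of Table 5, as the short calculation below shows. The variable names are assumptions, and because the table values are rounded to 0.1 M, the resulting fraction is approximate.

```python
# Trainable policy-network parameters (in millions), copied from Table 5.
router_only_params_m = 0.1
scratch_params_m = 25.9

fraction = router_only_params_m / scratch_params_m
print(f"{100.0 * fraction:.2f}% of the from-scratch trainable parameters")
```

The ratio comes out slightly below 0.4%, which is consistent with the "less than 0.5%" statement for the policy network.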

Skills Visualisation. As shown in Figure 6, we visualize the horizon-expert map, which indicates the activated experts at different horizon timesteps. After tuning only the router, we find that the most frequently activated experts have already been learned in the base tasks. This verifies the reusability of the SDP model, which can cover the old skills through the learned experts. In this setting, the experts become a pool of skills and the router acts as a chain planner, so SDP can acquire new skills rapidly.

5 Discussion and Limitation

In this paper, we have introduced the Sparse Diffusion Policy (SDP) framework, which integrates Mixture of Experts (MoE) layers into the diffusion policy. Our approach is designed around three key principles: sparsity, flexibility, and reusability. By activating only relevant portions of the network for specific tasks, SDP induces sparsity for efficient multitask learning. Its flexibility allows it to acquire new tasks without forgetting existing skills, while the reusability of existing knowledge allows for efficient multitask and task transfer learning. Our model has great potential for future large-scale robot learning.

Limitation. Our SDP may fail if the shared knowledge in the network is too limited but the same experts are activated by different routers. Additionally, the router in SDP is task-specific, which hinders universal task completion. In the future, combining SDP with large language models to obtain a broader, language-conditioned policy could extend it to a wider range of robot learning scenarios.


References

[1] A. Padalkar, A. Pooley, A. Jain, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Singh, A. Brohan, et al. Open x-embodiment: Robotic learning datasets and rt-x models. arXiv preprint arXiv:2310.08864, 2023.
[2] M. Shridhar, L. Manuelli, and D. Fox. Perceiver-actor: A multi-task transformer for robotic manipulation. In Conference on Robot Learning, pages 785–799. PMLR, 2023.
[3] T. Lesort, V. Lomonaco, A. Stoian, D. Maltoni, D. Filliat, and N. Díaz-Rodríguez. Continual learning for robotics: Definition, framework, learning strategies, opportunities and challenges. Information fusion, 58:52–68, 2020.
[4] Z. Liu, J. Zhang, K. Asadi, Y. Liu, D. Zhao, S. Sabach, and R. Fakoor. Tail: Task-specific adapters for imitation learning with large pretrained models. arXiv preprint arXiv:2310.05905, 2023.
[5] W. Wan, Y. Zhu, R. Shah, and Y. Zhu. Lotus: Continual imitation learning for robot manipulation through unsupervised skill discovery. arXiv preprint arXiv:2311.02058, 2023.
[6] Z. Liang, Y. Mu, M. Ding, F. Ni, M. Tomizuka, and P. Luo. Adaptdiffuser: Diffusion models as adaptive self-evolving planners. arXiv preprint arXiv:2302.01877, 2023.
[7] A. Gupta, C. Lynch, B. Kinman, G. Peake, S. Levine, and K. Hausman. Demonstration-bootstrapped autonomous practicing via multi-task reinforcement learning, 2022.
[8] Z. Liu, Z. Guo, Y. Yao, Z. Cen, W. Yu, T. Zhang, and D. Zhao. Constrained decision transformer for offline safe reinforcement learning. In International Conference on Machine Learning, pages 21611–21630. PMLR, 2023.
[9] Y. Yao, Z. Liu, Z. Cen, J. Zhu, W. Yu, T. Zhang, and D. Zhao. Constraint-conditioned policy optimization for versatile safe reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024.
[10] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
[11] S. Park, O. Rybkin, and S. Levine. Metra: Scalable unsupervised rl with metric-aware abstraction. arXiv preprint arXiv:2310.08887, 2023.
[12] Z. Ju, C. Yang, F. Sun, H. Wang, and Y. Qiao. Rethinking mutual information for language conditioned skill discovery on imitation learning. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 34, pages 301–309, 2024.
[13] Z. Xian, N. Gkanatsios, T. Gervet, T.-W. Ke, and K. Fragkiadaki. Chaineddiffuser: Unifying trajectory diffusion and keypose prediction for robotic manipulation. In 7th Annual Conference on Robot Learning, 2023.
[14] J. Zhang, J. Zhang, K. Pertsch, Z. Liu, X. Ren, M. Chang, S.-H. Sun, and J. J. Lim. Bootstrap your own skills: Learning to solve new tasks with large language model guidance. arXiv preprint arXiv:2310.10021, 2023.
[15] M. Huo, M. Ding, C. Xu, T. Tian, X. Zhu, Y. Mu, L. Sun, M. Tomizuka, and W. Zhan. Human-oriented representation learning for robotic manipulation. arXiv preprint arXiv:2310.03023, 2023.
[16] S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta. R3m: A universal visual representation for robot manipulation. arXiv preprint arXiv:2203.12601, 2022.
[17] T. Xiao, I. Radosavovic, T. Darrell, and J. Malik. Masked visual pre-training for motor control. arXiv preprint arXiv:2203.06173, 2022.
[18] I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell. Real-world robot learning with masked visual pre-training. In Conference on Robot Learning, pages 416–426. PMLR, 2023.
[19] Z. Liang, Y. Mu, H. Ma, M. Tomizuka, M. Ding, and P. Luo. Skilldiffuser: Interpretable hierarchical planning via skill abstractions in diffusion-based task execution. arXiv preprint arXiv:2312.11598, 2023.
[20] N. Shazeer, A. Mirhoseini, K. Maziarz, A. Davis, Q. Le, G. Hinton, and J. Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations, 2016.
[21] S. Sukhbaatar, O. Golovneva, V. Sharma, H. Xu, X. V. Lin, B. Rozière, J. Kahn, D. Li, W.-t. Yih, J. Weston, et al. Branch-train-mix: Mixing expert llms into a mixture-of-experts llm. arXiv preprint arXiv:2403.07816, 2024.
[22] Z. Chen, Y. Shen, M. Ding, Z. Chen, H. Zhao, E. G. Learned-Miller, and C. Gan. Mod-squad: Designing mixtures of experts as modular multi-task learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11828–11837, 2023.
[23] Y. Li, S. Jiang, B. Hu, L. Wang, W. Zhong, W. Luo, L. Ma, and M. Zhang. Uni-moe: Scaling unified multimodal llms with mixture of experts. arXiv preprint arXiv:2405.11273, 2024.
[24] C. Chi, S. Feng, Y. Du, Z. Xu, E. Cousineau, B. Burchfiel, and S. Song. Diffusion policy: Visuomotor policy learning via action diffusion. arXiv preprint arXiv:2303.04137, 2023.
[25] D. Kalashnikov, J. Varley, Y. Chebotar, B. Swanson, R. Jonschkowski, C. Finn, S. Levine, and K. Hausman. Scaling up multi-task robotic reinforcement learning. In Conference on Robot Learning, pages 557–575. PMLR, 2022.
Yu et al. [2020] T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine. Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on robot learning, pages 1094–1100. PMLR, 2020.
Deisenroth et al. [2014] M. P. Deisenroth, P. Englert, J. Peters, and D. Fox. Multi-task policy search for robotics. In 2014 IEEE international conference on robotics and automation (ICRA), pages 3876–3881. IEEE, 2014.
Ze et al. [2023] Y. Ze, G. Yan, Y.-H. Wu, A. Macaluso, Y. Ge, J. Ye, N. Hansen, L. E. Li, and X. Wang. Gnfactor: Multi-task real robot learning with generalizable neural feature fields. In Conference on Robot Learning, pages 284–301. PMLR, 2023.
Goyal et al. [2023] A. Goyal, J. Xu, Y. Guo, V. Blukis, Y.-W. Chao, and D. Fox. Rvt: Robotic view transformer for 3d object manipulation. In Conference on Robot Learning, pages 694–710. PMLR, 2023.
Song et al. [2024] W. Song, H. Zhao, P. Ding, C. Cui, S. Lyu, Y. Fan, and D. Wang. Germ: A generalist robotic model with mixture-of-experts for quadruped robot. arXiv preprint arXiv:2403.13358, 2024.
Ma et al. [2024] X. Ma, S. Patidar, I. Haughton, and S. James. Hierarchical diffusion policy for kinematics-aware multi-task robotic manipulation. arXiv preprint arXiv:2403.03890, 2024.
Gervet et al. [2023] T. Gervet, Z. Xian, N. Gkanatsios, and K. Fragkiadaki. Act3d: 3d feature field transformers for multi-task robotic manipulation. In 7th Annual Conference on Robot Learning, 2023.
Rahmatizadeh et al. [2018] R. Rahmatizadeh, P. Abolghasemi, L. Bölöni, and S. Levine. Vision-based multi-task manipulation for inexpensive robots using end-to-end learning from demonstration. In 2018 IEEE international conference on robotics and automation (ICRA), pages 3758–3765. IEEE, 2018.
Ke et al. [2024] T.-W. Ke, N. Gkanatsios, and K. Fragkiadaki. 3d diffuser actor: Policy diffusion with 3d scene representations. arXiv preprint arXiv:2402.10885, 2024.
Huang et al. [2020] W. Huang, I. Mordatch, and D. Pathak. One policy to control them all: Shared modular policies for agent-agnostic control. In International Conference on Machine Learning, pages 4455–4464. PMLR, 2020.
Radosavovic et al. [2023] I. Radosavovic, B. Shi, L. Fu, K. Goldberg, T. Darrell, and J. Malik. Robot learning with sensorimotor pre-training. In Conference on Robot Learning, pages 683–693. PMLR, 2023.
Shafiullah et al. [2022] N. M. Shafiullah, Z. Cui, A. A. Altanzaya, and L. Pinto. Behavior transformers: Cloning $k$ modes with one stone. Advances in neural information processing systems, 35:22955–22968, 2022.
Mendez-Mendez et al. [2023] J. Mendez-Mendez, L. P. Kaelbling, and T. Lozano-Pérez. Embodied lifelong learning for task and motion planning. In Conference on Robot Learning, pages 2134–2150. PMLR, 2023.
Liu et al. [2024] X. Liu, D. Pathak, and D. Zhao. Meta-evolve: Continuous robot evolution for one-to-many policy transfer. arXiv preprint arXiv:2405.03534, 2024.
Traoré et al. [2019] R. Traoré, H. Caselles-Dupré, T. Lesort, T. Sun, G. Cai, D. Filliat, and N. Díaz-Rodríguez. Discorl: Continual reinforcement learning via policy distillation. In NeurIPS Workshop on Deep Reinforcement Learning, 2019.
Xie and Finn [2022] A. Xie and C. Finn. Lifelong robotic reinforcement learning by retaining experiences. In Conference on Lifelong Learning Agents, pages 838–855. PMLR, 2022.
Hejna III and Sadigh [2023] D. J. Hejna III and D. Sadigh. Few-shot preference learning for human-in-the-loop rl. In Conference on Robot Learning, pages 2014–2025. PMLR, 2023.
Rusu et al. [2015] A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015.
Sun et al. [2022] L. Sun, H. Zhang, W. Xu, and M. Tomizuka. Paco: Parameter-compositional multi-task reinforcement learning. Advances in Neural Information Processing Systems, 35:21495–21507, 2022.
Czarnecki et al. [2019] W. M. Czarnecki, R. Pascanu, S. Osindero, S. Jayakumar, G. Swirszcz, and M. Jaderberg. Distilling policy distillation. In The 22nd international conference on artificial intelligence and statistics, pages 1331–1340. PMLR, 2019.
Zhang et al. [2021] J. Zhang, H. Yu, and W. Xu. Hierarchical reinforcement learning by discovering intrinsic options. arXiv preprint arXiv:2101.06521, 2021.
Botvinick [2012] M. M. Botvinick. Hierarchical reinforcement learning and decision making. Current opinion in neurobiology, 22(6):956–962, 2012.
Nachum et al. [2018a] O. Nachum, S. S. Gu, H. Lee, and S. Levine. Data-efficient hierarchical reinforcement learning. Advances in neural information processing systems, 31, 2018a.
Nachum et al. [2018b] O. Nachum, S. Gu, H. Lee, and S. Levine. Near-optimal representation learning for hierarchical reinforcement learning. arXiv preprint arXiv:1810.01257, 2018b.
Kirkpatrick et al. [2017] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
Rusu et al. [2016] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
Finn et al. [2017] C. Finn, P. Abbeel, and S. Levine. Model-agnostic meta-learning for fast adaptation of deep networks. In International conference on machine learning, pages 1126–1135. PMLR, 2017.
Nichol et al. [2018] A. Nichol, J. Achiam, and J. Schulman. On first-order meta-learning algorithms. arXiv preprint arXiv:1803.02999, 2018.
Vilalta and Drissi [2002] R. Vilalta and Y. Drissi. A perspective view and survey of meta-learning. Artificial intelligence review, 18:77–95, 2002.
Vitiello et al. [2023] P. Vitiello, K. Dreczkowski, and E. Johns. One-shot imitation learning: A pose estimation perspective. arXiv preprint arXiv:2310.12077, 2023.
Biza et al. [2023] O. Biza, S. Thompson, K. R. Pagidi, A. Kumar, E. van der Pol, R. Walters, T. Kipf, J.-W. van de Meent, L. L. Wong, and R. Platt. One-shot imitation learning via interaction warping. arXiv preprint arXiv:2306.12392, 2023.
Vosylius and Johns [2023] V. Vosylius and E. Johns. Few-shot in-context imitation learning via implicit graph alignment. arXiv preprint arXiv:2310.12238, 2023.
Shen et al. [2023] W. Shen, G. Yang, A. Yu, J. Wong, L. P. Kaelbling, and P. Isola. Distilled feature fields enable few-shot language-guided manipulation. arXiv preprint arXiv:2308.07931, 2023.
Chen et al. [2023] T. Chen, X. Chen, X. Du, A. Rashwan, F. Yang, H. Chen, Z. Wang, and Y. Li. Adamv-moe: Adaptive multi-task vision mixture-of-experts. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17346–17357, 2023.
Riquelme et al. [2021] C. Riquelme, J. Puigcerver, B. Mustafa, M. Neumann, R. Jenatton, A. Susano Pinto, D. Keysers, and N. Houlsby. Scaling vision with sparse mixture of experts. Advances in Neural Information Processing Systems, 34:8583–8595, 2021.
Chen et al. [2023] Z. Chen, M. Ding, Y. Shen, W. Zhan, M. Tomizuka, E. Learned-Miller, and C. Gan. An efficient general-purpose modular vision model via multi-task heterogeneous training. arXiv preprint arXiv:2306.17165, 2023.
Fan et al. [2022] Z. Fan, R. Sarkar, Z. Jiang, T. Chen, K. Zou, Y. Cheng, C. Hao, Z. Wang, et al. M³vit: Mixture-of-experts vision transformer for efficient multi-task learning with model-accelerator co-design. Advances in Neural Information Processing Systems, 35:28441–28457, 2022.
Fedus et al. [2022] W. Fedus, B. Zoph, and N. Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.
Du et al. [2022] N. Du, Y. Huang, A. M. Dai, S. Tong, D. Lepikhin, Y. Xu, M. Krikun, Y. Zhou, A. W. Yu, O. Firat, et al. Glam: Efficient scaling of language models with mixture-of-experts. In International Conference on Machine Learning, pages 5547–5569. PMLR, 2022.
Artetxe et al. [2021] M. Artetxe, S. Bhosale, N. Goyal, T. Mihaylov, M. Ott, S. Shleifer, X. V. Lin, J. Du, S. Iyer, R. Pasunuru, et al. Efficient large scale language modeling with mixtures of experts. arXiv preprint arXiv:2112.10684, 2021.
Rajbhandari et al. [2022] S. Rajbhandari, C. Li, Z. Yao, M. Zhang, R. Y. Aminabadi, A. A. Awan, J. Rasley, and Y. He. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. In International conference on machine learning, pages 18332–18346. PMLR, 2022.
Nie et al. [2022] X. Nie, P. Zhao, X. Miao, T. Zhao, and B. Cui. Hetumoe: An efficient trillion-scale mixture-of-expert distributed training system. arXiv preprint arXiv:2203.14685, 2022.
Zhu et al. [2022] J. Zhu, X. Zhu, W. Wang, X. Wang, H. Li, X. Wang, and J. Dai. Uni-perceiver-moe: Learning sparse generalist models with conditional moes. Advances in Neural Information Processing Systems, 35:2664–2678, 2022.
Yu et al. [2022] P. Yu, M. Artetxe, M. Ott, S. Shleifer, H. Gong, V. Stoyanov, and X. Li. Efficient language modeling with sparse all-mlp. arXiv preprint arXiv:2203.06850, 2022.
Zheng et al. [2022] L. Zheng, Z. Li, H. Zhang, Y. Zhuang, Z. Chen, Y. Huang, Y. Wang, Y. Xu, D. Zhuo, E. P. Xing, et al. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 559–578, 2022.
Bengio et al. [2015] E. Bengio, P.-L. Bacon, J. Pineau, and D. Precup. Conditional computation in neural networks for faster models. arXiv preprint arXiv:1511.06297, 2015.
Zhou et al. [2022] Y. Zhou, T. Lei, H. Liu, N. Du, Y. Huang, V. Zhao, A. M. Dai, Q. V. Le, J. Laudon, et al. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems, 35:7103–7114, 2022.
Chi et al. [2022] Z. Chi, L. Dong, S. Huang, D. Dai, S. Ma, B. Patra, S. Singhal, P. Bajaj, X. Song, X.-L. Mao, et al. On the representation collapse of sparse mixture of experts. Advances in Neural Information Processing Systems, 35:34600–34613, 2022.
Mandlekar et al. [2023] A. Mandlekar, S. Nasiriany, B. Wen, I. Akinola, Y. Narang, L. Fan, Y. Zhu, and D. Fox. Mimicgen: A data generation system for scalable robot learning using human demonstrations. In 7th Annual Conference on Robot Learning, 2023.
Ze et al. [2024] Y. Ze, G. Zhang, K. Zhang, C. Hu, M. Wang, and H. Xu. 3d diffusion policy. arXiv preprint arXiv:2403.03954, 2024.
Ajay et al. [2022] A. Ajay, Y. Du, A. Gupta, J. Tenenbaum, T. Jaakkola, and P. Agrawal. Is conditional generative modeling all you need for decision-making? arXiv preprint arXiv:2211.15657, 2022.
Bao et al. [2023] C. Bao, H. Xu, Y. Qin, and X. Wang. Dexart: Benchmarking generalizable dexterous manipulation with articulated objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21190–21200, 2023.
Zhang et al. [2023] X. Zhang, C. Wang, L. Sun, Z. Wu, X. Zhu, and M. Tomizuka. Efficient sim-to-real transfer of contact-rich manipulation skills with online admittance residual learning. In Conference on Robot Learning, pages 1621–1639. PMLR, 2023.
McCloskey and Cohen [1989] M. McCloskey and N. J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier, 1989.
Mandlekar et al. [2021] A. Mandlekar, D. Xu, J. Wong, S. Nasiriany, C. Wang, R. Kulkarni, L. Fei-Fei, S. Savarese, Y. Zhu, and R. Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In Conference on Robot Learning (CoRL), 2021.
Kingma and Ba [2014] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Appendix

Appendix A Experiment Details

A.1 Implementation Details

For multitask learning in 2D simulation environments, we train the policy for 300 epochs with 12 transformer layers and an embedding dimension of 512. The training batch size is 64 and the learning rate is 0.0001, using the Adam optimizer [81]. For evaluation, we use 2 observation steps, plan 8 action steps, and execute only the first. For SDP, we set the number of experts to 8 and activate 2 experts for each task.
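To make the "8 experts, 2 active" setting concrete, the following is a minimal sketch of generic top-k expert gating in plain Python. It is not the SDP router itself: the softmax-over-selected-logits gate, the toy scaling experts, and the names `top_k_gate` and `moe_forward` are assumptions introduced only for illustration.

```python
import math
import random

NUM_EXPERTS = 8  # number of experts stated in A.1
TOP_K = 2        # number of activated experts stated in A.1


def top_k_gate(logits, k=TOP_K):
    """Return {expert_index: weight} for the k largest router logits.

    The weights are a softmax restricted to the selected experts,
    so they are positive and sum to 1.
    """
    chosen = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    peak = max(logits[i] for i in chosen)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in chosen}
    total = sum(exps.values())
    return {i: v / total for i, v in exps.items()}


def moe_forward(x, experts, logits, k=TOP_K):
    """Weighted sum of the selected experts' outputs; unselected experts are never called."""
    gates = top_k_gate(logits, k)
    out = [0.0] * len(x)
    for i, w in gates.items():
        y = experts[i](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out, gates


# Hypothetical toy experts: expert i scales its input by (i + 1).
experts = [lambda x, s=i + 1: [s * v for v in x] for i in range(NUM_EXPERTS)]

rng = random.Random(0)
router_logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]  # stand-in for a learned router
output, gates = moe_forward([1.0, 2.0], experts, router_logits)
print("active experts:", sorted(gates), "output:", output)
```

Only the two selected experts are evaluated, which is why the number of active parameters stays fixed as more experts are added.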

For the 3D simulation environments, we use 2 layers of transformer blocks with 256 embedding dimensions. The learning rate is 0.0001, and we use the Adam optimizer [81].

For the continual learning setting, we use 8 layers of transformer blocks with 256 embedding dimensions. The learning rate is 0.0001, and we use the Adam optimizer [81]. For SDP, we set the number of experts for a new trainable task to 8, and activate 2 experts for each new task.
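The three settings above can be collected into a single configuration table. The sketch below does this with a Python dataclass; the class name, field names, and setting keys are hypothetical, and `None` marks any value that the text does not state for a given setting.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SettingConfig:
    """Hyperparameters for one experimental setting (field names are hypothetical)."""
    num_layers: int
    embed_dim: int
    learning_rate: float = 1e-4
    optimizer: str = "adam"
    # None marks a value that the text does not state for that setting.
    epochs: Optional[int] = None
    batch_size: Optional[int] = None
    obs_steps: Optional[int] = None
    plan_steps: Optional[int] = None
    exec_steps: Optional[int] = None
    num_experts: Optional[int] = None
    active_experts: Optional[int] = None


CONFIGS = {
    "multitask_2d": SettingConfig(
        num_layers=12, embed_dim=512, epochs=300, batch_size=64,
        obs_steps=2, plan_steps=8, exec_steps=1,
        num_experts=8, active_experts=2,
    ),
    "sim_3d": SettingConfig(num_layers=2, embed_dim=256),
    "continual": SettingConfig(
        num_layers=8, embed_dim=256, num_experts=8, active_experts=2,
    ),
}

# List which fields each setting leaves unstated.
for name, cfg in CONFIGS.items():
    unstated = [key for key, value in vars(cfg).items() if value is None]
    print(name, "unstated:", unstated)
```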

A.2 Real World Robot Experiments Extension

In this section, we provide additional details on the real-world robot experiments, as shown in Figure 7. For the pull task, we collect 20 human demonstrations where the objective is to pull the circle to the red region in the center of the table. For the pick-and-place task, we also collect 20 demonstrations, where the task is to pick up the cup and place it on the plate. For the last task, the hang task, we collect 20 demonstrations for goal point 1 and 20 demonstrations for goal point 2. The model distinguishes between the two goal points using a learnable Fourier embedding. After collecting all the demonstrations, the model is trained on the three tasks simultaneously. The results, reported in Table 3, compare the models on multitask learning.
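As an illustration of how a Fourier embedding can separate the two goal points, the sketch below maps a scalar goal identifier to sine and cosine features. The exact form used in the experiments is not specified here, so everything in this block is an assumption: the class name, the number of frequencies, and the encoding of the goals as the scalars 1.0 and 2.0. In a learnable variant the frequencies would be updated by the optimizer; here they are only randomly initialized.

```python
import math
import random


class FourierEmbedding:
    """Map a scalar to [sin(2*pi*f_j*x), ..., cos(2*pi*f_j*x), ...] features."""

    def __init__(self, num_frequencies=4, seed=0):
        rng = random.Random(seed)
        # Assumed Gaussian initialization; these would be trainable parameters
        # in a learnable embedding.
        self.freqs = [rng.gauss(0.0, 1.0) for _ in range(num_frequencies)]

    def __call__(self, x):
        angles = [2.0 * math.pi * f * x for f in self.freqs]
        return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


embed = FourierEmbedding()
goal_1 = embed(1.0)  # hypothetical scalar code for goal point 1
goal_2 = embed(2.0)  # hypothetical scalar code for goal point 2
print(len(goal_1), goal_1 != goal_2)
```

The two goals yield different feature vectors of the same length, which can then be fed to the policy as conditioning.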

A.3 Model Collapse

In this section, we describe an observation from our real-world robot experiments, known as "model collapse", which is closely related to sparsity. As shown in Figure 8, the TCD method completely fails in the pick-and-place task. It cannot separate the action spaces of different tasks, and therefore executes actions that do not appear in the human demonstrations for this task.

A.4 Additional Continual Learning Experiments

In this section, we conduct additional continual learning experiments with a fully finetuned visual encoder. Fully finetuning the encoder may cause old tasks to be forgotten, so we test the transfer ability of different models under this condition. As shown in Table 6, both methods drop to 0.00 on previous tasks, but after transferring to the last task (Square), our model outperforms LoRA on the new task (0.75 vs. 0.57).

Table 6: Evaluation on continual learning. Comparison of different policy decoders. Grey blocks indicate performance on new tasks; light-blue blocks indicate performance on previous tasks.

Vision Encoder	Policy Decoder	Stage 1: Can	Stage 2: Can	Stage 2: Lift	Stage 3: Can	Stage 3: Lift	Stage 3: Square
FFT	LoRA [10]	0.97	0.00	1.00	0.00	0.00	0.57
FFT	MoE (Ours)	0.93	0.00	1.00	0.00	0.00	0.75
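The comparison in Table 6 can be checked with a few lines of arithmetic. The sketch below copies the scores from the table and computes the gap on the final task and the mean score over the three Stage 3 tasks; the dictionary layout and the helper `stage_mean` are illustrative choices, not part of the original evaluation code.

```python
# Scores copied from Table 6 (fully finetuned vision encoder).
table6 = {
    "LoRA": {
        "stage1": {"Can": 0.97},
        "stage2": {"Can": 0.00, "Lift": 1.00},
        "stage3": {"Can": 0.00, "Lift": 0.00, "Square": 0.57},
    },
    "MoE (Ours)": {
        "stage1": {"Can": 0.93},
        "stage2": {"Can": 0.00, "Lift": 1.00},
        "stage3": {"Can": 0.00, "Lift": 0.00, "Square": 0.75},
    },
}


def stage_mean(scores):
    """Average score over all tasks evaluated at one stage."""
    return sum(scores.values()) / len(scores)


# Gap on the newly learned task at the final stage.
square_gap = table6["MoE (Ours)"]["stage3"]["Square"] - table6["LoRA"]["stage3"]["Square"]

# Mean over Can, Lift, and Square after the final stage.
stage3_mean = {method: stage_mean(s["stage3"]) for method, s in table6.items()}

print(round(square_gap, 2), {m: round(v, 2) for m, v in stage3_mean.items()})
```

Under these numbers the gap on Square is 0.18, and the Stage 3 means are 0.19 for LoRA and 0.25 for MoE, with the entire difference coming from the new task.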