Reinforcement Learning via Auxiliary Task Distillation

Abhinav Narayan Harish1, Larry Heck2, Josiah P. Hanna1, Zsolt Kira2, Andrew Szot2
1University of Wisconsin – Madison, 2Georgia Tech
Abstract

We present Reinforcement Learning via Auxiliary Task Distillation (AuxDistill), a new method that enables reinforcement learning (RL) to perform long-horizon robot control problems by distilling behaviors from auxiliary RL tasks. AuxDistill achieves this by concurrently carrying out multi-task RL with auxiliary tasks, which are easier to learn and relevant to the main task. A weighted distillation loss transfers behaviors from these auxiliary tasks to solve the main task. We demonstrate that AuxDistill can learn a pixels-to-actions policy for a challenging multi-stage embodied object rearrangement task from the environment reward without demonstrations, a learning curriculum, or pre-trained skills. AuxDistill achieves $2.3\times$ higher success than the previous state-of-the-art baseline in the Habitat Object Rearrangement benchmark and outperforms methods that use pre-trained skills and expert demonstrations.

1 Introduction

While reinforcement learning (RL) is successful in a variety of settings including games [1, 2, 3], chatbots [4, 5, 6, 7, 8], and robotics [9, 10, 11], long-horizon tasks such as embodied object rearrangement, where an embodied agent must rearrange objects to target positions, remain a challenge [12]. Object rearrangement requires learning heterogeneous behaviors like picking, navigating, placing, and opening concurrently with sequencing these behaviors to solve the overall task, all while operating from egocentric vision. Furthermore, the behaviors are interdependent, meaning the robot can only learn how to pick an object if it has first learned how to open the fridge containing the object. Since object rearrangement requires interacting with objects across house-scale spaces with low-level control, the episodes consist of thousands of low-level control steps, exacerbating the problem of credit assignment [13, 14]. Overall, the problems of concurrently learning low-level control, high-level decision-making, interdependent task stages, and episodes with many time steps make object rearrangement challenging even with dense rewards. While some prior methods have applied end-to-end RL to rearrangement problems, they require excessive experience in simplified simulators [15] or expert skill demonstrations [16]. Other prior work has made such complex tasks more tractable by decomposing the full task into a hierarchy of separately trained skills that a high-level policy sequences together to solve the overall task. However, such hierarchical methods suffer from compounding errors between skills [17] and from the difficulty of dynamically selecting between skills.

We therefore present Reinforcement Learning via Auxiliary Task Distillation (AuxDistill), a method for training policies for long-horizon tasks from scratch with end-to-end RL from reward alone. AuxDistill also learns in auxiliary tasks using a multi-task RL framework and transfers the knowledge from these auxiliary tasks to help solve the desired “main” task of interest. Importantly, AuxDistill concurrently learns the main task and all auxiliary tasks, which are all initialized from scratch and not pre-trained. These auxiliary tasks help because they are easier to learn with RL than the main task and contain relevant behaviors for the full task. For example, in object rearrangement, the agent should also practice picking up objects in isolation from the complexities of the full task. Unlike curriculum learning in RL [18, 19, 20, 21], which also leverages easier tasks to learn harder tasks, AuxDistill does not have distinct curriculum phases. Instead, it has a single training objective for distilling sub-behaviors into the overall task behavior. Crucially, the agent learns end-to-end which auxiliary behaviors are relevant for the main task. This relevance is encoded by a scalar weight that is determined by the relevance of the robot state to the auxiliary task and is grounded in the oracle task plan. A distillation objective transfers behaviors from relevant auxiliary tasks by encouraging the policy to act consistently between the main and auxiliary tasks for relevant states in the main task. For instance, in object rearrangement, when the robot is near a drawer with the target object inside, it is supervised with the distillation loss from an “open drawer” auxiliary task. Although the policy concurrently learns the “open drawer” auxiliary task from scratch, AuxDistill learns the auxiliary tasks faster since they are easier. Thus, the distillation loss is a useful dense per-time-step supervision signal that helps overcome the challenges of learning from the main task reward alone.

We empirically demonstrate that AuxDistill outperforms a variety of baselines in terms of success rate on the home rearrangement benchmark in Habitat 2.0 [22]. These include baselines such as hierarchical RL, end-to-end RL with and without a curriculum, and imitation learning. On rearrangement episodes where objects start in open receptacles, AuxDistill achieves $1.4\times$ higher success than baselines and $2.6\times$ higher success in harder episodes where objects can start in closed receptacles. These results highlight the value of AuxDistill's ability to learn the entire rearrangement task end-to-end with RL. We show results beyond object rearrangement on a category-conditioned object manipulation task where AuxDistill achieves $1.75\times$ higher success than baselines. We also conduct extensive ablation studies where we analyze the properties of auxiliary tasks needed by AuxDistill, and find that AuxDistill is generally robust to the choice of auxiliary tasks. An analysis of the distillation objective reveals the importance of the distillation loss, with the absence of the distillation loss leading to no success on the rearrangement task. All code is available at https://github.com/absdnd/aux_distill

2 Related Work

Some prior works address complex and long-horizon decision-making problems using a hierarchical breakdown of skills. One such category of approaches is option-critic methods [23, 24, 25], which seek to learn options to temporally abstract the high-level policy decision making. However, discovering and learning such options is unstable and results in a challenging credit assignment problem. Another line of work first trains or is given pre-trained skills and then learns a high-level policy to sequence these skills together to complete longer tasks [26, 27, 28, 29]. Using such a hierarchy leads to compounding errors, because the sequenced skills were not trained to properly transition between each other [30, 31, 32, 33, 17]. Our approach does not utilize a hierarchical policy, and can solve long-horizon tasks better than hierarchical methods by utilizing a single policy and thus avoiding the compounding errors that emerge from sequencing independently trained policies.

Like our method, prior works have tackled model-free end-to-end RL to solve long, complex tasks. Scaling PPO [34] training has enabled robots to learn complex locomotion behaviors [35, 36, 37, 38, 39, 40], manipulate unseen objects in a robotic hand [10], and play competitive video games [2]. Likewise, Berges et al. [15] showed scaling end-to-end RL can learn object rearrangement. However, their approach requires training agents for billions of environment steps with a simplified kinematic physics simulation. We show results in the full dynamic Habitat simulation, achieving better results in an order of magnitude fewer environment interactions, and show results in harder rearrangement tasks. Other works have incorporated auxiliary objectives in RL to boost performance in embodied tasks [41, 42, 43, 44]. These works add auxiliary self-supervised prediction objectives, whereas our method adds an auxiliary policy learning task via multi-task RL. Jia et al. [45] also use easier auxiliary RL tasks that specialist agents learn to complete and then distill back into a generalist agent. While our work also uses RL in easier tasks to boost performance, it doesn’t require alternating between learning specialist and generalist policies.

Prior works have also attempted to learn a curriculum to leverage easier tasks to learn harder tasks that require composing multiple behaviors. Some works learn curriculum generation policies that adjust the environment to suit the agent’s capabilities [19, 20, 21]. In other works, a curriculum naturally emerges as a result of downstream training [46]. Finally, some works hand-design curricula that change aspects of the environment, such as the starting and goal distributions, depending on the agent's performance during training [35]. Our work shares a similar spirit of leveraging easier auxiliary tasks to learn a harder task, but it does not enforce an explicit curriculum and instead learns all tasks simultaneously.

Figure 1: AuxDistill learns a rearrangement policy operating from egocentric depth perception and coordinate-based task specification. The full object rearrangement task decomposes into modular abilities that can be learned by auxiliary tasks with indicator vectors $T_1 \cdots T_N$, which are trained along with the main task using end-to-end RL. The distillation loss is computed as a weighted combination of the task relevance of $o_t$ in the main task $T_0$ under all auxiliary tasks. The task relevance function computes $w_i(s_t)$ based on the relevance of the current observation and robot state to the auxiliary task $T_i$. The distillation loss and RL-training loss are then used to update the policy.

3 Method

We present Reinforcement Learning via Auxiliary Task Distillation (AuxDistill), a new method for training long-horizon policies from scratch using rewards alone. Learning from reward alone in complex tasks like embodied object rearrangement is challenging because an agent needs to combine thousands of low-level actions controlling the arm and base, operate from egocentric visual perception, and dynamically sequence distinct behaviors such as navigating, picking, placing, and opening. AuxDistill addresses these challenges of RL in complex problems, like embodied rearrangement, by learning to leverage knowledge from easier auxiliary tasks related to the desired task we are trying to solve. Unlike a curriculum, which learns gradually harder tasks in stages, AuxDistill learns the desired task concurrently with the auxiliary tasks and includes a novel distillation mechanism to transfer knowledge from easier to harder tasks.

3.1 Preliminaries

Our problem is formulated as a goal-specified Partially-Observable Markov Decision Process (POMDP) defined as a tuple $\mathcal{M}=\left(\mathcal{S},\mathcal{O},\mathcal{A},\mathcal{P},\mathcal{R},\mathcal{G},\rho,\gamma\right)$ with underlying state space $\mathcal{S}$, observation space $\mathcal{O}$, action space $\mathcal{A}$, transition function $\mathcal{P}$, reward function $\mathcal{R}$, goal space $\mathcal{G}$, initial state distribution $\rho$ and discount factor $\gamma$. For a task like rearrangement, the goal space $\mathcal{G}$ is specified using the 3D coordinate of the object’s start location and the goal location where the object has to be placed. Our objective is to learn a goal-conditioned policy $\pi(a\mid o,g)$ mapping an observation $o$ and goal $g$ to an action $a$ that maximizes the sum of discounted rewards $\mathbb{E}_{s_0\sim\rho_0,\,g\sim\mathcal{G}}\sum_{t}\gamma^{t}\mathcal{R}(s_t,g)$.
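As a small illustration of the objective above, the discounted return of a single episode can be computed as follows. The reward values and the discount factor in the example are made up for illustration; this section does not report the values used in the experiments.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma^t * r_t for one episode (rewards listed from t = 0)."""
    total = 0.0
    # Work backwards so each step is one multiply-add: G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical three-step episode with gamma = 0.5.
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
```

The policy is trained to maximize the expectation of this quantity over initial states and goals.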

3.2 Reinforcement Learning via Auxiliary Task Distillation

The core idea of AuxDistill is to learn a policy in a difficult desired “main” task by transferring knowledge from easier auxiliary tasks. This is done without using an explicit curriculum by learning the main and auxiliary tasks in a single training loop. We refer to the main task we wish to learn as $\mathcal{M}_0$ with a task identifier $T_0$. We define $N$ auxiliary tasks $\left\{\mathcal{M}_i\right\}_{i=1}^{N}$ with task identifiers $\left\{T_i\right\}_{i=1}^{N}$. The auxiliary tasks $\left\{\mathcal{M}_i\right\}$ share the same state, observation, goal, and action space as the target task $\mathcal{M}_0$. Each $\mathcal{M}_i$ has a separately defined starting state distribution and reward function. We assume these auxiliary tasks are related to the full task yet are easier to solve than the full task. Specifically, the auxiliary tasks can be easier instantiations or sub-parts of the overall task. For example, in rearrangement, we define the auxiliary tasks in terms of interactions the agent needs to complete the entire rearrangement episode, such as picking, placing, and opening. Prior works use similar task definitions to train skills for hierarchical policies in rearrangement [17], but these works suffer from a two-stage pipeline that first trains each skill separately and then decides how to combine them. AuxDistill directly learns the full task from scratch while also performing better.

AuxDistill learns a single policy with RL that concurrently learns to perform the main and auxiliary tasks. We illustrate the implementation of AuxDistill in Figure 1. This policy, parameterized by $\theta$, is expressed as $\pi_\theta(a_t\mid o_t,g,T)$ for observation $o_t$, episode goal $g$, and per-$\mathcal{M}$ task identifier $T$, which is encoded as a one-hot embedding in a vector which has the same size as the maximum number of auxiliary tasks. Note that since all tasks share the same observation and action space, $\pi_\theta$ can act and observe in all tasks based on the input task identifier $T$.

AuxDistill updates the policy based on an average RL loss from the main task and all auxiliary tasks. Let $\mathcal{L}_{\mathcal{M}_i}^{\text{RL}}(\theta)$ denote the RL loss for $\pi_\theta$ in MDP $\mathcal{M}_i$. The auxiliary tasks are designed to capture subsets of the main task that are easier to accomplish than the main task itself. To compute these losses, we assume we can collect experience in the auxiliary tasks. For example, to collect experience in a “pick object” auxiliary task in object rearrangement, the robot is spawned close to the object. We then update based on the average of the main task and auxiliary task losses: $\frac{1}{N}\sum_{i=0}^{N}\mathcal{L}_{\mathcal{M}_i}^{\text{RL}}(\theta)$. Intuitively, $\pi_\theta$ will first learn to complete the easier auxiliary tasks, and this auxiliary task competency can aid in solving the main task. This shares a similar insight as curriculum learning, except that all tasks are learned concurrently and the curriculum stages are induced by the policy naturally learning the easier tasks first.

In addition to optimizing the average RL loss between the main and auxiliary tasks, AuxDistill also optimizes a distillation loss that encourages the policy to transfer relevant knowledge from the auxiliary tasks to the main task. To achieve this, for a particular observation $o_t$, we consider the policy distribution $\pi_\theta(\cdot\mid o_t,g,T)$ under different task identifiers $T$. For compactness, we write $\pi_\theta(\cdot\mid o_t,g,T_i)$, where $T_i$ indicates the task identifier for $\mathcal{M}_i$, as $\pi_\theta^{T_i}\left(o_t\right)$. We want $\pi_\theta^{T_0}$ (the policy in the main task) to match the behaviors from the policy in the relevant auxiliary tasks $\pi_\theta^{T_i}$. We measure this relevance of an auxiliary task $\mathcal{M}_i$ to the main task at time step $t$ via an auxiliary task relevance function $w_i(s_t)$. This function denotes how much the knowledge from $\mathcal{M}_i$ should apply to $\mathcal{M}_0$ in the underlying simulator state $s_t$. This relevance function can be grounded in the task plan generated by an oracle planner. For example, consider the state of the robot before picking up the object. In this state, the pick auxiliary task would be relevant to the main task, and the place auxiliary task would not. If an episode requires opening a fridge before accessing the object, the relevant task before the fridge is opened would be open-fridge. Note that computing the distillation weight can utilize oracle knowledge of the simulator state (for instance, whether the object is inside a fridge or a cabinet) since this information is only provided as a training signal and not used during inference. This information is utilized by all methods we compare against, including our strongest baseline [16].

We can then distill experience from $\mathcal{M}_i$ by supervising the policy so that its action distribution in the target task $\mathcal{M}_0$ matches its action distribution under $\mathcal{M}_i$. The distillation loss is computed as the KL-divergence between the action distributions under $\mathcal{M}_0$ and $\mathcal{M}_i$, weighted by the relevance of auxiliary task $\mathcal{M}_i$ given by $w_i(s_t)$. The overall loss function of AuxDistill, including the distillation loss, is then:

$$\frac{1}{N}\sum_{i=0}^{N}\mathcal{L}_{\mathcal{M}_i}^{\text{RL}}(\theta)+\lambda\sum_{i=1}^{N}w_i(s_t)\,D_{\text{KL}}\left(\pi_\theta^{T_0}(o_t)\;\|\;\pi_\theta^{T_i}(o_t)\right) \qquad (1)$$

Here $\lambda$ represents the distillation weight relative to other PPO losses. Note that our formulation relies on the auxiliary tasks being easier to solve and relevant to the main task. Distillation using the task relevance function offers an additional supervisory signal that the policy can optimize along with its own reward. This doesn’t impose the large overhead of hierarchical RL methods where each auxiliary task must be pre-trained in isolation with its own reward function and goal specification. Auxiliary tasks are flexibly incorporated to address parts of the task that are challenging for the agent to learn.
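To make Equation 1 concrete, the following is a minimal sketch of the combined loss for a single main-task time step. It assumes a discrete action distribution so that the KL-divergence is a finite sum; the actual policy controls a robot arm and base and may use a different distribution family. The function names `kl_divergence` and `aux_distill_loss`, the scalar RL losses, and all numeric values are hypothetical. This section also does not state whether gradients flow through the auxiliary-task branch, so the sketch only computes the loss value.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0.0
    )


def aux_distill_loss(rl_losses, main_dist, aux_dists, weights, lam):
    """Value of Equation 1 for one main-task time step.

    rl_losses: [L_0, ..., L_N], RL losses for the main task (index 0) and N auxiliary tasks.
    main_dist: action distribution under the main-task identifier T_0.
    aux_dists: action distributions under T_1..T_N for the same observation o_t.
    weights:   relevance weights w_1(s_t)..w_N(s_t).
    lam:       distillation coefficient lambda.
    """
    n_aux = len(aux_dists)
    rl_term = sum(rl_losses) / n_aux  # 1/N normalization, as written in Equation 1
    distill_term = sum(
        w * kl_divergence(main_dist, q) for w, q in zip(weights, aux_dists)
    )
    return rl_term + lam * distill_term


# Two auxiliary tasks; only the first is relevant in this state, and the main-task
# distribution already matches it, so the distillation term vanishes.
loss = aux_distill_loss(
    rl_losses=[1.0, 0.5, 0.5],
    main_dist=[0.5, 0.5],
    aux_dists=[[0.5, 0.5], [0.9, 0.1]],
    weights=[1.0, 0.0],
    lam=0.1,
)
print(loss)  # 1.0
```

Because the relevance weights gate each KL term, an auxiliary task whose weight is zero contributes nothing, no matter how far its action distribution is from the main-task distribution.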

3.3 Implementation Details

We use PPO [34] to compute the RL loss $\mathcal{L}_{\mathcal{M}_i}^{\text{RL}}$ in Equation 1. We create $M$ environment instances for each of the main and $N$ auxiliary tasks and vectorize them for parallel action execution. We roll out the policy in all these $M(N+1)$ environments and collect a batch of data. We then compute Equation 1 using the PPO loss, update the policy, and repeat this process.
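The environment layout described above can be sketched as a simple assignment of task indices to vectorized environment slots. The helper name `assign_tasks` and the value of $M$ used in the example are assumptions for illustration; the ordering of environments in the actual implementation may differ.

```python
def assign_tasks(num_aux_tasks, envs_per_task):
    """Return the task index for each of the M * (N + 1) vectorized environments.

    Index 0 is the main task; indices 1..N are the auxiliary tasks.
    """
    return [
        task for task in range(num_aux_tasks + 1) for _ in range(envs_per_task)
    ]


# Hypothetical setting: N = 5 auxiliary tasks and M = 2 environments per task.
task_ids = assign_tasks(num_aux_tasks=5, envs_per_task=2)
print(len(task_ids))  # 12
print(task_ids)       # [0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
```

Each rollout batch then contains experience from every task, so a single gradient step can use both the per-task RL losses and the distillation term.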

The policy is represented as a 2-layer LSTM network with 512 hidden units per layer, similar to the architecture employed in [22]. In our experiments, $o_t$ is a depth or RGB egocentric image, which we encode with a ResNet50 [47] network. The goal and the proprioceptive state of the agent are concatenated with the visual embeddings along with the index of the POMDP $\mathcal{M}_i$, for which we use a one-hot embedding. This vector is then passed as an input to the LSTM. A linear projection on the output of the LSTM then produces an action to execute on the next step in the environment and a vector of value-predictions corresponding to each of the tasks $\mathcal{M}_i$ used during training.
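The construction of the LSTM input can be sketched as a plain concatenation. The feature sizes, the concatenation order, and the choice of one slot per task (main plus auxiliary) in the one-hot vector are assumptions made for this example; the helper names are hypothetical.

```python
def one_hot(index, size):
    """One-hot vector of length `size` with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def build_lstm_input(visual_embedding, goal, proprio, task_index, num_tasks):
    """Concatenate visual features, goal, proprioception, and the task one-hot."""
    return (
        list(visual_embedding)
        + list(goal)
        + list(proprio)
        + one_hot(task_index, num_tasks)
    )


# Toy dimensions: 3 visual features, a 3D goal, 2 proprioceptive values, 6 tasks.
x = build_lstm_input(
    visual_embedding=[0.1, 0.2, 0.3],
    goal=[1.0, 0.0, 0.5],
    proprio=[0.0, 0.0],
    task_index=2,
    num_tasks=6,
)
print(len(x))  # 14
print(x[-6:])  # [0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
```

Changing only `task_index` while keeping the other inputs fixed yields the distributions $\pi_\theta^{T_0}(o_t)$ and $\pi_\theta^{T_i}(o_t)$ that the distillation loss compares.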

Our policy is learned via a multi-task RL formulation, with the main and auxiliary tasks being concurrently learned. Each of the tasks may have different reward magnitudes and relative performance during training. To address this, we use PopArt [48] with $\beta = 3\times10^{-4}$ to enable learning across different return scales. Additional AuxDistill details are in Appendix A of the supplementary material.
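For readers unfamiliar with PopArt, the sketch below shows its core mechanism on a single scalar value head: running estimates of the return mean and second moment are updated with step size $\beta$, and the final linear layer is rescaled so that unnormalized predictions are unchanged by the statistics update. The class name, initial values, and the idea of keeping one such head per task are assumptions for illustration, not a description of the authors' implementation.

```python
import math


class PopArtHead:
    """Scalar value head v(h) = sigma * (w . h + b) + mu with PopArt statistics."""

    def __init__(self, weights, bias, beta=3e-4):
        self.w = list(weights)
        self.b = bias
        self.mu = 0.0   # running mean of value targets
        self.nu = 1.0   # running second moment of value targets
        self.beta = beta

    @property
    def sigma(self):
        # Standard deviation from the two moments, floored for numerical stability.
        return math.sqrt(max(self.nu - self.mu ** 2, 1e-8))

    def normalized(self, h):
        return sum(wi * hi for wi, hi in zip(self.w, h)) + self.b

    def unnormalized(self, h):
        return self.sigma * self.normalized(h) + self.mu

    def update_stats(self, target):
        """Update mu and nu, then rescale w and b to preserve unnormalized outputs."""
        old_mu, old_sigma = self.mu, self.sigma
        self.mu = (1.0 - self.beta) * self.mu + self.beta * target
        self.nu = (1.0 - self.beta) * self.nu + self.beta * target ** 2
        new_sigma = self.sigma
        self.w = [wi * old_sigma / new_sigma for wi in self.w]
        self.b = (old_sigma * self.b + old_mu - self.mu) / new_sigma


# A larger beta than in the paper is used here so the effect is visible in one step.
head = PopArtHead(weights=[0.5, -0.25], bias=0.1, beta=0.1)
features = [1.0, 2.0]
before = head.unnormalized(features)
head.update_stats(target=100.0)
after = head.unnormalized(features)
print(abs(before - after) < 1e-9)  # True: the prediction is preserved
```

The value loss is then computed against normalized targets, which keeps its magnitude comparable across tasks whose returns differ in scale.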

| Method | Train (Seen): All Episodes | Train (Seen): Easy | Train (Seen): Hard | Eval (Unseen): All Episodes | Eval (Unseen): Easy | Eval (Unseen): Hard |
|---|---|---|---|---|---|---|
| M3 (Oracle) [17] | 27 ± 0 | 57 ± 2 | 12 ± 1 | 28 ± 0 | 58 ± 2 | 13 ± 2 |
| M3 [17] | 25 ± 1 | 56 ± 1 | 9 ± 2 | 13 ± 1 | 53 ± 4 | 0 ± 0 |
| Galactic [15] | - | 37 ± 0 | - | - | 26 ± 0 | - |
| Monolithic RL | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 |
| Skill Transformer [16] | 25 ± 1 | 44 ± 1 | 16 ± 1 | 23 ± 1 | 37 ± 1 | 16 ± 1 |
| AuxDistill (No Distillation) | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 | 0 ± 0 |
| RL Curriculum | 0 ± 0 | 1 ± 0 | 0 ± 0 | 0 ± 0 | 1 ± 1 | 0 ± 0 |
| AuxDistill | 49 ± 2 | 74 ± 2 | 36 ± 2 | 52 ± 2 | 75 ± 2 | 41 ± 2 |
Table 1: Success rates on the rearrangement task for our method, AuxDistill (highlighted in blue), and baselines. Displayed are the averages and standard deviations over 3 seeds for M3, Monolithic RL, Skill Transformer, and M3 Oracle (numbers from [16]), and over 3 seeds for the remaining methods, with the highest numbers per setting bolded. Numbers in the Easy and Hard columns are averages over 100 episodes and 200 episodes, respectively. The All Episodes column is an average across both the Easy and Hard episodes. AuxDistill outperforms all baselines in all settings.
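As a quick sanity check on how the All Episodes column relates to the other two, the sketch below assumes the average is weighted by episode count (100 easy and 200 hard episodes). This weighting is an assumption; it reproduces the AuxDistill rows, but because the table entries are rounded, it need not match every row exactly.

```python
def all_episodes_rate(easy_rate, hard_rate, n_easy=100, n_hard=200):
    """Episode-weighted average of the Easy and Hard success rates (in percent)."""
    return (easy_rate * n_easy + hard_rate * n_hard) / (n_easy + n_hard)


print(round(all_episodes_rate(75, 41)))  # AuxDistill, Eval (Unseen): 52
print(round(all_episodes_rate(74, 36)))  # AuxDistill, Train (Seen): 49
```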

4 Experiments

4.1 Object-Rearrangement

In this section, we compare AuxDistill with baselines on the Habitat 2.0 Object Rearrangement task [22]. For comparison with baselines, we run our experiments using the setup from [16]. In this task, a Fetch robot [49] must move an object from a specified start position to a desired goal position in an indoor home environment using only onboard sensing. The agent has no privileged information like existing maps of the environment, 3D object models, or exact object positions. The Fetch robot senses the world through a $256\times256$ depth camera mounted on the robot’s head, robot joint positions, gripper state, and base egomotion, giving the relative position of the robot from the start of the episode. The task is specified by a starting 3D object coordinate for the object to move and a 3D goal coordinate to move the object to. Both coordinates are specified relative to the robot’s position at the start of the episode. Only the starting object coordinate is specified, and this information is not updated based on the current object position (e.g. if it is moved).

The robot interacts with the world via a 7DoF arm, a suction gripper attached to the end of the arm, and a mobile base. The episode is successful if the target object is within 15cm of the goal position. The robot has a budget of 1,500 steps to complete the task.
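The success criterion and step budget above can be written as a short check. This is a simplified sketch: it assumes positions are given in meters, treats "within 15cm" as an inclusive threshold, and ignores any additional conditions the benchmark may impose. The names `is_success` and `episode_outcome` are hypothetical.

```python
import math

SUCCESS_RADIUS_M = 0.15  # 15 cm threshold from the task definition
MAX_STEPS = 1500         # per-episode step budget


def is_success(object_pos, goal_pos):
    """True if the object lies within the success radius of the goal position."""
    return math.dist(object_pos, goal_pos) <= SUCCESS_RADIUS_M


def episode_outcome(object_pos, goal_pos, steps_taken):
    """Classify an episode as 'success', 'timeout', or 'running'."""
    if is_success(object_pos, goal_pos):
        return "success"
    return "timeout" if steps_taken >= MAX_STEPS else "running"


print(episode_outcome((1.0, 0.5, 2.0), (1.1, 0.5, 2.0), steps_taken=900))   # success
print(episode_outcome((1.0, 0.5, 2.0), (2.0, 0.5, 2.0), steps_taken=1500))  # timeout
```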

We report performance on the easy and hard evaluation episodes from [16]. In easy episodes, both the target object and goal are on an open receptacle. This means the robot does not have to first open a receptacle before accessing the target object or goal. Instead, the agent can always execute the same sequence of navigate to object, pick up object, navigate to goal, and place object at goal. In hard episodes, the object may start in a closed receptacle. The agent, therefore, needs to use its visual input to perceive if the target object is in a closed receptacle. If so, the robot must then open the receptacle before picking the object. The object may start in either the fridge or a cabinet.

4.1.1 Training Setup:

We train AuxDistill using 11,791 episodes with a mix of easy and hard episodes across 63 scenes using the same rearrangement training dataset as in [50]. These episodes were obtained by sampling an equal number of episodes in the easy and hard categories, with easy episodes being equally sampled across open-cabinet, open-fridge and non-articulated episodes. For hard episodes, sampling is done uniformly between closed fridge and closed cabinet episodes. We use auxiliary tasks covering abilities included as a part of the standard rearrangement benchmark used in [50]. The auxiliary tasks are:

  • Pick: The agent spawns randomly within the house and must navigate to and pick up the object.

  • Place: The agent is spawned randomly in the house with the object in its gripper. The robot must navigate to and place the object within 15cm of the target 3D location on the receptacle.

  • Open Fridge: The robot spawns randomly within the house and must navigate to the fridge and use its base and arm to open the fridge door.

  • Open Cabinet: The robot spawns randomly in the house and must navigate to the kitchen area and open the cabinet using its base and arm.

  • Pick from Fridge: This task is similar to Pick except the agent has to pick up the object from inside an open fridge, which requires careful manipulation to minimize collisions.

The auxiliary relevance function is grounded in the oracle task plan [51] of the episode. It is formulated as an indicator function with $w_i(s_t) = 1$ if the state $s_t$ is relevant to $\mathcal{M}_i$ based on the oracle task plan of the episode and $w_i(s_t) = 0$ otherwise.

For example, if at step $t$ the agent is not holding an object and the object is in an open receptacle, the agent must first pick up the object, so $w_{i_{pick}}(s_t) = 1$ and $w_i(s_t) = 0$ for all other skills. For a complete description of the auxiliary tasks, including their reward functions, along with the precise definition of the auxiliary relevance function in object rearrangement, see Appendix B. We train methods for 475M steps of environment interactions. We found this number sufficient for the auxiliary tasks to progress far enough that distillation aids rearrangement performance on both the easy and hard episodes. For AuxDistill, we count the environment interactions in the auxiliary tasks towards this 475M step budget. We train with a learning rate of $3 \times 10^{-4}$ and linearly decay the learning rate to $0$ over the course of learning.
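The indicator relevance function described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the `PlanState` fields, the task names, the receptacle labels, and the branching order are hypothetical stand-ins for the oracle task plan, whose precise definition is given in Appendix B.

```python
from dataclasses import dataclass

# Hypothetical task identifiers mirroring the auxiliary task list above.
AUX_TASKS = ("pick", "place", "open_fridge", "open_cabinet", "pick_from_fridge")


@dataclass
class PlanState:
    """Hypothetical summary of privileged state consulted by the oracle plan."""
    holding_object: bool
    object_receptacle: str    # assumed labels: "fridge", "cabinet", or "surface"
    receptacle_is_open: bool  # ignored for "surface"


def relevant_task(s: PlanState) -> str:
    """Return the single auxiliary task assumed to be relevant in state s."""
    if s.holding_object:
        return "place"
    if s.object_receptacle == "fridge":
        return "pick_from_fridge" if s.receptacle_is_open else "open_fridge"
    if s.object_receptacle == "cabinet" and not s.receptacle_is_open:
        return "open_cabinet"
    return "pick"


def relevance(s: PlanState) -> dict:
    """Indicator weights w_i(s_t): 1 for the relevant task, 0 for all others."""
    active = relevant_task(s)
    return {task: int(task == active) for task in AUX_TASKS}
```

For the example in the text (no object held, object on an open receptacle), `relevance(PlanState(False, "surface", True))` assigns weight 1 to `pick` and 0 to every other task.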

| Method | Train (Seen), All Episodes | Train (Seen), Easy | Train (Seen), Hard | Eval (Unseen), All Episodes | Eval (Unseen), Easy | Eval (Unseen), Hard |
|---|---|---|---|---|---|---|
| All Skills | 28 | 36 | 19 | 30 | 39 | 21 |
| No Pick From Fridge | 20 | 32 | 7 | 12 | 23 | 2 |
| No Open Fridge | 41 | 49 | 33 | 48 | 57 | 39 |
| No Pick | 0 | 0 | 0 | 0 | 0 | 0 |

Table 2: Robustness to auxiliary task selection. We compare the train and generalization performance of AuxDistill with 4 different auxiliary task selections. Results are averages across 100 episodes on a single seed of training. The train and evaluation episodes are in a simplified setting compared to Table 1, as described in Section 4.3.

4.1.2 Baselines

We compare AuxDistill to both hierarchical methods governed by a task plan as well as monolithic baselines, which directly learn a pixels-to-actions policy using sensor observations. We compare to relevant baselines from Huang et al. [16] and two more baselines that use the auxiliary tasks.

  • Monolithic RL: A monolithic neural network is trained with end-to-end RL to map sensor observations directly to actions. The policy showed no signs of learning the main task, so we stopped training early after 100M steps. This outcome is consistent with the monolithic RL baseline in prior works [15, 22].

  • Galactic [15]: An end-to-end RL framework, similar to monolithic RL, that maps sensor observations directly to actions. The policy is trained in a simplified kinematic simulation for more than $10^9$ simulation steps and transferred to the dynamic simulation used in our setting.

  • M3 [17]: The skill policies navigate, pick, place, and open are each trained separately. These skills are then sequenced using a task planner.

  • M3 (Oracle) [17]: This uses the same skill training as M3, but uses an oracle task planner.

  • Skill Transformer [16]: This method uses pre-trained skills to collect successful rearrangement demonstrations and then train on these demonstrations with imitation learning.

  • RL Curriculum: This method is trained in two stages: (i) we first train the policy on the auxiliary tasks for 200M steps to ensure the relevant auxiliary tasks have good training performance; (ii) we then train on the entire rearrangement task for another 300M steps.

  • AuxDistill (No Distillation): AuxDistill trained under the setting described in Section 4.1.1, except with the distillation strength $\lambda = 0.0$.

For more details on the baselines, see Section B.2.

Figure 2: Comparison of skill-robustness to different choices of auxiliary task on the hard distribution. Including both Pick and Pick from Fridge is crucial for successful rearrangement on this distribution. Not utilizing Open Fridge leads to a boost in rearrangement success. This improvement arises because the Open Fridge skill is the easiest of all auxiliary tasks and utilizing it reduces the number of samples for the main task (from the 100M step budget). We discuss the auxiliary task learning curves in Appendix B.3.

4.1.3 Rearrangement Performance:

In Table 1, we compare AuxDistill to baselines in the rearrangement task. AuxDistill outperforms all baselines in all evaluation settings. It shows an absolute improvement of 22% and 25% on the easy and hard episodes, respectively, over the best-performing baseline. Most baselines struggle to achieve any performance on the unseen episodes. Monolithic RL [22] achieves no success in any of the evaluation settings, demonstrating the impracticality of learning directly from the dense reward in the full rearrangement task. Our method also outperforms Galactic [15] on the easy split despite learning from less than half the number of training samples (475M for AuxDistill vs. $> 10^9$ for Galactic), with 74% vs. 37% success on easy episodes during training and 75% vs. 26% on evaluation. Note that Galactic does not report performance on the hard distribution.

Like AuxDistill, RL Curriculum also utilizes the auxiliary tasks yet finds no success. This demonstrates the value of concurrent training on the main and auxiliary tasks with the distillation loss in AuxDistill. Likewise, AuxDistill (No Distill) achieves no success, illustrating the importance of the distillation loss. The learning curve in Figure 3(a) shows that AuxDistill gradually learns to complete the task with more environment interactions. On the other hand, RL Curriculum and AuxDistill (No Distill) remain at no success regardless of the number of learning samples.

AuxDistill also outperforms the strongest baseline, Skill Transformer, by a significant margin with 75% vs. 37% on the unseen easy episodes and 41% vs. 16% on the unseen hard episodes. This demonstrates the advantages of using online, end-to-end RL, rather than solely training offline with demonstrations. As shown in Figure 3(a), AuxDistill is able to improve with subsequent environment interactions, yet Skill Transformer is limited by the performance of the expert demonstrations.

Table 1 also demonstrates that AuxDistill outperforms hierarchical baselines. M3 is able to achieve 53% success on the unseen easy episodes since it utilizes strong pretrained skills. Even in this setting, AuxDistill achieves a higher success of 75%. However, on the hard episodes, M3 cannot dynamically plan skills and achieves no success in the hard unseen episodes. AuxDistill achieves 41% in this setting because it learns a monolithic policy with RL that combines low-level and high-level decision-making. AuxDistill even outperforms an oracle version of M3 that dynamically plans the skill sequence based on oracle state information (41% vs. 13% on the unseen hard episodes).

We observe that AuxDistill performs 7% worse on the Habitat 2.0 rearrangement training episodes than on the evaluation episodes. We find this performance gap is due to additional easier episodes in the evaluation distribution where the object is closer to the target location. In the hard split, there are 55 eval episodes and 29 train episodes where the object is picked up from the fridge and the target location is $< 0.5$ m from the object location. On removing these episodes, train performance is higher than evaluation, with 32% success on train and 31% success on evaluation.

| Method | Train (Seen) | Eval (Unseen) |
|---|---|---|
| AuxDistill | 12 ± 2 | 11 ± 1 |
| Monolithic | 8 ± 0 | 9 ± 1 |
| AuxDistill (No Distillation) | 3 ± 1 | 4 ± 1 |
| RL-Curriculum | 6 ± 1 | 8 ± 0 |

Table 3: Comparison of our method on the Category Pick task with auxiliary tasks on three random seeds. AuxDistill outperforms all methods, including the monolithic baseline, on the unseen evaluation split. Evaluation is conducted over 1000 seen episodes sampled from the training distribution and 1000 held-out episodes from evaluation.

4.2 Category Pick

The merits of AuxDistill extend to other challenging embodied tasks. In particular, consider a variant of the pick task described in Section 4.1 where a robot has to pick up an object using the object name passed as a one-hot embedding to the policy. Category Pick requires the policy to discern which object to pick up by correlating the RGB observation with the object category passed as a one-hot embedding. This task is more challenging than geometric (coordinate) pick, where the policy has access to the initial coordinates of the object to be picked. We include additional details about the task specification in Section B.4 of our supplementary material.
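As a small illustration of the conditioning signal, the following sketch builds a one-hot category vector. The category names and the `one_hot` helper are hypothetical; the actual category set and how the embedding enters the policy are not specified in this section.

```python
from typing import List, Sequence


def one_hot(category: str, categories: Sequence[str]) -> List[float]:
    """Encode `category` as a one-hot vector over the ordered `categories`."""
    if category not in categories:
        raise ValueError(f"unknown category: {category!r}")
    vec = [0.0] * len(categories)
    vec[categories.index(category)] = 1.0
    return vec


# Hypothetical category set, for illustration only.
CATEGORIES = ("apple", "bowl", "can")
goal = one_hot("bowl", CATEGORIES)  # [0.0, 1.0, 0.0]
```

Unlike a coordinate goal, this vector carries no spatial information, so the policy has to locate the object from its RGB observation.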

We leverage the easier coordinate pick task to aid the learning of the Category Pick task using AuxDistill. More specifically, we consider the task of interest $M_0$ to be the Category Pick task and $M_1$ (the auxiliary task we would like to distill from) to be the coordinate pick task. Both tasks are trained jointly using AuxDistill with the same reward structure, with $w_i(s_t) = 1$ and $\lambda = 0.5$ for distillation in Equation 1. For this task, we train on the full rearrange-easy dataset from [50] with 50,000 episodes.

We compare AuxDistill on the Category Pick task against the RL Curriculum and No Distillation baselines used in Table 1. In addition, we introduce a monolithic baseline for this task, which directly trains Category Pick without leveraging the coordinate pick task during training. We report the training curves in Figure 3(b). We evaluate the performance of the trained policy on 1000 episodes sampled from the training split and 1000 episodes from the validation split.

In Table 3, AuxDistill achieves the best performance, with a success rate of 11% on the held-out distribution, outperforming all baselines that do not leverage distillation during RL training. The closest baseline is the monolithic baseline, which achieves a success rate of 9% on the held-out distribution. AuxDistill (No Distillation) and RL-Curriculum perform worse, with evaluation success rates of 4% and 8%, respectively.

4.3 AuxDistill Analysis

(a) Rearrange learning curve.
(b) Category Pick learning.
Figure 3: Left: RL training success rates on training episodes on the rearrangement task from Table 1. Note that AuxDistill (No Distill) and RL-Curriculum are displayed but achieve 0% success throughout training. Right: Learning curve on the Category Pick task of AuxDistill utilizing coordinate pick distillation vs. monolithic RL. AuxDistill outperforms baselines in both settings. Displayed are averages and standard deviation over 3 random seeds.

4.3.1 Robustness to Auxiliary Task Selection

A critical consideration for training AuxDistill is the selection of auxiliary tasks. In this section, we compare four different selections of auxiliary tasks in the object rearrangement task from Section 4.1. We conduct this ablation on a smaller distribution of episodes, which includes only the Easy episodes and the Hard episodes where the object starts in the fridge, and train AuxDistill for 100M steps.

As we conduct this study only on the fridge category of articulated episodes, we modify the auxiliary task selection to be: Pick, Open Fridge, Place, and Pick from Fridge. Note that the first three auxiliary tasks are the same as in Section 4.1. The other auxiliary task selections are the same as this original selection, but exclude one of Pick, Open Fridge, and Pick from Fridge.

In Table 2, we report the performance for each skill selection on the easy and hard episodes. We observe that among all the sub-tasks, the most important is Pick, as excluding it results in 0.0% success. Note that Pick is only relevant ($w_i(s_t) > 0$) to easy episodes during training, i.e., removing it should only impact performance on the easy distribution of our training setup. However, we find that it affects both the easy and hard episodes, as the task is not successfully carried out in either case if Pick is removed. In Figure 2, we analyze the success of individual stages of rearrangement and notice that not including Pick leads to failure earlier in the task.

In Table 2, we notice that introducing the Open Fridge skill can worsen rearrangement performance (30% vs. 48%). The reason is that Open Fridge is the easiest of all auxiliary tasks (see Section B.3), and utilizing it reduces the number of samples allocated to the main task. However, certain auxiliary tasks can boost performance by addressing specific bottlenecks in the rearrangement. One such example is Pick from Fridge, which boosts performance on the hard training episodes from 7% to 19% by addressing the task of picking from an open fridge, which can be challenging due to the difficulty of avoiding collisions with the fridge door. Further, the 4% improvement from 32% to 36% on the easy training distribution can be attributed to easy episodes in our dataset with an open refrigerator, where Pick from Fridge helps boost performance.
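To make the sample-budget argument concrete, the sketch below computes how many environment steps reach the main task when a fixed budget is shared across tasks. The even split across tasks is an assumption made purely for illustration, since the allocation scheme is not specified in this section, and `main_task_steps` is a hypothetical helper.

```python
def main_task_steps(total_steps: int, num_aux_tasks: int) -> int:
    """Steps available to the main task under an ASSUMED even split.

    The budget is divided equally among the main task and every auxiliary task.
    """
    num_tasks = num_aux_tasks + 1  # auxiliary tasks plus the main task
    return total_steps // num_tasks


budget = 100_000_000  # 100M-step budget used in this ablation

# Pick, Place, Open Fridge, Pick from Fridge
with_open_fridge = main_task_steps(budget, num_aux_tasks=4)
# Same selection with Open Fridge removed
without_open_fridge = main_task_steps(budget, num_aux_tasks=3)
```

Under this assumed split, dropping one of four auxiliary tasks raises the main task's share from 20M to 25M steps, which matches the direction of the effect described above.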

4.3.2 Distillation Coefficient Variation

We analyze the effect of the distillation loss coefficient $\lambda$ in AuxDistill in Table 4. The coefficient controls how the agent balances minimizing the distillation loss from the auxiliary tasks against maximizing the reward on the main task. Using a high distillation weight, $\lambda = 1.0$, shifts the objective of the rearrangement task from maximizing cumulative reward to distilling from the auxiliary tasks, leading to a 1% success rate on rearrangement. Similarly, a very small coefficient, $\lambda = 0.01$, leads to insufficient leveraging of auxiliary task information, with a success rate of 7%. We find $\lambda = 0.1$ and $\lambda = 0.05$ to be good choices for optimizing task reward and distilling behaviors from the auxiliary tasks.

| Method | Train (Seen), All Episodes | Train (Seen), Easy | Train (Seen), Hard | Eval (Unseen), All Episodes | Eval (Unseen), Easy | Eval (Unseen), Hard |
|---|---|---|---|---|---|---|
| $\lambda = 0.01$ | 6 | 6 | 5 | 10 | 18 | 2 |
| $\lambda = 0.05$ | 30 | 31 | 28 | 36 | 33 | 38 |
| $\lambda = 0.1$ | 22 | 26 | 19 | 30 | 39 | 21 |
| $\lambda = 0.5$ | 2 | 3 | 0 | 2 | 5 | 0 |
| $\lambda = 1.0$ | 0 | 1 | 0 | 1 | 2 | 0 |

Table 4: Performance of our method for a single seed on varying distillation coefficient during training. Using a large distillation coefficient ($\lambda = 1.0$) makes reward optimization challenging, and too small a coefficient ($\lambda = 0.01$) results in insufficient auxiliary distillation to succeed on the rearrangement task. The intermediate values $\lambda = 0.05$ and $\lambda = 0.1$ show the best performance on training and evaluation.
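The trade-off controlled by $\lambda$ can be sketched as follows. This is an illustrative assumption rather than the paper's Equation 1: it uses discrete action distributions for simplicity, takes the divergence as KL from each auxiliary-task policy output to the main-task policy output, and adds the relevance-weighted sum to the RL loss scaled by $\lambda$. All function names are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) between two discrete action distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def distillation_loss(main_probs, aux_probs, weights):
    """Relevance-weighted sum of divergences from each auxiliary policy output.

    main_probs: action distribution of the main-task policy at one state.
    aux_probs:  {task: distribution} from the same state under each auxiliary task.
    weights:    {task: w_i(s_t)} relevance values in {0, 1}.
    """
    return sum(weights[t] * kl_divergence(aux_probs[t], main_probs) for t in aux_probs)


def total_loss(rl_loss, main_probs, aux_probs, weights, lam=0.1):
    """Assumed combination: RL loss plus lambda times the distillation term."""
    return rl_loss + lam * distillation_loss(main_probs, aux_probs, weights)


# Toy example: only "pick" is relevant, so only its divergence contributes.
main = [0.5, 0.5]
aux = {"pick": [0.9, 0.1], "place": [0.5, 0.5]}
w = {"pick": 1, "place": 0}
loss = total_loss(rl_loss=1.0, main_probs=main, aux_probs=aux, weights=w, lam=0.1)
```

With `lam=0.0` the objective reduces to the RL loss alone, which corresponds to the No Distillation baseline. As `lam` grows, the distillation term increasingly dominates the reward objective.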

5 Conclusion

In this work, we presented AuxDistill, a new method for end-to-end RL on complex tasks by leveraging auxiliary tasks. AuxDistill learns in the auxiliary tasks concurrently with the main task through multi-task RL. A distillation loss transfers relevant behaviors from the auxiliary tasks to the main task. We show that AuxDistill outperforms a variety of baselines in Habitat object rearrangement by up to 27%. We also show another application of AuxDistill in a category-conditioned manipulation task. Finally, we analyze AuxDistill in different auxiliary task selections and magnitudes of distillation strengths. Overall, AuxDistill presents a new way to tackle compound tasks with RL alone without requiring pre-trained skills or expert demonstrations.

Limitations of AuxDistill include the need for the auxiliary tasks and the auxiliary task relevance function. The auxiliary tasks require knowing behaviors that are relevant and easier to learn than the main task. Each auxiliary task requires defining a new start state distribution and reward function. However, designing these can rely on privileged state information which is less restrictive than the requirement of pre-trained skills or expert demonstrations.

References

  • Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
  • Berner et al. [2019] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
  • Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
  • Shah et al. [2016] Pararth Shah, Dilek Hakkani-Tür, and Larry Heck. Interactive reinforcement learning for task-oriented dialogue management. In NIPS 2016 Deep Learning for Action and Interaction Workshop, volume 11, 2016.
  • Liu et al. [2017] Bing Liu, Gokhan Tur, Dilek Hakkani-Tur, Pararth Shah, and Larry Heck. End-to-end optimization of task-oriented dialogue model with deep reinforcement learning. arXiv preprint arXiv:1711.10712, 2017.
  • Nakano et al. [2021] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
  • Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • Akkaya et al. [2019] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
  • Qi et al. [2023] Haozhi Qi, Ashish Kumar, Roberto Calandra, Yi Ma, and Jitendra Malik. In-hand object rotation via rapid motor adaptation. In Conference on Robot Learning, pages 1722–1732. PMLR, 2023.
  • Kalashnikov et al. [2018] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. Scalable deep reinforcement learning for vision-based robotic manipulation. In 2nd Annual Conference on Robot Learning, CoRL 2018, Zürich, Switzerland, 29-31 October 2018, Proceedings, volume 87 of Proceedings of Machine Learning Research, pages 651–673. PMLR, 2018. URL http://proceedings.mlr.press/v87/kalashnikov18a.html.
  • Batra et al. [2020] Dhruv Batra, Angel X Chang, Sonia Chernova, Andrew J Davison, Jia Deng, Vladlen Koltun, Sergey Levine, Jitendra Malik, Igor Mordatch, Roozbeh Mottaghi, et al. Rearrangement: A challenge for embodied ai. arXiv preprint arXiv:2011.01975, 2020.
  • Harutyunyan et al. [2019] Anna Harutyunyan, Will Dabney, Thomas Mesnard, Mohammad Gheshlaghi Azar, Bilal Piot, Nicolas Heess, Hado P van Hasselt, Gregory Wayne, Satinder Singh, Doina Precup, et al. Hindsight credit assignment. Advances in neural information processing systems, 32, 2019.
  • Ni et al. [2024] Tianwei Ni, Michel Ma, Benjamin Eysenbach, and Pierre-Luc Bacon. When do transformers shine in rl? decoupling memory from credit assignment. Advances in Neural Information Processing Systems, 36, 2024.
  • Berges et al. [2023] Vincent-Pierre Berges, Andrew Szot, Devendra Singh Chaplot, Aaron Gokaslan, Roozbeh Mottaghi, Dhruv Batra, and Eric Undersander. Galactic: Scaling end-to-end reinforcement learning for rearrangement at 100k steps-per-second. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13767–13777, 2023.
  • Huang et al. [2023] Xiaoyu Huang, Dhruv Batra, Akshara Rai, and Andrew Szot. Skill transformer: A monolithic policy for mobile manipulation. arXiv preprint arXiv:2308.09873, 2023.
  • Gu et al. [2022] Jiayuan Gu, Devendra Singh Chaplot, Hao Su, and Jitendra Malik. Multi-skill mobile manipulation for object rearrangement. arXiv preprint arXiv:2209.02778, 2022.
  • Narvekar et al. [2020] Sanmit Narvekar, Bei Peng, Matteo Leonetti, Jivko Sinapov, Matthew E Taylor, and Peter Stone. Curriculum learning for reinforcement learning domains: A framework and survey. The Journal of Machine Learning Research, 21(1):7382–7431, 2020.
  • Dennis et al. [2020] Michael Dennis, Natasha Jaques, Eugene Vinitsky, Alexandre Bayen, Stuart Russell, Andrew Critch, and Sergey Levine. Emergent complexity and zero-shot transfer via unsupervised environment design. Advances in neural information processing systems, 33:13049–13061, 2020.
  • Azad et al. [2023] Abdus Salam Azad, Izzeddin Gur, Jasper Emhoff, Nathaniel Alexis, Aleksandra Faust, Pieter Abbeel, and Ion Stoica. Clutr: Curriculum learning via unsupervised task representation learning. In International Conference on Machine Learning, pages 1361–1395. PMLR, 2023.
  • Fang et al. [2022] Kuan Fang, Toki Migimatsu, Ajay Mandlekar, Li Fei-Fei, and Jeannette Bohg. Active task randomization: Learning visuomotor skills for sequential manipulation by proposing feasible and novel tasks. arXiv preprint arXiv:2211.06134, 2022.
  • Szot et al. [2021] Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, et al. Habitat 2.0: Training home assistants to rearrange their habitat. Advances in Neural Information Processing Systems, 34, 2021.
  • Sutton et al. [1999] Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
  • Bacon et al. [2017] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
  • Zhang and Whiteson [2019] Shangtong Zhang and Shimon Whiteson. Dac: The double actor-critic architecture for learning options. Advances in Neural Information Processing Systems, 32, 2019.
  • Xia et al. [2020] Fei Xia, Chengshu Li, Roberto Martín-Martín, Or Litany, Alexander Toshev, and Silvio Savarese. Relmogen: Leveraging motion generation in reinforcement learning for mobile manipulation. arXiv preprint arXiv:2008.07792, 2020.
  • Karkus et al. [2020] Peter Karkus, Mehdi Mirza, Arthur Guez, Andrew Jaegle, Timothy Lillicrap, Lars Buesing, Nicolas Heess, and Theophane Weber. Beyond tabula-rasa: a modular reinforcement learning approach for physically embedded 3d sokoban. arXiv preprint arXiv:2010.01298, 2020.
  • Dalal et al. [2021] Murtaza Dalal, Deepak Pathak, and Russ R Salakhutdinov. Accelerating robotic reinforcement learning via parameterized action primitives. Advances in Neural Information Processing Systems, 34:21847–21859, 2021.
  • Hafner et al. [2022] Danijar Hafner, Kuang-Huei Lee, Ian Fischer, and Pieter Abbeel. Deep hierarchical planning from pixels. Advances in Neural Information Processing Systems, 35:26091–26104, 2022.
  • Vezzani et al. [2022] Giulia Vezzani, Dhruva Tirumala, Markus Wulfmeier, Dushyant Rao, Abbas Abdolmaleki, Ben Moran, Tuomas Haarnoja, Jan Humplik, Roland Hafner, Michael Neunert, et al. Skills: Adaptive skill sequencing for efficient temporally-extended exploration. arXiv preprint arXiv:2211.13743, 2022.
  • Chen et al. [2023] Yuanpei Chen, Chen Wang, Li Fei-Fei, and C Karen Liu. Sequential dexterity: Chaining dexterous policies for long-horizon manipulation. arXiv preprint arXiv:2309.00987, 2023.
  • Mishra et al. [2023] Utkarsh Aashu Mishra, Shangjie Xue, Yongxin Chen, and Danfei Xu. Generative skill chaining: Long-horizon skill planning with diffusion models. In Conference on Robot Learning, pages 2905–2925. PMLR, 2023.
  • Lee et al. [2021] Youngwoon Lee, Joseph J Lim, Anima Anandkumar, and Yuke Zhu. Adversarial skill chaining for long-horizon robot manipulation via terminal state regularization. arXiv preprint arXiv:2111.07999, 2021.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Rudin et al. [2022] Nikita Rudin, David Hoeller, Philipp Reist, and Marco Hutter. Learning to walk in minutes using massively parallel deep reinforcement learning. In Conference on Robot Learning, pages 91–100. PMLR, 2022.
  • Agarwal et al. [2023] Ananye Agarwal, Ashish Kumar, Jitendra Malik, and Deepak Pathak. Legged locomotion in challenging terrains using egocentric vision. In Conference on Robot Learning, pages 403–415. PMLR, 2023.
  • Fu et al. [2023] Zipeng Fu, Xuxin Cheng, and Deepak Pathak. Deep whole-body control: learning a unified policy for manipulation and locomotion. In Conference on Robot Learning, pages 138–149. PMLR, 2023.
  • Kumar et al. [2021] Ashish Kumar, Zipeng Fu, Deepak Pathak, and Jitendra Malik. Rma: Rapid motor adaptation for legged robots. RSS, 2021.
  • Radosavovic et al. [2023] Ilija Radosavovic, Tete Xiao, Bike Zhang, Trevor Darrell, Jitendra Malik, and Koushil Sreenath. Learning humanoid locomotion with transformers. arXiv preprint arXiv:2303.03381, 2023.
  • Katara et al. [2023] Pushkal Katara, Zhou Xian, and Katerina Fragkiadaki. Gen2sim: Scaling up robot learning in simulation with generative models. arXiv preprint arXiv:2310.18308, 2023.
  • Ye et al. [2021a] Joel Ye, Dhruv Batra, Erik Wijmans, and Abhishek Das. Auxiliary tasks speed up learning point goal navigation. In Conference on Robot Learning, pages 498–516. PMLR, 2021a.
  • Ye et al. [2021b] Joel Ye, Dhruv Batra, Abhishek Das, and Erik Wijmans. Auxiliary tasks and exploration enable objectgoal navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16117–16126, 2021b.
  • Gregor et al. [2019] Karol Gregor, Danilo Jimenez Rezende, Frederic Besse, Yan Wu, Hamza Merzic, and Aaron van den Oord. Shaping belief states with generative environment models for rl. Advances in Neural Information Processing Systems, 32, 2019.
  • Kuo et al. [2023] Chia-Wen Kuo, Chih-Yao Ma, Judy Hoffman, and Zsolt Kira. Structure-encoding auxiliary tasks for improved visual representation in vision-and-language navigation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1104–1113, 2023.
  • Jia et al. [2022] Zhiwei Jia, Xuanlin Li, Zhan Ling, Shuang Liu, Yiran Wu, and Hao Su. Improving policy optimization with generalist-specialist learning. In International Conference on Machine Learning, pages 10104–10119. PMLR, 2022.
  • Baker et al. [2019] Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528, 2019.
  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR, 2016.
  • Hessel et al. [2019] Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado Van Hasselt. Multi-task deep reinforcement learning with popart. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3796–3803, 2019.
  • robotics [2020] Fetch robotics. Fetch. http://fetchrobotics.com/, 2020.
  • Szot et al. [2022] Andrew Szot, Karmesh Yadav, Alex Clegg, Vincent-Pierre Berges, Aaron Gokaslan, Angel Chang, Manolis Savva, Zsolt Kira, and Dhruv Batra. Habitat rearrangement challenge 2022. https://aihabitat.org/challenge/2022_rearrange, 2022.
  • Fikes and Nilsson [1971] Richard E Fikes and Nils J Nilsson. Strips: A new approach to the application of theorem proving to problem solving. Artificial intelligence, 2(3-4):189–208, 1971.
  • Wijmans et al. [2019] Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, and Dhruv Batra. Dd-ppo: Learning near-perfect pointgoal navigators from 2.5 billion frames. arXiv preprint arXiv:1911.00357, 2019.

Appendix A Additional Method Details

A.1 Pseudocode

Initialize policy $\pi_{\theta}$
Initialize state space, observation space, and action space $\{\mathcal{S}, \mathcal{O}, \mathcal{A}\}$
Define relevance $w_i : \mathcal{S} \rightarrow \{0, 1\} \;\; \forall\, i \in \{1, 2, \cdots, N\}$
Initialize distillation coefficient $\lambda$ and normalization parameter $\beta = 3 \times 10^{-4}$
for $epoch \leftarrow 1$ to train-epochs do
      // Bold face represents a vector.
      Collect a batch of $B$ samples $\{\mathbf{r}_i, \mathbf{o}_i, \mathbf{s}_i, \mathbf{a}_i\}_{i=0}^{N}$ by executing $\{\pi_{\theta}^{T_i}\}_{i=0}^{N}$
      Normalize returns per task: $\{\mathbf{r}^{n}_i\}_{i=0}^{N} = \mathrm{PopArt}(\{\mathbf{r}_i\}_{i=0}^{N}, \beta)$
      Obtain $\{\pi_{\theta}^{T_i}(\mathbf{o}_0)\}_{i=1}^{N}$ by evaluating $\mathbf{o}_0$ with $\{T_i\}_{i=1}^{N}$
      // Compute losses
      Compute RL losses $L_{RL}$ using $\{\mathbf{r}^{n}_i, \mathbf{o}_i, \mathbf{s}_i, \mathbf{a}_i\}_{i=0}^{N}$
      Compute distill loss $L_d = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{B} w_i(s^{t}_0) \, D_{\text{KL}}\big(\pi_{\theta}^{T_0}(o^{t}_0) \,\|\, \pi_{\theta}^{T_i}(o^{t}_0)\big)$
      Update $\pi_{\theta}$ using $L_{distill} = L_{RL} + \lambda L_d$
end for
Algorithm 1: The workflow for training AuxDistill from a randomly initialized policy

In this section, we describe the workflow for implementing AuxDistill, including data collection for all of the tasks in our training mix, computation of the distillation loss, and policy optimization. The full procedure is given in Algorithm 1.

We initialize a policy $\pi_{\theta}$ along with the state space $\mathcal{S}$, observation space $\mathcal{O}$, and action space $\mathcal{A}$. Each relevance function $w_i : \mathcal{S} \rightarrow \{0, 1\}$, $i \in \{1, 2, \cdots, N\}$, maps the robot state to a binary relevance indicator used for distillation. The distillation coefficient is set to $\lambda = 0.1$ and the PopArt normalization parameter to $\beta = 3 \times 10^{-4}$. In each epoch of our training cycle, we execute the policy on every task and collect a batch of returns $\{\mathbf{r}_i\}_{i=0}^{N}$, observations $\{\mathbf{o}_i\}_{i=0}^{N}$, states $\{\mathbf{s}_i\}_{i=0}^{N}$, and actions $\{\mathbf{a}_i\}_{i=0}^{N}$. The returns are then normalized per task using PopArt with parameter $\beta$, giving $\{\mathbf{r}^{n}_i\}_{i=0}^{N}$. The observations of the main task, $\mathbf{o}_0$, are evaluated under each auxiliary task indicator $T_i$ to obtain $\{\pi^{T_i}_{\theta}(\mathbf{o}_0)\}_{i=1}^{N}$. Note that $\mathbf{s}_i$ represents the complete state information at a given time step, whereas $\mathbf{o}_i$ represents the visual observation available to the agent.
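The per-task return normalization step can be illustrated with a minimal sketch. It tracks only the running first and second moments of the returns with step size $\beta$, which is the statistics-tracking half of PopArt; the output-preserving rescaling of the value head that PopArt also performs is omitted. The class name, the initial moment values, and the variance clamp are assumptions made for illustration, not details from the paper.

```python
import math


class RunningReturnNormalizer:
    """Tracks first and second moments of returns with step size beta.

    Hypothetical helper: covers only the moment tracking of PopArt,
    not the rescaling of the value head.
    """

    def __init__(self, beta=3e-4):
        self.beta = beta
        self.mean = 0.0      # running estimate of E[G] (assumed initial value)
        self.sq_mean = 1.0   # running estimate of E[G^2] (assumed initial value)

    def update(self, returns):
        # Exponential moving averages of the batch moments.
        batch_mean = sum(returns) / len(returns)
        batch_sq_mean = sum(g * g for g in returns) / len(returns)
        self.mean = (1 - self.beta) * self.mean + self.beta * batch_mean
        self.sq_mean = (1 - self.beta) * self.sq_mean + self.beta * batch_sq_mean

    @property
    def std(self):
        # Clamp the variance so the division in normalize() stays well defined.
        var = max(self.sq_mean - self.mean ** 2, 1e-8)
        return math.sqrt(var)

    def normalize(self, returns):
        return [(g - self.mean) / self.std for g in returns]


# One normalizer per task index i = 0..N (main task plus 5 auxiliary tasks).
normalizers = {task_id: RunningReturnNormalizer(beta=3e-4) for task_id in range(6)}
```

Keeping separate statistics per task means that tasks with very different reward scales contribute targets of comparable magnitude to the shared value head.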

As described in Section 3, our policy optimizes a weighted combination of two objectives. The first is the RL loss $L_{RL}$ required for a regular PPO update, computed using the normalized returns, observations, states, and actions $\{\mathbf{r}_i^{n}, \mathbf{o}_i, \mathbf{s}_i, \mathbf{a}_i\}$. The second is the distillation loss $L_d$, computed at each time step $t \in \{1, 2, \cdots, B\}$ as the KL divergence $D_{\text{KL}}\big(\pi_{\theta}^{T_0}(o^{t}_0) \,\|\, \pi_{\theta}^{T_i}(o^{t}_0)\big)$ weighted by the task relevance $w_i(s_0^{t})$. The total loss for policy optimization combines the two using the weighting parameter $\lambda$: $L_{\text{distill}} = L_{RL} + \lambda L_d$.
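As a rough illustration of the loss described above, the following sketch computes $L_d$ and the combined objective for diagonal Gaussian policies. It assumes that $\sigma$ denotes a per-dimension standard deviation and uses the closed-form Gaussian KL divergence; the function names and the list-based data layout are hypothetical. Whether gradients are propagated through the auxiliary-task branch is not covered here.

```python
import math


def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """KL(p || q) for diagonal Gaussians given as lists of means and std devs."""
    total = 0.0
    for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q):
        total += math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2 * sq ** 2) - 0.5
    return total


def distill_loss(main_dists, aux_dists, relevance):
    """Relevance-weighted KL between main-task and auxiliary-task policies.

    main_dists[t]   = (mu, sigma) of pi^{T_0}(o_0^t)
    aux_dists[i][t] = (mu, sigma) of pi^{T_i}(o_0^t), one list per auxiliary task
    relevance[i][t] = w_i(s_0^t), either 0 or 1
    """
    n_aux = len(aux_dists)
    loss = 0.0
    for i in range(n_aux):
        for t, (mu0, sigma0) in enumerate(main_dists):
            if relevance[i][t]:
                mu_i, sigma_i = aux_dists[i][t]
                loss += gaussian_kl(mu0, sigma0, mu_i, sigma_i)
    return loss / n_aux


def total_loss(l_rl, l_d, lam=0.1):
    # L_distill = L_RL + lambda * L_d
    return l_rl + lam * l_d


# Toy batch: B = 2 time steps, N = 2 auxiliary tasks, 1-D actions.
main = [([0.0], [1.0]), ([0.0], [1.0])]
aux = [
    [([1.0], [1.0]), ([1.0], [1.0])],
    [([0.0], [1.0]), ([2.0], [1.0])],
]
w = [[1, 0], [0, 1]]
l_d = distill_loss(main, aux, w)
```

Because the relevance weights are binary, each main-task time step is pulled only toward the auxiliary policies whose preconditions hold at that step.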

A.2 Additional Implementation Details

In this section, we further detail the network architecture and implementation used to train AuxDistill. The agent captures visual observations using a 256 × 256 depth sensor, which are encoded by a ResNet50 [47] architecture. The visual features are concatenated with the proprioceptive and goal sensor observations and passed to the LSTM backbone network. The LSTM has 2 hidden layers with 128 hidden units per layer and generates a state representation of the environment. The LSTM output features are concatenated with the task indicator $T$ and regressed to a multi-task value head using a 2-layer critic network. The LSTM output features are also used to regress to $\{\mu, \sigma\} \in \mathbb{R}^{10}$. The actions are then sampled from a Gaussian distribution defined by $\{\mu, \sigma\}$, i.e., $a_t \sim \mathcal{N}(\mu, \sigma)$. Overall, our network has approximately 13M trainable parameters.
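The two output paths described above can be sketched in a few lines: the critic input is the LSTM feature vector concatenated with a task indicator, and actions are drawn from a diagonal Gaussian. This is a simplified sketch, assuming a one-hot task indicator over 6 tasks, a 128-dimensional feature vector, and $\sigma$ interpreted as a standard deviation; all function names are hypothetical and no neural network is involved.

```python
import math
import random

NUM_TASKS = 6      # 5 auxiliary tasks plus the main task
ACTION_DIM = 10    # dimensionality of mu and sigma


def one_hot(index, size):
    return [1.0 if k == index else 0.0 for k in range(size)]


def critic_input(lstm_features, task_index):
    # The critic sees the LSTM features concatenated with the task indicator T.
    return list(lstm_features) + one_hot(task_index, NUM_TASKS)


def sample_action(mu, sigma, rng):
    # Draw each action dimension independently from N(mu_k, sigma_k^2).
    return [rng.gauss(m, s) for m, s in zip(mu, sigma)]


def log_prob(action, mu, sigma):
    # Log-density of a diagonal Gaussian, as used in a PPO-style probability ratio.
    return sum(
        -0.5 * ((a - m) / s) ** 2 - math.log(s) - 0.5 * math.log(2 * math.pi)
        for a, m, s in zip(action, mu, sigma)
    )


rng = random.Random(0)
features = [0.0] * 128
value_in = critic_input(features, task_index=2)
action = sample_action([0.0] * ACTION_DIM, [1.0] * ACTION_DIM, rng)
```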

The policy is updated using DD-PPO [52], which collects data from 24 environment workers parallelized across 8 GPUs and spanning 6 tasks (5 auxiliary tasks and the main task). The policy is updated after collecting 128 steps of experience in each worker, with a pre-emption threshold of 0.25. The policy is trained for 450M steps collected across all auxiliary tasks. The starting learning rate is $3 \times 10^{-4}$, with linear learning rate decay to 0 over 500M steps. Before each policy update, we normalize the returns per task using PopArt with $\beta = 3.0 \times 10^{-4}$ (see [48] for more details).
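Since training stops at 450M steps while the decay horizon is 500M steps, the learning rate never reaches zero. A small sketch of the schedule makes this concrete; the function name is hypothetical, and the schedule is assumed to be a plain linear interpolation in environment steps.

```python
def linear_decay_lr(step, base_lr=3e-4, decay_steps=500_000_000):
    """Learning rate after `step` environment steps, clipped at zero."""
    remaining = max(0.0, 1.0 - step / decay_steps)
    return base_lr * remaining


TRAIN_STEPS = 450_000_000
lr_start = linear_decay_lr(0)
lr_end = linear_decay_lr(TRAIN_STEPS)
```

Under these assumptions the final learning rate is roughly $3 \times 10^{-5}$, one tenth of the starting value.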

We utilize PPO with a value loss coefficient of 0.5 and an entropy coefficient of 0.001 to incentivize exploration. Considering the longer horizon of the rearrangement task, we set the discount rate to $\gamma = 0.999$. The distillation loss is computed by evaluating the observations of the main task under all auxiliary task indicators $\{T_i\}_{i=1}^{N}$ and computing the weighted KL divergence using the auxiliary task relevance function. In our experiments, we determine the relevance function using the episode's current robot state and the oracle task plan (see Table 5 for more details). The distillation loss is combined with the RL losses using a weighting factor of $\lambda = 0.1$.

Appendix B Additional Experiment Details

B.1 Additional Task Details

We train our method on a setup similar to the Habitat-Rearrangement challenge [50], in which a robot is spawned in an indoor home environment with the goal of manipulating objects. The robot has no access to privileged maps or other oracle information and operates solely from egocentric perception. Each episode is specified by the 3D coordinate locations of the object and the goal at the beginning of the episode. As this dataset is imbalanced across easy and hard episodes, we sub-sample 11,791 episodes from the full rearrange dataset of 50,000 episodes. The episodes are obtained by selecting an equal percentage of episodes with the object spawned inside a closed cabinet, a closed fridge, or an open receptacle. To carry out the task, we use a Fetch robot with a mobile base and a 7-DOF arm with a gripper. We provide the following inputs to the policy:

  • Depth Camera: a 256 × 256 depth camera attached to the head of the robot.

  • Coordinate Sensors: the Euclidean distance from the 3D coordinate of the robot end-effector to the object to be picked up and to the location where the object is to be placed.

  • Holding sensor: indicating whether the robot is grasping any object.

  • Relative Resting Position Sensor: highlighting the Euclidean distance of the 3D coordinate position of the end-effector to the resting position.

  • Joint sensor: indicating the 7-DOF joint position of the robot arm.

  • Task Indicator: encoded as a one-hot vector indicating which of the tasks is currently being executed.

The agent can interact with the environment for a maximum of 1500 steps. Further, we enforce a step threshold for each task stage based on the auxiliary tasks defined in Section B.1.1. During training, we utilize an instantaneous force threshold of 30 kN, but we do not apply the force threshold during evaluation, consistent with [16].

B.1.1 Auxiliary Task Definitions

In this section, we describe the auxiliary task definitions for each of the tasks utilized in our training and ablation experiments.

Pick: In this task, the agent is spawned randomly in the house, at least 3 m from the object of interest and without the object in hand. The task is considered successful if the agent navigates to the object, picks it up by calling a grip action when it is within 0.15 m of the object, and rearranges its arm to within 0.15 m of the resting position. The horizon length of this task is 700 steps. The reward function for this task is:

$$R(s_t) = 10\,\mathbb{I}_{success} + 2\Delta^{o}_{arm}\,\mathbb{I}_{!holding} + 2\Delta^{r}_{arm}\,\mathbb{I}_{holding} + 2\,\mathbb{I}_{pick}$$

Here, $\mathbb{I}_{pick}$ indicates that the agent has picked up the object, and $\mathbb{I}_{success}$ indicates that the agent has both picked up the object and rearranged its arm to the resting position. $\Delta^{o}_{arm}$ represents the Euclidean distance from the robot arm to the object, and $\Delta^{r}_{arm}$ represents the deviation of the arm from the resting position.

Place: This task involves the agent being spawned randomly in the house at least 3 m from the object without the object in hand. The agent has to navigate to the target receptacle, place the object within 0.15 m of the goal location, and rearrange its arm to its resting position. The horizon length of this task is 700 steps. The reward function for this task is:

$$R(s_t) = 10\,\mathbb{I}_{success} + 2\Delta^{t}_{arm}\,\mathbb{I}_{holding} + 2\Delta^{r}_{arm}\,\mathbb{I}_{!holding} + 5\,\mathbb{I}_{place}$$

Here, $\mathbb{I}_{success}$ and $\mathbb{I}_{place}$ represent sparse rewards for successful task completion and for placing the object, respectively. $\Delta^{t}_{arm}$ represents the per-time-step deviation of the robot arm from the target location while the agent is holding the object, and $\Delta^{r}_{arm}$ represents the deviation of the robot arm from the resting position after the object has been placed successfully.

Open-Cabinet: In this task, the robot is spawned randomly in the house and must navigate to the cabinet and open the drawer by calling the grasp action within 0.15 m of the drawer handle marker. The drawer is then opened to a joint position of 0.45. Further, the agent must rearrange its arm to its resting position. The horizon length of this task is 600 steps. The reward function for this task is:

$$R(s_t) = 10\,\mathbb{I}_{success} + \Delta^{m}_{arm}\,\mathbb{I}_{!open} + 10\Delta^{r}_{arm}\,\mathbb{I}_{open} + 5\,\mathbb{I}_{open} + 5\,\mathbb{I}_{grasp}$$

Here, $\mathbb{I}_{success}$ is an indicator for successful opening followed by arm rearrangement, $\mathbb{I}_{open}$ is the indicator for the drawer being successfully opened, and $\mathbb{I}_{grasp}$ is an indicator for the drawer handle being successfully grasped. $\Delta^{m}_{arm}$ and $\Delta^{r}_{arm}$ encode the dense per-time-step reward based on the change in arm position relative to the target marker location and the resting position, respectively.

Open-Fridge: In this task, the robot is spawned randomly in the house and must navigate to the fridge in the scene and grasp the fridge handle marker by calling the grasp action within 0.15 m of the fridge door handle. The fridge door must be opened to a joint position of 1.22, and the arm must be rearranged to its resting position. The horizon length of this task is 600 steps. The per-time-step reward for this task is:

$$R(s_t) = 10\,\mathbb{I}_{success} + \Delta^{m}_{arm}\,\mathbb{I}_{!open} + \Delta^{r}_{arm}\,\mathbb{I}_{open} + 5\,\mathbb{I}_{open} + 5\,\mathbb{I}_{grasp}$$

Here, $\mathbb{I}_{success}$, $\mathbb{I}_{open}$, and $\mathbb{I}_{grasp}$ are defined as for the Open-Cabinet skill.
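The four reward functions defined so far share the same shape: a sparse success bonus, one dense term that is active before the key event and another that is active after it, and sparse bonuses for intermediate events. The sketch below transcribes them directly. The $\Delta$ terms are passed in as precomputed scalars, since their exact computation and sign convention are not specified here; function and argument names are hypothetical.

```python
def pick_reward(success, picked, holding, d_obj, d_rest):
    # 10*I_success + 2*D^o*I_!holding + 2*D^r*I_holding + 2*I_pick
    return 10 * success + 2 * d_obj * (not holding) + 2 * d_rest * holding + 2 * picked


def place_reward(success, placed, holding, d_target, d_rest):
    # 10*I_success + 2*D^t*I_holding + 2*D^r*I_!holding + 5*I_place
    return 10 * success + 2 * d_target * holding + 2 * d_rest * (not holding) + 5 * placed


def _open_reward(success, opened, grasped, d_marker, d_rest, rest_coeff):
    # Shared form of the two opening tasks; only the resting-term coefficient differs.
    return (
        10 * success
        + d_marker * (not opened)
        + rest_coeff * d_rest * opened
        + 5 * opened
        + 5 * grasped
    )


def open_cabinet_reward(success, opened, grasped, d_marker, d_rest):
    return _open_reward(success, opened, grasped, d_marker, d_rest, rest_coeff=10)


def open_fridge_reward(success, opened, grasped, d_marker, d_rest):
    return _open_reward(success, opened, grasped, d_marker, d_rest, rest_coeff=1)
```

Written this way, the only structural difference between the two opening tasks is the coefficient on the resting-position term (10 for Open-Cabinet, 1 for Open-Fridge).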

Pick from Fridge: This task is similar in structure to the Pick skill, except that the object has to be picked up from an open refrigerator and the agent is spawned less than 2 m from the target object.

Which of these auxiliary tasks are used for distillation at a given step is determined by whether the auxiliary task is relevant to the agent's current state in the rearrangement task. Table 5 lists the relevance conditions for each task.

Pre-Conditions for Distillation
Auxiliary Task, Object Receptacle, Did Pick Object?, Is Receptacle Open?
Pick: Open, ×
Place: Open, Fridge, Cabinet; ✓, ×
Open-Fridge: Fridge, ×, ×
Open-Cabinet: Cabinet, ×, ×
Pick from Fridge: Fridge, ×
Table 5: A table representing the relevance of each of the auxiliary tasks based on the stage of the task the robot is in. Did Pick Object? represents the success of picking up the correct object and Is Receptacle Open? represents whether the robot has successfully opened the receptacle once during the episode. The Object-Receptacle encodes oracle information about the category of episodes we’re operating on. The open fridge and open cabinet tasks represent cases when the object is in an open receptacle.
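One possible reading of Table 5 as a relevance function is sketched below. Several cells of the table are ambiguous in this version of the text, so the mapping is an interpretation rather than the authors' definition. In particular, it assumes that Place becomes relevant once the object has been picked regardless of receptacle, that Pick from Fridge becomes relevant once the fridge has been opened, and that no auxiliary task is relevant for an opened cabinet before the pick. The function name and string labels are hypothetical.

```python
def relevant_tasks(receptacle, did_pick, is_open):
    """Return the auxiliary tasks with w_i = 1 for a main-task state.

    receptacle: "open", "fridge", or "cabinet" (oracle episode category)
    did_pick:   whether the robot has picked up the correct object
    is_open:    whether the receptacle has been opened during the episode
    """
    if did_pick:
        # Assumed: after the pick, only Place is relevant for every receptacle.
        return ["Place"]
    if receptacle == "open":
        return ["Pick"]
    if receptacle == "fridge":
        # Assumed: Pick from Fridge applies once the fridge has been opened.
        return ["Pick from Fridge"] if is_open else ["Open-Fridge"]
    if receptacle == "cabinet":
        # Assumed: the table lists no auxiliary task for an opened cabinet.
        return [] if is_open else ["Open-Cabinet"]
    raise ValueError(f"unknown receptacle: {receptacle!r}")
```

For example, under this reading a fridge episode moves through Open-Fridge, then Pick from Fridge, then Place as the episode progresses.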
Figure 4: Success curves of the individual skills for the main experiment reported in Table 1. Panels: (a) Pick, (b) Place, (c) Open-Fridge-Pick, (d) Open-Cabinet, (e) Open-Fridge. The Open-Fridge-Pick and Place skills show high variance across seeds for the main method. RL-Curriculum shows higher variance on the Pick skill. Results are reported up to 250M training steps to allow a comparison with all baselines (RL-Curriculum trains sub-skills only during the first stage).

B.2 Additional Baseline Details

B.2.1 M3 & M3-Oracle

This baseline sequences mobile manipulation policies from [17], namely Navigation, Pick, Place, Open-Fridge, and Open-Cabinet, each of which is trained for 100M steps using PPO [34]. M3-Oracle is a variant of M3 that has access to privileged information about the oracle sequence of skills, based on whether the object begins inside a closed or open receptacle at the start of the episode. Each policy is similar to the ones reported in [16].

B.2.2 RL-Curriculum

This baseline is a 2-stage variant of our method: all the auxiliary tasks $\{\mathcal{M}_i\}_{i=1}^{N}$ are trained at once, followed by learning on the main task $\mathcal{M}_0$. The intuition behind this strategy is to first learn a policy capable of performing the simpler auxiliary tasks, which can then be leveraged to learn the challenging main task in the second stage. We allocate a budget of 500M steps of agent experience, divided into two stages: i) 200M steps across all auxiliary tasks, followed by ii) 300M steps to learn the main task. In the first stage, we encode each of the tasks using a separate identifier $T$. During the second stage, we pass in a task identifier for the main rearrangement task, which is unseen during the first stage. We begin each stage with a learning rate of $3 \times 10^{-4}$ with linear learning rate decay over 300M steps.
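The two-stage budget and its learning rate schedule can be laid out as a small sketch. It assumes that the decay target is zero and that the decay counter restarts at the beginning of the second stage, which is one reading of "begin each stage with a learning rate of $3 \times 10^{-4}$"; the constant and function names are hypothetical.

```python
STAGE1_STEPS = 200_000_000   # stage 1: all auxiliary tasks
STAGE2_STEPS = 300_000_000   # stage 2: main task only
BASE_LR = 3e-4
DECAY_STEPS = 300_000_000    # assumed: each stage decays linearly toward 0 over this horizon


def curriculum_schedule(global_step):
    """Return (stage, learning_rate) for a step within the 500M-step budget."""
    if global_step < STAGE1_STEPS:
        stage, local_step = 1, global_step
    else:
        stage, local_step = 2, global_step - STAGE1_STEPS
    lr = BASE_LR * max(0.0, 1.0 - local_step / DECAY_STEPS)
    return stage, lr
```

Under these assumptions, stage 1 ends with a learning rate of about $1 \times 10^{-4}$ because it stops 100M steps short of its decay horizon, and stage 2 restarts at $3 \times 10^{-4}$ and reaches zero at the end of the budget.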

B.3 Auxiliary Task Learning

In Figure 4, we show the results of auxiliary task learning for the main results reported in Table 1. We report up to 250M steps of learning to demonstrate the results of all baselines, including stage 1 of the curriculum baseline.

Among all the skills, the open and close cabinet skills are the easiest to learn and show the lowest variance during training. The remaining skills, in increasing order of difficulty, are Pick < Pick from Fridge < Place. The difficulty of the Place skill arises from its distinct starting state distribution: the robot is spawned with the object in hand, which is not the case for any of the other skills.

B.4 Category Pick

As described in Section 4.2, the Category Pick task demonstrates the merits of AuxDistill on a more challenging observation space, in which the target is specified by its object category rather than its 3D coordinate location. We encode the object category as a one-hot sensor across a total of 20 objects. The agent also receives an RGB sensor observation instead of the depth sensor used in our rearrangement experiments, and the robot is spawned less than 2 m from the target receptacle. All other RL training parameters, including the reward structure, episode horizon length, and success condition, match those of coordinate pick as described in Section B.1.1.

We train Category Pick with 16 environment workers, 8 each for coordinate pick and Category Pick, parallelized across 8 GPUs. We train for 140M steps collected across both Category Pick and coordinate pick, with a starting learning rate of $2 \times 10^{-4}$ and a linear learning rate decay across 200M steps of training. We compare this with the monolithic baseline, implemented with all 16 environments devoted to Category Pick. For training, we utilize the standard rearrange-easy dataset with 50,000 episodes, and we evaluate on 1,000 episodes from the standard validation split.

Figure 5: Analyzing the curriculum of behaviors that emerges while training AuxDistill. (a) Main Task Learning: a comparison of learning on the easier and harder episodes; the easier task is learned first, followed by the harder task. (b) Aux-Task with Easy Main Task: a comparison of main task ($\mathcal{M}_0$) learning with the relevant auxiliary tasks; the easier distribution improves only after the relevant auxiliary skills, i.e., Pick and Place, begin learning.

Appendix C Emergent Curriculum During Training

In this section, we discuss how training AuxDistill results in a curriculum. Note that AuxDistill does not enforce this curriculum explicitly; it arises as a consequence of the multi-task RL training regime, which optimizes for cumulative return.

In Figure 5, we show two such curricula. Figure 5(a) shows that the easy distribution is learned first during training, followed by the hard distribution. Recall that in Table 2 we show that rearrangement fails to learn on hard episodes without including the easier Pick auxiliary task during training. Building on this observation, Figure 5(b) shows another curriculum, between the auxiliary tasks and the easy main task. The tasks improve in the order Pick < Place < Rearrange-Easy. Note, however, that this trend does not hold throughout training; the success rate for the Place skill saturates sooner, at about 300M steps. This difference could be due to the stricter success condition for Place, which requires arm rearrangement that the easy main task does not.