Reinforcement Learning via Auxiliary Task Distillation

Abhinav Narayan Harish¹, Larry Heck², Josiah P. Hanna¹, Zsolt Kira², Andrew Szot²
¹University of Wisconsin – Madison, ²Georgia Tech

Abstract

We present Reinforcement Learning via Auxiliary Task Distillation (AuxDistill), a new method that enables reinforcement learning (RL) to solve long-horizon robot control problems by distilling behaviors from auxiliary RL tasks. AuxDistill achieves this by concurrently carrying out multi-task RL with auxiliary tasks, which are easier to learn and relevant to the main task. A weighted distillation loss transfers behaviors from these auxiliary tasks to solve the main task. We demonstrate that AuxDistill can learn a pixels-to-actions policy for a challenging multi-stage embodied object rearrangement task from the environment reward without demonstrations, a learning curriculum, or pre-trained skills. AuxDistill achieves $2.3\times$ higher success than the previous state-of-the-art baseline in the Habitat Object Rearrangement benchmark and outperforms methods that use pre-trained skills and expert demonstrations.

1 Introduction

While reinforcement learning (RL) is successful in a variety of settings including games [1, 2, 3], chatbots [4, 5, 6, 7, 8], and robotics [9, 10, 11], long-horizon tasks such as embodied object rearrangement, where an embodied agent must rearrange objects to target positions, remain a challenge [12]. Object rearrangement requires learning heterogeneous behaviors like picking, navigating, placing, and opening concurrently with sequencing these behaviors to solve the overall task, all while operating from egocentric vision. Furthermore, the behaviors are interdependent: the robot can only learn how to pick an object if it has first learned how to open the fridge containing the object. Since object rearrangement requires interacting with objects across house-scale spaces with low-level control, the episodes consist of thousands of low-level control steps, exacerbating the problem of credit assignment [13, 14]. Overall, the problems of concurrently learning low-level control, high-level decision-making, interdependent task stages, and episodes with many time steps make object rearrangement challenging even with dense rewards. While some prior methods have applied end-to-end RL to rearrangement problems, they require excessive experience in simplified simulators [15] or expert skill demonstrations [16]. Other prior work has made such complex tasks more tractable by decomposing the full task into a hierarchy of separately trained skills that a high-level policy sequences together to solve the overall task. However, such hierarchical methods suffer from compounding errors between skills [17] and struggle to dynamically select between skills.

We therefore present Reinforcement Learning via Auxiliary Task Distillation (AuxDistill), a method for training policies for long-horizon tasks from scratch with end-to-end RL from reward alone. AuxDistill also learns in auxiliary tasks using a multi-task RL framework and transfers the knowledge from these auxiliary tasks to help solve the desired “main” task of interest. Importantly, AuxDistill concurrently learns the main task and all auxiliary tasks, which are all initialized from scratch and not pre-trained. These auxiliary tasks help because they are easier to learn with RL than the main task and contain relevant behaviors for the full task. For example, in object rearrangement, the agent should also practice picking up objects in isolation from the complexities of the full task. Unlike curriculum learning in RL [18, 19, 20, 21], which also leverages easier tasks to learn harder tasks, AuxDistill does not have distinct curriculum phases. Instead, it has a single training objective for distilling sub-behaviors into the overall task behavior. Crucially, the agent learns end-to-end which auxiliary behaviors are relevant for the main task. This relevance is encoded by a scalar weight determined by how relevant the robot state is to the auxiliary task, and it is grounded in the oracle task plan. A distillation objective transfers behaviors from relevant auxiliary tasks by encouraging the policy to act consistently between the main and auxiliary tasks for relevant states in the main task. For instance, in object rearrangement, when the robot is near a drawer with the target object inside, it is supervised with the distillation loss from an “open drawer” auxiliary task. Although the policy learns the “open drawer” auxiliary task from scratch at the same time, it learns the auxiliary tasks faster since they are easier. Thus, the distillation loss is a useful dense per-time-step supervision signal that helps overcome the challenges of learning from the main task reward alone.

We empirically demonstrate that AuxDistill outperforms a variety of baselines in terms of success rate on the home rearrangement benchmark in Habitat 2.0 [22]. These include baselines such as hierarchical RL, end-to-end RL with and without a curriculum, and imitation learning. On rearrangement episodes where objects start in open receptacles, AuxDistill achieves $1.4\times$ higher success than baselines and $2.6\times$ higher success in harder episodes where objects can start in closed receptacles. These results highlight AuxDistill's ability to learn the entire rearrangement task end-to-end with RL. We show results beyond object rearrangement on a category-conditioned object manipulation task where AuxDistill achieves $1.75\times$ higher success than baselines. We also conduct extensive ablation studies where we analyze the properties of auxiliary tasks needed by AuxDistill, and find that AuxDistill is generally robust to the choice of auxiliary tasks. An analysis of the distillation objective reveals the importance of the distillation loss: without it, the policy achieves no success on the rearrangement task. All code is available at https://github.com/absdnd/aux_distill

2 Related Work

Some prior works address complex and long-horizon decision-making problems using a hierarchical breakdown of skills. One such category of approaches is option-critic methods [23, 24, 25], which seek to learn options to temporally abstract the high-level policy decision making. However, discovering and learning such options is unstable and results in a challenging credit assignment problem. Another line of work first trains or is given pre-trained skills and then learns a high-level policy to sequence these skills together to complete longer tasks [26, 27, 28, 29]. Using such a hierarchy leads to compounding errors from sequencing skills that were not trained to properly transition between each other [30, 31, 32, 33, 17]. Our approach does not utilize a hierarchical policy and can solve long-horizon tasks better than hierarchical methods by utilizing a single policy, thus avoiding the compounding errors that emerge from sequencing independently trained policies.

Like our method, prior works have used model-free end-to-end RL to solve long, complex tasks. Scaling PPO [34] training has enabled robots to learn complex locomotion behaviors [35, 36, 37, 38, 39, 40], manipulate unseen objects in a robotic hand [10], and play competitive video games [2]. Likewise, Berges et al. [15] showed scaling end-to-end RL can learn object rearrangement. However, their approach requires training agents for billions of environment steps with a simplified kinematic physics simulation. We show results in the full dynamic Habitat simulation, achieving better results in an order of magnitude fewer environment interactions, and show results in harder rearrangement tasks. Other works have incorporated auxiliary objectives in RL to boost performance in embodied tasks [41, 42, 43, 44]. These works add auxiliary self-supervised prediction objectives, whereas our method adds an auxiliary policy learning task via multi-task RL. Jia et al. [45] also use easier auxiliary RL tasks that specialist agents learn to complete and then distill back into a generalist agent. While our work also uses RL in easier tasks to boost performance, it doesn’t require alternating between learning specialist and generalist policies.

Prior works have also attempted to learn a curriculum to leverage easier tasks to learn harder tasks that require composing multiple behaviors. Some works learn curriculum generation policies that adjust the environment to suit the agent’s capabilities [19, 20, 21]. In other works, a curriculum naturally emerges as a result of downstream training [46]. Finally, some works hand-design curricula that change aspects of the environment, such as the starting and goal distributions, depending on the agent’s performance during training [35]. Our work shares a similar spirit of leveraging easier auxiliary tasks to learn a harder task, but it does not enforce an explicit curriculum and instead learns all tasks simultaneously.

Figure 1: AuxDistill learns a rearrangement policy operating from egocentric depth perception and coordinate-based task specification. The full object rearrangement task decomposes into modular abilities that can be learned by auxiliary tasks with indicator vectors $T_{1},\dots,T_{N}$, which are trained along with the main task using end-to-end RL. The distillation loss is a relevance-weighted sum, over all auxiliary tasks, of the divergence between the policy's action distributions at $o_{t}$ under the main task $T_{0}$ and under each auxiliary task. The task relevance function computes $w_{i}(s_{t})$ based on the relevance of the current observation and robot state to the auxiliary task $T_{i}$. The distillation loss and RL-training loss are then used to update the policy.

3 Method

We present Reinforcement Learning via Auxiliary Task Distillation (AuxDistill), a new method for training long-horizon policies from scratch using rewards alone. Learning from reward alone in complex tasks like embodied object rearrangement is challenging because an agent needs to combine thousands of low-level actions controlling the arm and base, operate from egocentric visual perception, and dynamically sequence distinct behaviors such as navigating, picking, placing, and opening. AuxDistill addresses these challenges of RL in complex problems, like embodied rearrangement, by learning to leverage knowledge from easier auxiliary tasks related to the desired task we are trying to solve. Unlike a curriculum, which learns gradually harder tasks in stages, AuxDistill learns the desired task concurrently with the auxiliary tasks and includes a novel distillation mechanism to transfer knowledge from easier to harder tasks.

3.1 Preliminaries

Our problem is formulated as a goal-specified Partially-Observable Markov Decision Process (POMDP) defined as a tuple $\mathcal{M}=\left(\mathcal{S},\mathcal{O},\mathcal{A},\mathcal{P},\mathcal{R},\mathcal{G},\rho,\gamma\right)$ with underlying state space $\mathcal{S}$, observation space $\mathcal{O}$, action space $\mathcal{A}$, transition function $\mathcal{P}$, reward function $\mathcal{R}$, goal space $\mathcal{G}$, initial state distribution $\rho$, and discount factor $\gamma$. For a task like rearrangement, the goal space $\mathcal{G}$ is specified using the 3D coordinate of the object’s start location and the goal location where the object has to be placed. Our objective is to learn a goal-conditioned policy $\pi(a\mid o,g)$ mapping an observation $o$ and goal $g$ to an action $a$ that maximizes the sum of discounted rewards $\mathbb{E}_{s_{0}\sim\rho,g\sim\mathcal{G}}\sum_{t}\gamma^{t}\mathcal{R}(s_{t},g)$.
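For a single episode, the inner discounted sum in this objective can be evaluated with a backward recursion. The snippet below is only a minimal illustration: the function name `discounted_return` is ours, and the rewards are passed in as a plain list rather than computed from $\mathcal{R}(s_{t},g)$.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for one episode.

    `rewards` is a hypothetical list of per-step rewards r_0, r_1, ...
    """
    total = 0.0
    # Work backwards so each step needs one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total
```

The policy objective is the expectation of this quantity over initial states and goals.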

3.2 Reinforcement Learning via Auxiliary Task Distillation

The core idea of AuxDistill is to learn a policy in a difficult desired “main” task by transferring knowledge from easier auxiliary tasks. This is done without using an explicit curriculum by learning the main and auxiliary tasks in a single training loop. We refer to the main task we wish to learn as $\mathcal{M}_{0}$ with a task identifier $T_{0}$. We define $N$ auxiliary tasks $\left\{\mathcal{M}_{i}\right\}_{i=1}^{N}$ with task identifiers $\left\{T_{i}\right\}_{i=1}^{N}$. The auxiliary tasks $\left\{\mathcal{M}_{i}\right\}$ share the same state, observation, goal, and action space as the main task $\mathcal{M}_{0}$. Each $\mathcal{M}_{i}$ has a separately defined starting state distribution and reward function. We assume these auxiliary tasks are related to the full task yet are easier to solve than the full task. Specifically, the auxiliary tasks can be easier instantiations or sub-parts of the overall task. For example, in rearrangement, we define the auxiliary tasks in terms of interactions the agent needs to complete the entire rearrangement episode, such as picking, placing, and opening. Prior works use similar task definitions to train skills for hierarchical policies in rearrangement [17], but these works rely on a two-stage pipeline that first trains each skill separately and then decides how to combine them. AuxDistill directly learns the full task from scratch while also performing better.

AuxDistill learns a single policy with RL that concurrently learns to perform the main and auxiliary tasks. We illustrate the implementation of AuxDistill in Figure 1. This policy, parameterized by $\theta$, is expressed as $\pi_{\theta}(a_{t}\mid o_{t},g,T)$ for observation $o_{t}$, episode goal $g$, and the identifier $T$ of the task $\mathcal{M}$ being performed. The identifier is encoded as a one-hot vector whose size equals the maximum number of auxiliary tasks. Note that since all tasks share the same observation and action space, $\pi_{\theta}$ can act and observe in all tasks based on the input task identifier $T$.

AuxDistill updates the policy based on an average RL loss from the main task and all auxiliary tasks. Let $\mathcal{L}_{\mathcal{M}_{i}}^{\text{RL}}(\theta)$ denote the RL loss for $\pi_{\theta}$ in POMDP $\mathcal{M}_{i}$. The auxiliary tasks are designed to capture subsets of the main task that are easier to accomplish than the main task itself. To compute these losses, we assume we can collect experience in the auxiliary tasks. For example, to collect experience in a “pick object” auxiliary task in object rearrangement, the robot is spawned close to the object. We then update based on the average of the main and auxiliary task losses: $\frac{1}{N}\sum_{i=0}^{N}\mathcal{L}_{\mathcal{M}_{i}}^{\text{RL}}(\theta)$. Intuitively, $\pi_{\theta}$ will first learn to complete the easier auxiliary tasks, and this auxiliary task competency can aid in solving the main task. This shares a similar insight as curriculum learning, except all tasks are learned concurrently, and the curriculum stages are naturally induced by the policy learning the easier tasks first.

In addition to optimizing the average RL loss between the main and auxiliary tasks, AuxDistill also optimizes a distillation loss that encourages the policy to transfer relevant knowledge from the auxiliary tasks to the main task. To achieve this, for a particular observation $o_{t}$ , we consider the policy distribution $\pi_{\theta}(\cdot\mid o_{t},g,T)$ under different task identifiers $T$ . For compactness, notate $\pi_{\theta}(\cdot\mid o_{t},g,T_{i})$ where $T_{i}$ indicates the task identifier for $\mathcal{M}_{i}$ as $\pi_{\theta}^{T_{i}}\left(o_{t}\right)$ . We want $\pi_{\theta}^{T_{0}}$ (the policy in the main task) to match the behaviors from the policy in the relevant auxiliary tasks $\pi_{\theta}^{T_{i}}$ . We measure this relevance of an auxiliary task $\mathcal{M}_{i}$ to the main task at time step $t$ via an auxiliary task relevance function $w_{i}(s_{t})$ . This function denotes how much the knowledge from $\mathcal{M}_{i}$ should apply to $\mathcal{M}_{0}$ in the underlying simulator state $s_{t}$ . This relevance function can be grounded in the task plan generated by an oracle planner. For example, consider the state of the robot before picking up the object. In this state, the pick auxiliary task would be relevant to the main task, and the place auxiliary task would not. If an episode requires opening a fridge before accessing the object, the relevant task before the fridge is opened would be open-fridge. Note that computing the distillation weight can utilize oracle knowledge of the simulator state (for instance, whether the object is inside a fridge or a cabinet) since this information is only provided as a training signal and not used during inference. This information is utilized by all methods we compare against, including our strongest baseline [16].

We can then distill experience from $\mathcal{M}_{i}$ by supervising the policy's action distribution in the main task $\mathcal{M}_{0}$ to match its action distribution under the auxiliary task $\mathcal{M}_{i}$. The distillation loss is computed as the KL divergence between the action distributions under $\mathcal{M}_{0}$ and $\mathcal{M}_{i}$, weighted by the relevance of auxiliary task $\mathcal{M}_{i}$ given by $w_{i}(s_{t})$. The overall loss function of AuxDistill, including the distillation loss, is then:

$$\frac{1}{N}\sum_{i=0}^{N}\mathcal{L}_{\mathcal{M}_{i}}^{\text{RL}}(\theta)+\lambda\sum_{i=1}^{N}w_{i}(s_{t})\,D_{\text{KL}}\left(\pi_{\theta}^{T_{0}}(o_{t})\;\|\;\pi_{\theta}^{T_{i}}(o_{t})\right)\tag{1}$$

Here $\lambda$ represents the distillation weight relative to other PPO losses. Note that our formulation relies on the auxiliary tasks being easier to solve and relevant to the main task. Distillation using the task relevance function offers an additional supervisory signal that the policy can optimize along with its own reward. This doesn’t impose the large overhead of hierarchical RL methods where each auxiliary task must be pre-trained in isolation with its own reward function and goal specification. Auxiliary tasks are flexibly incorporated to address parts of the task that are challenging for the agent to learn.
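For a single time step, Equation 1 can be evaluated as in the sketch below. It assumes a discrete action space with categorical action distributions, which may differ from the arm and base controls used in the experiments. The helper names `kl_categorical` and `auxdistill_loss` are illustrative and are not taken from the released code, and the per-task RL losses are treated as given scalars.

```python
import math


def kl_categorical(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0  # terms with p_i = 0 contribute nothing
    )


def auxdistill_loss(rl_losses, main_probs, aux_probs, weights, lam):
    """Evaluate Equation 1 for one time step (assumed categorical actions).

    rl_losses:  [L_0, ..., L_N], per-task RL losses with the main task first.
    main_probs: action distribution under T_0 at observation o_t.
    aux_probs:  action distributions under T_1..T_N at the same o_t.
    weights:    relevance weights w_1(s_t)..w_N(s_t).
    lam:        distillation coefficient lambda.
    """
    n_aux = len(aux_probs)
    # 1/N normalization over N + 1 losses, exactly as written in Equation 1.
    rl_term = sum(rl_losses) / n_aux
    distill_term = sum(
        w * kl_categorical(main_probs, q) for w, q in zip(weights, aux_probs)
    )
    return rl_term + lam * distill_term
```

Setting `lam=0.0` leaves only the multi-task RL term, and with indicator weights only the relevant auxiliary task contributes to the distillation term.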

3.3 Implementation Details

We use PPO [34] to compute the RL loss $\mathcal{L}_{\mathcal{M}_{i}}^{\text{RL}}$ in Equation 1. We create $M$ environment instances for each of the main and $N$ auxiliary tasks and vectorize them for parallel action execution. We roll out the policy in all these $M(N+1)$ environments and collect a batch of data. We then compute Equation 1 using the PPO loss, update the policy, and repeat this process.
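One simple way to lay out the $M(N+1)$ parallel environments is to give each task a contiguous block of $M$ instances. The ordering and the function name `assign_tasks` are assumptions made for illustration; the text only fixes the total count.

```python
def assign_tasks(num_envs_per_task, num_aux_tasks):
    """Return the task index for each of the M * (N + 1) parallel environments.

    Index 0 denotes the main task; indices 1..N denote the auxiliary tasks.
    """
    return [
        task
        for task in range(num_aux_tasks + 1)
        for _ in range(num_envs_per_task)
    ]
```

For example, `assign_tasks(4, 5)` describes 24 environments, four per task.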

The policy is represented as a 2-layer LSTM network with 512 hidden units per layer, similar to the architecture employed in [22]. In our experiments, $o_{t}$ is a depth or RGB egocentric image, which we encode with a ResNet50 [47] network. The goal and the proprioceptive state of the agent are concatenated with the visual embeddings along with the index of the POMDP $\mathcal{M}_{i}$ , for which we use a one-hot embedding. This vector is then passed as an input to the LSTM. A linear projection on the output of the LSTM then produces an action to execute on the next step in the environment and a vector of value-predictions corresponding to each of the tasks $\mathcal{M}_{i}$ used during training.
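The construction of the LSTM input described above can be sketched as follows. The concatenation order, the function names, and the choice of one one-hot slot per task (main task included) are assumptions for illustration; the visual embedding is taken as an already-computed feature list.

```python
def one_hot(index, size):
    """One-hot list of length `size` with a 1.0 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def build_lstm_input(visual_embedding, goal, proprio, task_index, num_tasks):
    """Concatenate visual features, goal, proprioception, and task one-hot.

    All arguments except `task_index` and `num_tasks` are flat lists of floats.
    """
    return (
        list(visual_embedding)
        + list(goal)
        + list(proprio)
        + one_hot(task_index, num_tasks)
    )
```

Only the trailing one-hot segment changes when the same observation is evaluated under a different task identifier, which is what the distillation loss relies on.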

Our policy is learned via a multi-task RL formulation, with the main and auxiliary tasks being concurrently learned. Each of the tasks may have different reward magnitudes and relative performance during training. To address this, we use PopArt [48] with $\beta=3\times 10^{-4}$ to enable learning across different return scales. Additional AuxDistill details are in Appendix A of the supplementary material.
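As a rough illustration of what PopArt does, the sketch below keeps running first and second moments of the return targets with step size $\beta$ and rescales the final linear layer so that unnormalized value predictions are unchanged by a statistics update. It covers a single scalar value output; using one such head per task is our assumption, and the class and attribute names are hypothetical rather than taken from the AuxDistill code.

```python
import math


class PopArtHead:
    """Scalar value head with PopArt return normalization (illustrative)."""

    def __init__(self, weights, bias, beta=3e-4):
        self.w = list(weights)  # last-layer weights on the LSTM features
        self.b = bias
        self.beta = beta
        self.mu = 0.0  # running mean of return targets
        self.nu = 1.0  # running second moment of return targets

    @property
    def sigma(self):
        # Standard deviation from the two moments, floored for stability.
        return math.sqrt(max(self.nu - self.mu ** 2, 1e-8))

    def normalized(self, features):
        return sum(w * x for w, x in zip(self.w, features)) + self.b

    def unnormalized(self, features):
        return self.sigma * self.normalized(features) + self.mu

    def update_stats(self, target):
        old_mu, old_sigma = self.mu, self.sigma
        self.mu = (1 - self.beta) * self.mu + self.beta * target
        self.nu = (1 - self.beta) * self.nu + self.beta * target ** 2
        new_sigma = self.sigma
        # Rescale the layer so unnormalized outputs are preserved.
        self.w = [w * old_sigma / new_sigma for w in self.w]
        self.b = (old_sigma * self.b + old_mu - self.mu) / new_sigma
```

The value loss would then be computed on normalized targets, which keeps tasks with large returns from dominating the shared network.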

	Train (Seen)			Eval (Unseen)
Method	All Episodes	Easy	Hard	All Episodes	Easy	Hard
M3 (Oracle) [17]	27 $\pm$ 0	57 $\pm$ 2	12 $\pm$ 1	28 $\pm$ 0	58 $\pm$ 2	13 $\pm$ 2
M3 [17]	25 $\pm$ 1	56 $\pm$ 1	9 $\pm$ 2	13 $\pm$ 1	53 $\pm$ 4	0 $\pm$ 0
Galactic [15]	-	37 $\pm$ 0	-	-	26 $\pm$ 0	-
Monolithic RL	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
Skill Transformer [16]	25 $\pm$ 1	44 $\pm$ 1	16 $\pm$ 1	23 $\pm$ 1	37 $\pm$ 1	16 $\pm$ 1
AuxDistill (No Distillation)	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0
RL Curriculum	0 $\pm$ 0	1 $\pm$ 0	0 $\pm$ 0	0 $\pm$ 0	1 $\pm$ 1	0 $\pm$ 0
AuxDistill	49 $\pm$ 2	74 $\pm$ 2	36 $\pm$ 2	52 $\pm$ 2	75 $\pm$ 2	41 $\pm$ 2

Table 1: Success rates on the rearrangement task for our method, AuxDistill (highlighted in blue), and baselines. Displayed are the average and standard deviations for 3 seeds for M3, Monolithic RL, Skill Transformer, and M3 Oracle (numbers from [16]), and 3 seeds for the remaining methods with the highest numbers per setting bolded. Numbers in the easy and hard columns are averages over 100 episodes and 200 episodes, respectively. The All Episodes column is an average across both the Easy and Hard episodes. AuxDistill outperforms all baselines in all settings.
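The All Episodes column can be related to the Easy and Hard columns with a short calculation. The sketch below assumes the column pools all episodes, so the two rates are weighted by the 100 easy and 200 hard episodes stated in the caption; the function name `pooled_success` is ours.

```python
def pooled_success(easy_rate, hard_rate, n_easy=100, n_hard=200):
    """Episode-weighted success rate (in percent) over easy and hard episodes."""
    return (easy_rate * n_easy + hard_rate * n_hard) / (n_easy + n_hard)
```

Under this assumption the AuxDistill entries give `pooled_success(74, 36)` of about 48.7 on train and `pooled_success(75, 41)` of about 52.3 on eval, in line with the reported 49 and 52.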

4 Experiments

4.1 Object-Rearrangement

In this section, we compare AuxDistill with baselines on the Habitat 2.0 Object Rearrangement task [22]. For comparison with baselines, we run our experiments using the setup from [16]. In this task, a Fetch robot [49] must move an object from a specified start position to a desired goal position in an indoor home environment using only onboard sensing. The agent has no privileged information like existing maps of the environment, 3D object models, or exact object positions. The Fetch robot senses the world through a $256\times 256$ depth camera mounted on the robot’s head, robot joint positions, gripper state, and base egomotion, giving the relative position of the robot from the start of the episode. The task is specified by a starting 3D object coordinate for the object to move and a 3D goal coordinate to move the object to. Both coordinates are specified relative to the robot’s position at the start of the episode. Only the starting object coordinate is specified, and this information is not updated based on the current object position (e.g. if it is moved).

The robot interacts with the world via a 7DoF arm, a suction gripper attached to the end of the arm, and a mobile base. The episode is successful if the target object is within 15cm of the goal position. The robot has a budget of 1,500 steps to complete the task.
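The stated success criterion and step budget can be written as a small check. This sketch covers only the distance threshold and step limit given above; any further conditions in the benchmark are not modeled, the inclusive comparisons are an assumption, and the names are hypothetical.

```python
import math

SUCCESS_RADIUS_M = 0.15    # 15 cm threshold from the task definition
MAX_EPISODE_STEPS = 1500   # step budget from the task definition


def episode_succeeded(object_pos, goal_pos, steps_taken):
    """True if the object ends within 15 cm of the goal inside the step budget.

    `object_pos` and `goal_pos` are 3D coordinates in meters.
    """
    within_budget = steps_taken <= MAX_EPISODE_STEPS
    at_goal = math.dist(object_pos, goal_pos) <= SUCCESS_RADIUS_M
    return within_budget and at_goal
```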

We report performance on the easy and hard evaluation episodes from [16]. In easy episodes, both the target object and goal are on an open receptacle. This means the robot does not have to first open a receptacle before accessing the target object or goal. Instead, the agent can always execute the same sequence of navigate to object, pick up object, navigate to goal, and place object at goal. In hard episodes, the object may start in a closed receptacle. The agent, therefore, needs to use its visual input to perceive if the target object is in a closed receptacle. If so, the robot must then open the receptacle before picking the object. The object may start in either the fridge or a cabinet.

4.1.1 Training Setup

We train AuxDistill using 11,791 episodes with a mix of easy and hard episodes across $63$ scenes using the same rearrangement training dataset as in [50]. These episodes were obtained by sampling an equal number of episodes in the easy and hard categories, with easy episodes being equally sampled across episodes with open-cabinet, open-fridge and non-articulated episodes. For hard episodes, sampling is done uniformly between closed fridge and closed cabinet episodes. We use auxiliary tasks covering abilities included as a part of the standard rearrangement benchmark used in [50]. The auxiliary tasks are:

• Pick: The agent spawns randomly within the house and must navigate to and pick up the object.
• Place: The agent is spawned randomly in the house with the object in its gripper. The robot must navigate to and place the object within 15cm of the target 3D location on the receptacle.
• Open Fridge: The robot spawns randomly within the house and must navigate to the fridge and use its base and arm to open the fridge door.
• Open Cabinet: The robot spawns randomly in the house and must navigate to the kitchen area and open the cabinet using its base and arm.
• Pick from Fridge: This task is similar to Pick except the agent has to pick up the object from inside an open fridge, which requires careful manipulation to minimize collisions.

The auxiliary relevance function is grounded in the oracle task plan [51] of the episode. This is formulated as an indicator function with $w_{i}(s_{t})=1$ if the state $s_{t}$ is relevant to $\mathcal{M}_{i}$ based on the oracle task plan of the episode and $w_{i}(s_{t})=0$ otherwise.

For example, if at step $t$ the agent is not holding an object and the object is in an open receptacle, the agent must first pick up the object, so $w_{i_{\text{pick}}}(s_{t})=1$ and $w_{i}(s_{t})=0$ for all other auxiliary tasks. For a complete description of the auxiliary tasks, including their reward functions, along with the precise auxiliary relevance function definition in object rearrangement, see Appendix B. We train methods for 475M steps of environment interactions. We found this budget sufficient for the auxiliary tasks to progress far enough for distillation to aid rearrangement performance on both the easy and hard episodes. For AuxDistill, we count the environment interactions in the auxiliary tasks towards this 475M step budget. We train with a learning rate of $3\times 10^{-4}$ and linearly decay the learning rate to $0$ over the course of learning.
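An indicator relevance function of this kind might look like the sketch below. It is not the definition from Appendix B: the state fields (`holding_object`, `receptacle`, `receptacle_open`), the rule that exactly one auxiliary task is active at a time, and the mapping of an open fridge to Pick from Fridge are all assumptions made for illustration.

```python
AUX_TASKS = ["pick", "place", "open_fridge", "open_cabinet", "pick_from_fridge"]


def relevance_weights(holding_object, receptacle, receptacle_open):
    """Return {task_name: 0 or 1} for a simplified oracle-plan state.

    holding_object:  whether the target object is in the gripper.
    receptacle:      None, "fridge", or "cabinet" (where the object starts).
    receptacle_open: whether that receptacle is currently open.
    """
    if holding_object:
        active = "place"
    elif receptacle is not None and not receptacle_open:
        # The container must be opened before the object can be reached.
        active = "open_fridge" if receptacle == "fridge" else "open_cabinet"
    elif receptacle == "fridge":
        active = "pick_from_fridge"  # assumed mapping for an open fridge
    else:
        active = "pick"
    return {task: int(task == active) for task in AUX_TASKS}
```

Because these weights use simulator state, they are available only during training, which matches how the distillation loss is used.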

	Train (Seen)			Eval (Unseen)
Method	All Episodes	Easy	Hard	All Episodes	Easy	Hard
All Skills	28	36	19	30	39	21
No Pick From Fridge	20	32	7	12	23	2
No Open Fridge	41	49	33	48	57	39
No Pick	0	0	0	0	0	0

Table 2: Robustness to auxiliary task selection. We compare the train and generalization performance of AuxDistill with 4 different auxiliary task selections. Results are averages across 100 episodes on a single seed of training. The train and evaluation episodes are in a simplified setting compared to Table 1, as described in Section 4.3.

4.1.2 Baselines

We compare AuxDistill to both hierarchical methods governed by a task plan as well as monolithic baselines, which directly learn a pixels-to-actions policy using sensor observations. We compare to relevant baselines from Huang et al. [16] and two more baselines that use the auxiliary tasks.

• Monolithic RL: A monolithic neural network is trained with end-to-end RL to map sensor observations directly to actions. The policy showed no signs of learning the main task, so we stopped training early after 100M steps. This outcome is consistent with the monolithic RL baseline in prior works [15, 22].
• Galactic [15]: An end-to-end RL framework similar to Monolithic RL that maps sensor observations directly to actions. The policy is trained in a simplified kinematic simulation for more than $10^{9}$ simulation steps and then transferred to the dynamic simulation used in our setting.
• M3 [17]: The skill policies navigate, pick, place, and open are each trained separately. These skills are then sequenced using a task planner.
• M3 (Oracle) [17]: This uses the same skill training as M3 but sequences the skills with an oracle task planner.
• Skill Transformer [16]: This method uses pre-trained skills to collect successful rearrangement demonstrations and then trains on these demonstrations with imitation learning.
• RL Curriculum: This method is trained in two stages: (i) we first train the policy on the auxiliary tasks for 200M steps to ensure the relevant auxiliary tasks have good training performance; (ii) we then train on the entire rearrangement task for another 300M steps.
• AuxDistill (No Distillation): AuxDistill trained under the setting described in Section 4.1.1, except with the distillation strength $\lambda=0$.

For more details on the baselines, see Section B.2.

4.1.3 Rearrangement Performance

In Table 1, we compare AuxDistill to baselines in the rearrangement task. AuxDistill outperforms all baselines in every evaluation setting. On unseen episodes, it improves over the best-performing non-oracle baseline by 22 percentage points on easy episodes ($75\%$ vs. $53\%$ for M3) and by 25 percentage points on hard episodes ($41\%$ vs. $16\%$ for Skill Transformer). Most baselines struggle to achieve any performance on the unseen episodes. Monolithic RL [22] achieves no success in any of the evaluation settings, demonstrating the impracticality of learning directly from the dense reward in the full rearrangement task. Our method also outperforms Galactic [15] on the easy split, with $74\%$ vs. $37\%$ success on training episodes and $75\%$ vs. $26\%$ on evaluation episodes, despite learning from less than half the number of training samples (475M for AuxDistill vs. $>10^{9}$ for Galactic). Note that Galactic does not report performance on the hard distribution.

Like AuxDistill, RL Curriculum also utilizes the auxiliary tasks yet finds almost no success. This demonstrates the value of concurrent training on the main and auxiliary tasks with the distillation loss in AuxDistill. Likewise, AuxDistill (No Distillation) achieves no success, illustrating the importance of the distillation loss. The learning curve in Figure 3(a) shows that AuxDistill gradually learns to complete the task with more environment interactions. On the other hand, RL Curriculum and AuxDistill (No Distillation) remain near zero success regardless of the number of learning samples.

AuxDistill also outperforms the strongest baseline, Skill Transformer, by a significant margin with $75\%$ vs. $37\%$ on the unseen easy episodes and $41\%$ vs. $16\%$ on the unseen hard episodes. This demonstrates the advantages of using online, end-to-end RL, rather than solely training offline with demonstrations. As shown in Figure 3(a), AuxDistill is able to improve with subsequent environment interactions, yet Skill Transformer is limited by the performance of the expert demonstrations.

Table 1 also demonstrates that AuxDistill outperforms hierarchical baselines. M3 is able to achieve $53\%$ success on the unseen easy episodes since it utilizes strong pretrained skills. Even in this setting, AuxDistill achieves a higher success of $75\%$ . However, on the hard episodes, M3 cannot dynamically plan skills and achieves no success in the hard unseen episodes. AuxDistill achieves $41\%$ on this setting because it learns a monolithic policy with RL that combines low-level and high-level decision-making. AuxDistill even outperforms an oracle version of M3 that dynamically plans the skill sequence based on oracle state information ( $13\%$ vs. $41\%$ on the unseen hard episodes).

We observe that AuxDistill performs $7\%$ worse on the Habitat 2.0 rearrangement training episodes than on the evaluation episodes. We find this performance gap is due to additional easier episodes in the evaluation distribution where the object is closer to the target location. In the hard split, there are $55$ eval episodes and $29$ train episodes where the object is picked up from the fridge and the target location is less than $0.5$ m from the object location. After removing these episodes, train performance is higher than evaluation, with $32\%$ success on train and $31\%$ success on evaluation.

Method	Train (Seen)	Eval (Unseen)
AuxDistill	12 $\pm$ 2	11 $\pm$ 1
Monolithic	8 $\pm$ 0	9 $\pm$ 1
AuxDistill (No Distillation)	3 $\pm$ 1	4 $\pm$ 1
RL Curriculum	6 $\pm$ 1	8 $\pm$ 0

Table 3: Comparison of our method on the Category Pick task with auxiliary tasks on three random seeds. AuxDistill outperforms all methods, including the monolithic baseline, on the unseen evaluation split. Evaluation is conducted over 1000 seen episodes sampled from the training distribution and 1000 held-out episodes from evaluation.

4.2 Category Pick

The merits of AuxDistill extend to other challenging embodied tasks. In particular, consider a variant of the pick task described in Section 4.1 where a robot has to pick up an object using the object name passed as a one-hot embedding to the policy. Category pick requires the policy to discern the object to pick up by correlating the RGB observation with the object category passed as a one-hot embedding. This task is more challenging than geometric pick where the policy has access to the initial coordinate specification of the object to be picked. We include additional details about the task specification in Section B.4 of our supplementary material.

We leverage the easier coordinate pick task to aid the learning of the Category Pick task using AuxDistill. More specifically, we consider the task of interest $M_{0}$ to be the Category Pick task and $M_{1}$ (auxiliary task we would like to distill from) to be the coordinate pick task. Both tasks are trained jointly using AuxDistill with the same reward structure with $w_{i}(s_{t})=1$ and use $\lambda=0.5$ for distillation in Equation 1. For this task, we train on the full rearrange-easy dataset from [50] with 50,000 episodes.

We report the training curves of the Category Pick task in comparison with the RL Curriculum and No Distillation baseline used in Table 1. In addition, we introduce a monolithic baseline for this task, which directly trains Category Pick without leveraging the coordinate pick task during training. We report the training curves in Figure 3(b). We evaluate the performance of the trained policy on $1000$ episodes sampled from the training and $1000$ episodes from the validation split.

In Table 3, we show that AuxDistill achieves the best performance, with a success rate of $11\%$ on the held-out episodes, outperforming all baselines that do not leverage distillation during RL training. The closest baseline is the monolithic baseline, which achieves a success of $9\%$ on the held-out distribution. AuxDistill (No Distillation) and RL-Curriculum perform worse, with an evaluation success of $4\%$ and $8\%$ , respectively.

4.3 AuxDistill Analysis

4.3.1 Robustness to Auxiliary Task Selection

A critical consideration for training AuxDistill is the selection of auxiliary tasks. In this section, we compare four different selections of auxiliary tasks in the object rearrangement task from Section 4.1. We conduct this ablation on a smaller distribution of episodes, which only include the Easy episodes and Hard episodes where the object starts in the fridge and train AuxDistill for 100M steps.

As we conduct this study only on the fridge category of articulated episodes, we modify the auxiliary task selection to be: Pick, Open Fridge, Place, and Pick from Fridge. Note that the first three auxiliary tasks are the same as in Section 4.1. The other auxiliary task selections are the same as this original selection, but exclude one of: Pick, Open Fridge, and Pick from Fridge.

In Table 2, we report the performance for each skill selection on the easy and hard episodes. We observe that among all the sub-tasks, the most important is Pick, as excluding it results in $0.0\%$ success. Note that Pick is only relevant ( $w_{i}(s_{t})>0$ ) to easy episodes during training, i.e., removing it should only impact performance on the easy distribution of our training setup. However, we find that it affects both the easy and hard episodes, as the task is not successfully carried out in either case when Pick is removed. In Figure 2, we analyze the success of individual stages of rearrangement and notice that not including Pick leads to failure earlier on in the task.

In Table 2, we notice that introducing the Open-Fridge skill can worsen rearrangement performance ( $30\%$ vs. $48\%$ ). The reason is that the Open-Fridge skill is the easiest of all auxiliary tasks (see Section B.3), and utilizing it reduces the number of samples allocated to the main task. However, certain auxiliary tasks can boost performance by addressing specific bottlenecks in the rearrangement. One such example is Pick from Fridge, which boosts performance from $7\%$ to $19\%$ by addressing the task of picking from an open fridge, which can be challenging due to the difficulty of avoiding collisions with the fridge door. Further, the $4\%$ performance improvement from $32\%$ to $36\%$ on the easy distribution can be attributed to the presence of easy episodes in our dataset with an open refrigerator, where Pick from Fridge helps boost performance.

4.3.2 Distillation Coefficient Variation

We analyze the effect of the distillation loss coefficient $\lambda$ in AuxDistill in Table 4. The distillation loss coefficient controls how the agent balances minimizing the distillation loss from the auxiliary tasks against maximizing the reward on the main task. In Table 4, we notice that a high distillation weighting of $\lambda=1.0$ shifts the objective of the rearrangement task from maximizing cumulative reward to distilling from the auxiliary tasks, leading to a $1\%$ success rate on rearrangement. Similarly, a very small distillation coefficient of $\lambda=0.01$ leads to insufficient leveraging of auxiliary task information, with a success rate of $7\%$ . We find $\lambda=0.05$ and $\lambda=0.1$ to be good choices for optimizing task reward and distilling behaviors from the auxiliary tasks.

	Train (Seen)			Eval (Unseen)
Method	All Episodes	Easy	Hard	All Episodes	Easy	Hard
$\lambda=0.01$	6	6	5	10	18	2
$\lambda=0.05$	30	31	28	36	33	38
$\lambda=0.1$	22	26	19	30	39	21
$\lambda=0.5$	2	3	0	2	5	0
$\lambda=1.0$	0	1	0	1	2	0

Table 4: Performance of our method for a single seed on varying distillation coefficient during training. Using a large distillation coefficient ( $\lambda=1.0$ ) makes reward optimization challenging, and too small ( $\lambda=0.01$ ) results in insufficient auxiliary distillation to succeed on the rearrangement task. The intermediate values $\lambda=0.05$ and $\lambda=0.1$ show the best performance on training and evaluation.

5 Conclusion

In this work, we presented AuxDistill, a new method for end-to-end RL on complex tasks by leveraging auxiliary tasks. AuxDistill learns in the auxiliary tasks concurrently with the main task through multi-task RL. A distillation loss transfers relevant behaviors from the auxiliary tasks to the main task. We show that AuxDistill outperforms a variety of baselines in Habitat object rearrangement by up to $27\%$ . We also show another application of AuxDistill in a category-conditioned manipulation task. Finally, we analyze AuxDistill in different auxiliary task selections and magnitudes of distillation strengths. Overall, AuxDistill presents a new way to tackle compound tasks with RL alone without requiring pre-trained skills or expert demonstrations.

Limitations of AuxDistill include the need for the auxiliary tasks and the auxiliary task relevance function. The auxiliary tasks require knowing behaviors that are relevant and easier to learn than the main task. Each auxiliary task requires defining a new start state distribution and reward function. However, designing these can rely on privileged state information which is less restrictive than the requirement of pre-trained skills or expert demonstrations.

References

Mnih et al. [2013] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
Berner et al. [2019] Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.
Silver et al. [2017] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. nature, 550(7676):354–359, 2017.
Shah et al. [2016] Pararth Shah, Dilek Hakkani-Tür, and Larry Heck. Interactive reinforcement learning for task-oriented dialogue management. In NIPS 2016 Deep Learning for Action and Interaction Workshop, volume 11, 2016.
Liu et al. [2017] Bing Liu, Gokhan Tur, Dilek Hakkani-Tur, Pararth Shah, and Larry Heck. End-to-end optimization of task-oriented dialogue model with deep reinforcement learning. arXiv preprint arXiv:1711.10712, 2017.
Nakano et al. [2021] Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Akkaya et al. [2019] Ilge Akkaya, Marcin Andrychowicz, Maciek Chociej, Mateusz Litwin, Bob McGrew, Arthur Petron, Alex Paino, Matthias Plappert, Glenn Powell, Raphael Ribas, et al. Solving rubik’s cube with a robot hand. arXiv preprint arXiv:1910.07113, 2019.
Qi et al. [2023] Haozhi Qi, Ashish Kumar, Roberto Calandra, Yi Ma, and Jitendra Malik. In-hand object rotation via rapid motor adaptation. In Conference on Robot Learning, pages 1722–1732. PMLR, 2023.
Kalashnikov et al. [2018] Dmitry Kalashnikov, Alex Irpan, Peter Pastor, Julian Ibarz, Alexander Herzog, Eric Jang, Deirdre Quillen, Ethan Holly, Mrinal Kalakrishnan, Vincent Vanhoucke, and Sergey Levine. Scalable deep reinforcement learning for vision-based robotic manipulation. In 2nd Annual Conference on Robot Learning, CoRL 2018, Zürich, Switzerland, 29-31 October 2018, Proceedings, volume 87 of Proceedings of Machine Learning Research, pages 651–673. PMLR, 2018. URL http://proceedings.mlr.press/v87/kalashnikov18a.html.
Batra et al. [2020] Dhruv Batra, Angel X Chang, Sonia Chernova, Andrew J Davison, Jia Deng, Vladlen Koltun, Sergey Levine, Jitendra Malik, Igor Mordatch, Roozbeh Mottaghi, et al. Rearrangement: A challenge for embodied ai. arXiv preprint arXiv:2011.01975, 2020.
Harutyunyan et al. [2019] Anna Harutyunyan, Will Dabney, Thomas Mesnard, Mohammad Gheshlaghi Azar, Bilal Piot, Nicolas Heess, Hado P van Hasselt, Gregory Wayne, Satinder Singh, Doina Precup, et al. Hindsight credit assignment. Advances in neural information processing systems, 32, 2019.
Ni et al. [2024] Tianwei Ni, Michel Ma, Benjamin Eysenbach, and Pierre-Luc Bacon. When do transformers shine in rl? decoupling memory from credit assignment. Advances in Neural Information Processing Systems, 36, 2024.
Berges et al. [2023] Vincent-Pierre Berges, Andrew Szot, Devendra Singh Chaplot, Aaron Gokaslan, Roozbeh Mottaghi, Dhruv Batra, and Eric Undersander. Galactic: Scaling end-to-end reinforcement learning for rearrangement at 100k steps-per-second. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13767–13777, 2023.
Huang et al. [2023] Xiaoyu Huang, Dhruv Batra, Akshara Rai, and Andrew Szot. Skill transformer: A monolithic policy for mobile manipulation. arXiv preprint arXiv:2308.09873, 2023.
Gu et al. [2022] Jiayuan Gu, Devendra Singh Chaplot, Hao Su, and Jitendra Malik. Multi-skill mobile manipulation for object rearrangement. arXiv preprint arXiv:2209.02778, 2022.
Narvekar et al. [2020] Sanmit Narvekar, Bei Peng, Matteo Leonetti, Jivko Sinapov, Matthew E Taylor, and Peter Stone. Curriculum learning for reinforcement learning domains: A framework and survey. The Journal of Machine Learning Research, 21(1):7382–7431, 2020.
Dennis et al. [2020] Michael Dennis, Natasha Jaques, Eugene Vinitsky, Alexandre Bayen, Stuart Russell, Andrew Critch, and Sergey Levine. Emergent complexity and zero-shot transfer via unsupervised environment design. Advances in neural information processing systems, 33:13049–13061, 2020.
Azad et al. [2023] Abdus Salam Azad, Izzeddin Gur, Jasper Emhoff, Nathaniel Alexis, Aleksandra Faust, Pieter Abbeel, and Ion Stoica. Clutr: Curriculum learning via unsupervised task representation learning. In International Conference on Machine Learning, pages 1361–1395. PMLR, 2023.
Fang et al. [2022] Kuan Fang, Toki Migimatsu, Ajay Mandlekar, Li Fei-Fei, and Jeannette Bohg. Active task randomization: Learning visuomotor skills for sequential manipulation by proposing feasible and novel tasks. arXiv preprint arXiv:2211.06134, 2022.
Szot et al. [2021] Andrew Szot, Alexander Clegg, Eric Undersander, Erik Wijmans, Yili Zhao, John Turner, Noah Maestre, Mustafa Mukadam, Devendra Singh Chaplot, Oleksandr Maksymets, et al. Habitat 2.0: Training home assistants to rearrange their habitat. Advances in Neural Information Processing Systems, 34, 2021.
Sutton et al. [1999] Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
Bacon et al. [2017] Pierre-Luc Bacon, Jean Harb, and Doina Precup. The option-critic architecture. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
Zhang and Whiteson [2019] Shangtong Zhang and Shimon Whiteson. Dac: The double actor-critic architecture for learning options. Advances in Neural Information Processing Systems, 32, 2019.
Xia et al. [2020] Fei Xia, Chengshu Li, Roberto Martín-Martín, Or Litany, Alexander Toshev, and Silvio Savarese. Relmogen: Leveraging motion generation in reinforcement learning for mobile manipulation. arXiv preprint arXiv:2008.07792, 2020.
Karkus et al. [2020] Peter Karkus, Mehdi Mirza, Arthur Guez, Andrew Jaegle, Timothy Lillicrap, Lars Buesing, Nicolas Heess, and Theophane Weber. Beyond tabula-rasa: a modular reinforcement learning approach for physically embedded 3d sokoban. arXiv preprint arXiv:2010.01298, 2020.
Dalal et al. [2021] Murtaza Dalal, Deepak Pathak, and Russ R Salakhutdinov. Accelerating robotic reinforcement learning via parameterized action primitives. Advances in Neural Information Processing Systems, 34:21847–21859, 2021.
Hafner et al. [2022] Danijar Hafner, Kuang-Huei Lee, Ian Fischer, and Pieter Abbeel. Deep hierarchical planning from pixels. Advances in Neural Information Processing Systems, 35:26091–26104, 2022.
Vezzani et al. [2022] Giulia Vezzani, Dhruva Tirumala, Markus Wulfmeier, Dushyant Rao, Abbas Abdolmaleki, Ben Moran, Tuomas Haarnoja, Jan Humplik, Roland Hafner, Michael Neunert, et al. Skills: Adaptive skill sequencing for efficient temporally-extended exploration. arXiv preprint arXiv:2211.13743, 2022.
Chen et al. [2023] Yuanpei Chen, Chen Wang, Li Fei-Fei, and C Karen Liu. Sequential dexterity: Chaining dexterous policies for long-horizon manipulation. arXiv preprint arXiv:2309.00987, 2023.
Mishra et al. [2023] Utkarsh Aashu Mishra, Shangjie Xue, Yongxin Chen, and Danfei Xu. Generative skill chaining: Long-horizon skill planning with diffusion models. In Conference on Robot Learning, pages 2905–2925. PMLR, 2023.
Lee et al. [2021] Youngwoon Lee, Joseph J Lim, Anima Anandkumar, and Yuke Zhu. Adversarial skill chaining for long-horizon robot manipulation via terminal state regularization. arXiv preprint arXiv:2111.07999, 2021.
Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
Rudin et al. [2022] Nikita Rudin, David Hoeller, Philipp Reist, and Marco Hutter. Learning to walk in minutes using massively parallel deep reinforcement learning. In Conference on Robot Learning, pages 91–100. PMLR, 2022.
Agarwal et al. [2023] Ananye Agarwal, Ashish Kumar, Jitendra Malik, and Deepak Pathak. Legged locomotion in challenging terrains using egocentric vision. In Conference on Robot Learning, pages 403–415. PMLR, 2023.
Fu et al. [2023] Zipeng Fu, Xuxin Cheng, and Deepak Pathak. Deep whole-body control: learning a unified policy for manipulation and locomotion. In Conference on Robot Learning, pages 138–149. PMLR, 2023.
Kumar et al. [2021] Ashish Kumar, Zipeng Fu, Deepak Pathak, and Jitendra Malik. Rma: Rapid motor adaptation for legged robots. RSS, 2021.
Radosavovic et al. [2023] Ilija Radosavovic, Tete Xiao, Bike Zhang, Trevor Darrell, Jitendra Malik, and Koushil Sreenath. Learning humanoid locomotion with transformers. arXiv preprint arXiv:2303.03381, 2023.
Katara et al. [2023] Pushkal Katara, Zhou Xian, and Katerina Fragkiadaki. Gen2sim: Scaling up robot learning in simulation with generative models. arXiv preprint arXiv:2310.18308, 2023.
Ye et al. [2021a] Joel Ye, Dhruv Batra, Erik Wijmans, and Abhishek Das. Auxiliary tasks speed up learning point goal navigation. In Conference on Robot Learning, pages 498–516. PMLR, 2021a.
Ye et al. [2021b] Joel Ye, Dhruv Batra, Abhishek Das, and Erik Wijmans. Auxiliary tasks and exploration enable objectgoal navigation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16117–16126, 2021b.
Gregor et al. [2019] Karol Gregor, Danilo Jimenez Rezende, Frederic Besse, Yan Wu, Hamza Merzic, and Aaron van den Oord. Shaping belief states with generative environment models for rl. Advances in Neural Information Processing Systems, 32, 2019.
Kuo et al. [2023] Chia-Wen Kuo, Chih-Yao Ma, Judy Hoffman, and Zsolt Kira. Structure-encoding auxiliary tasks for improved visual representation in vision-and-language navigation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1104–1113, 2023.
Jia et al. [2022] Zhiwei Jia, Xuanlin Li, Zhan Ling, Shuang Liu, Yiran Wu, and Hao Su. Improving policy optimization with generalist-specialist learning. In International Conference on Machine Learning, pages 10104–10119. PMLR, 2022.
Baker et al. [2019] Bowen Baker, Ingmar Kanitscheider, Todor Markov, Yi Wu, Glenn Powell, Bob McGrew, and Igor Mordatch. Emergent tool use from multi-agent autocurricula. arXiv preprint arXiv:1909.07528, 2019.
He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CVPR, 2016.
Hessel et al. [2019] Matteo Hessel, Hubert Soyer, Lasse Espeholt, Wojciech Czarnecki, Simon Schmitt, and Hado Van Hasselt. Multi-task deep reinforcement learning with popart. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 3796–3803, 2019.
robotics [2020] Fetch robotics. Fetch. http://fetchrobotics.com/, 2020.
Szot et al. [2022] Andrew Szot, Karmesh Yadav, Alex Clegg, Vincent-Pierre Berges, Aaron Gokaslan, Angel Chang, Manolis Savva, Zsolt Kira, and Dhruv Batra. Habitat rearrangement challenge 2022. https://aihabitat.org/challenge/2022_rearrange, 2022.
Fikes and Nilsson [1971] Richard E Fikes and Nils J Nilsson. Strips: A new approach to the application of theorem proving to problem solving. Artificial intelligence, 2(3-4):189–208, 1971.
Wijmans et al. [2019] Erik Wijmans, Abhishek Kadian, Ari Morcos, Stefan Lee, Irfan Essa, Devi Parikh, Manolis Savva, and Dhruv Batra. Dd-ppo: Learning near-perfect pointgoal navigators from 2.5 billion frames. arXiv preprint arXiv:1911.00357, 2019.

Appendix A Additional Method Details

A.1 Pseudocode

Initialize policy $\pi_{\theta}$
Initialize state space, observation space, and action space $\{\mathcal{S},\mathcal{O},\mathcal{A}\}$
Define relevance $w_{i}:\mathcal{S}\rightarrow\{0,1\}\;\forall\;i\in\{1,2,\cdots,N\}$
Initialize distillation coefficient and normalization parameter $\lambda$ , $\beta=3\times 10^{-4}$
for $epoch\leftarrow 1$ to train-epochs do
  // Bold face represents a vector.
  Collect a batch of $B$ samples $\{\mathbf{r}_{i},\mathbf{o}_{i},\mathbf{s}_{i},\mathbf{a}_{i}\}_{i=0}^{N}$ by executing $\{\pi_{\theta}^{T_{i}}\}_{i=0}^{N}$
  Normalize returns per task: $\{\mathbf{r}^{n}_{i}\}_{i=0}^{N}=PopArt(\{\mathbf{r}_{i}\}_{i=0}^{N},\beta)$
  Obtain $\{\pi_{\theta}^{T_{i}}(\mathbf{o}_{0})\}_{i=1}^{N}$ by evaluating $\mathbf{o}_{0}$ with $\{T_{i}\}_{i=1}^{N}$
  // Compute losses
  Compute RL losses $L_{RL}$ using $\{\mathbf{r}^{n}_{i},\mathbf{o}_{i},\mathbf{s}_{i},\mathbf{a}_{i}\}_{i=0}^{N}$
  Compute distill loss $L_{d}=\frac{1}{N}\sum_{i=1}^{N}\sum_{t=0}^{B}w_{i}(s^{t}_{0})D_{\text{KL}}(\pi_{\theta}^{T_{0}}(o^{t}_{0})\;\|\;\pi_{\theta}^{T_{i}}(o^{t}_{0}))$
  Update $\pi_{\theta}$ using $L_{distill}=L_{RL}+\lambda L_{d}$
end for

Algorithm 1 The workflow for training AuxDistill from a randomly initialized policy

In this section, we describe the workflow to implement AuxDistill, including data collection for all of the tasks in our training mix, computing the distillation loss, and policy optimization. The full procedure is given in Algorithm 1.

We initialize a policy $\pi_{\theta}$ along with the state space $\mathcal{S}$ , observation space $\mathcal{O}$ , and action space $\mathcal{A}$ . The relevance function $w_{i}:\mathcal{S}\rightarrow\{0,1\}$ for $i\in\{1,\cdots,N\}$ is defined as a mapping from the robot state to a binary relevance used for distillation. The distillation coefficient is set to $\lambda=0.1$ and the PopArt normalization parameter to $\beta=3\times 10^{-4}$ . For each epoch of our training cycle, we execute our policy on each of the tasks and collect a batch of returns $\{\mathbf{r}_{i}\}_{i=0}^{N}$ , observations $\{\mathbf{o}_{i}\}_{i=0}^{N}$ , states $\{\mathbf{s}_{i}\}_{i=0}^{N}$ , and actions $\{\mathbf{a}_{i}\}_{i=0}^{N}$ . The returns are then normalized per task using PopArt with parameter $\beta$ to obtain $\{\mathbf{r}^{n}_{i}\}_{i=0}^{N}$ . The observations of the main task $\mathbf{o}_{0}$ are evaluated under each of the auxiliary task indicators $T_{i}$ to obtain $\{\pi^{T_{i}}_{\theta}(\mathbf{o}_{0})\}_{i=1}^{N}$ . Note that here $\mathbf{s}_{i}$ represents the complete state information at a given time step, whereas $\mathbf{o}_{i}$ represents the visual observation available to the agent.

As described in Section 3, our policy optimizes a weighted combination of two objectives. The first is the RL loss $L_{RL}$ of a regular PPO update, computed using the normalized returns, observations, states, and actions $\{\mathbf{r}_{i}^{n},\mathbf{o}_{i},\mathbf{s}_{i},\mathbf{a}_{i}\}_{i=0}^{N}$ . The second is the distillation loss $L_{d}$ , computed for each time step $t\in\{1,2,\cdots,B\}$ as the KL divergence $D_{\text{KL}}(\pi_{\theta}^{T_{0}}(o^{t}_{0})\;\|\;\pi_{\theta}^{T_{i}}(o^{t}_{0}))$ weighted by the task relevance $w_{i}(s_{0}^{t})$ . The total loss for policy optimization is computed using the weighting parameter $\lambda$ as $L_{\text{distill}}=L_{RL}+\lambda L_{d}$ .
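The loss computation described above can be sketched in plain Python. This is a minimal illustration and not the authors' implementation: it assumes diagonal Gaussian action distributions (consistent with the policy head in Section A.2), and the names `gaussian_kl`, `distill_loss`, and `total_loss` are hypothetical. Whether gradients are stopped through the auxiliary-task distribution is not specified in this section, so the sketch only computes scalar loss values.

```python
import math
from typing import List, Sequence, Tuple

# A diagonal Gaussian over actions: (means, standard deviations).
Gaussian = Tuple[Sequence[float], Sequence[float]]


def gaussian_kl(p: Gaussian, q: Gaussian) -> float:
    """KL(p || q) for diagonal Gaussians, summed over action dimensions."""
    mu_p, sd_p = p
    mu_q, sd_q = q
    return sum(
        math.log(sq / sp) + (sp ** 2 + (mp - mq) ** 2) / (2.0 * sq ** 2) - 0.5
        for mp, sp, mq, sq in zip(mu_p, sd_p, mu_q, sd_q)
    )


def distill_loss(
    main_dists: List[Gaussian],       # pi^{T_0}(o_0^t) for each step t
    aux_dists: List[List[Gaussian]],  # aux_dists[i][t] = pi^{T_i}(o_0^t)
    relevance: List[List[int]],       # relevance[i][t] = w_i(s_0^t) in {0, 1}
) -> float:
    """L_d: relevance-weighted KL, summed over steps, averaged over N tasks."""
    n_aux = len(aux_dists)
    total = 0.0
    for i in range(n_aux):
        for t, main in enumerate(main_dists):
            total += relevance[i][t] * gaussian_kl(main, aux_dists[i][t])
    return total / n_aux


def total_loss(rl_loss: float, l_d: float, lam: float = 0.1) -> float:
    """L_distill = L_RL + lambda * L_d."""
    return rl_loss + lam * l_d
```

Steps where no auxiliary task is relevant contribute nothing to $L_{d}$ , so on those steps the update reduces to the RL loss alone.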

A.2 Additional Implementation Details

In this section, we further detail network architecture and implementation details used in the training of AuxDistill. The agent captures visual observations using a $256\times 256$ depth sensor, which is encoded by a ResNet50 [47] architecture. The visual features are concatenated with the proprioceptive and goal sensor observations and passed onto the LSTM backbone network. The LSTM architecture has $2$ hidden layers with $128$ hidden units per layer, which generates a state representation of the environment. The LSTM output features are regressed to a multi-task value-head using a 2-layer critic network after concatenating with the task indicator $T$ . The features generated by the output layer of the LSTM are used to regress to $\{\mu,\sigma\}\in\mathbb{R}^{10}$ . The actions are then sampled from a Gaussian distribution defined by $\{\mu,\sigma\}$ , i.e., $a_{t}\sim\mathcal{N}(\mu,\sigma)$ . Overall, our network has $\approx 13M$ trainable parameters.
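As a rough sketch of the two heads described above, the following plain-Python snippet builds the task-conditioned critic input and samples a 10-dimensional action from a diagonal Gaussian. The function names, the 128-dimensional feature size (taken from the LSTM hidden size), and the 6-way task indicator (5 auxiliary tasks plus the main task) are assumptions made for illustration; the actual network layers are omitted.

```python
import random
from typing import List, Sequence

ACTION_DIM = 10   # dimensionality of mu and sigma stated in the text
NUM_TASKS = 6     # assumed: 5 auxiliary tasks plus the main task


def critic_input(features: Sequence[float], task_id: int,
                 num_tasks: int = NUM_TASKS) -> List[float]:
    """Concatenate LSTM output features with a one-hot task indicator T."""
    one_hot = [1.0 if k == task_id else 0.0 for k in range(num_tasks)]
    return list(features) + one_hot


def sample_action(mu: Sequence[float], sigma: Sequence[float],
                  rng: random.Random) -> List[float]:
    """Draw a_t ~ N(mu, sigma) independently per action dimension."""
    if len(mu) != ACTION_DIM or len(sigma) != ACTION_DIM:
        raise ValueError("mu and sigma must both have ACTION_DIM entries")
    return [rng.gauss(m, s) for m, s in zip(mu, sigma)]
```

For example, `critic_input([0.0] * 128, task_id=0)` yields a 134-dimensional vector, which is what a multi-task value head would consume under these assumptions.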

The policy is updated using DD-PPO [52], with data collected from $24$ environment workers parallelized across $8$ GPUs and spread over $6$ tasks (5 auxiliary tasks and the main task). The policy is updated after collecting $128$ steps of experience in each worker with a pre-emption threshold of $0.25$ . The policy is trained for $450M$ steps collected across all auxiliary tasks. The starting learning rate is $3\times 10^{-4}$ with linear learning rate decay to $0$ over $500M$ steps. Before each policy update, we normalize the returns per task using PopArt with $\beta=3\times 10^{-4}$ (see [48] for more details).
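The per-task return normalization can be illustrated with the running-moment statistics that PopArt maintains. The sketch below is a simplified assumption-laden version: it tracks only the first and second moments with step size $\beta$ and normalizes returns, and it omits the accompanying rescaling of the value head that full PopArt performs to preserve outputs. The class name, the initial moments, and the variance floor are hypothetical choices.

```python
import math
from typing import Dict, List


class PopArtStats:
    """Running first and second moments of returns for a single task."""

    def __init__(self, beta: float = 3e-4) -> None:
        self.beta = beta
        self.mean = 0.0            # assumed initial first moment
        self.second_moment = 1.0   # assumed initial second moment

    @property
    def std(self) -> float:
        variance = self.second_moment - self.mean ** 2
        return math.sqrt(max(variance, 1e-8))  # floor avoids division by zero

    def update(self, ret: float) -> None:
        """Exponential moving average update with step size beta."""
        self.mean = (1.0 - self.beta) * self.mean + self.beta * ret
        self.second_moment = (
            (1.0 - self.beta) * self.second_moment + self.beta * ret ** 2
        )

    def normalize(self, ret: float) -> float:
        return (ret - self.mean) / self.std


def normalize_per_task(returns: Dict[int, List[float]],
                       stats: Dict[int, PopArtStats]) -> Dict[int, List[float]]:
    """Update each task's statistics, then normalize that task's returns."""
    normalized = {}
    for task_id, task_returns in returns.items():
        for ret in task_returns:
            stats[task_id].update(ret)
        normalized[task_id] = [stats[task_id].normalize(r) for r in task_returns]
    return normalized
```

Keeping separate statistics per task indicator prevents tasks with large return magnitudes from dominating the shared value loss.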

We utilize PPO with a value loss coefficient of $0.5$ and an entropy coefficient of $0.001$ to incentivize exploration. Considering the longer horizon of the rearrangement task, we set the discount rate to $\gamma=0.999$ . The distillation loss is computed by evaluating the observations of the main task under all task indicators $\{T_{i}\}_{i=1}^{N}$ and computing the weighted KL divergence using the auxiliary task relevance function. In our experiments, we determine the relevance function using the episode’s current robot state and oracle task plan (see Table 5 for more details). The distillation loss is used in conjunction with the RL losses with a weighting factor of $\lambda=0.1$ .
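To see why a discount rate close to $1$ matters for episodes of up to $1500$ steps, the short sketch below computes the standard effective horizon $1/(1-\gamma)$ and the weight $\gamma^{k}$ placed on a reward $k$ steps in the future. The comparison value $\gamma=0.99$ is a common default that we introduce only for illustration; it does not come from the experiments above.

```python
def effective_horizon(gamma: float) -> float:
    """Approximate number of steps over which rewards matter: 1 / (1 - gamma)."""
    return 1.0 / (1.0 - gamma)


def discount_weight(gamma: float, steps: int) -> float:
    """Weight gamma**steps applied to a reward received `steps` steps ahead."""
    return gamma ** steps


if __name__ == "__main__":
    for gamma in (0.99, 0.999):
        print(gamma, effective_horizon(gamma), discount_weight(gamma, 1500))
```

With $\gamma=0.999$ the effective horizon is about $1000$ steps and a reward at step $1500$ keeps a weight of roughly $0.22$ , whereas with $\gamma=0.99$ the same reward is discounted to below $10^{-6}$ .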

Appendix B Additional Experiment Details

B.1 Additional Task Details

We train our method on a setup similar to the Habitat-Rearrangement challenge [50], which involves a robot spawned in a home indoor environment with the goal of manipulating objects without having access to privileged maps or other oracle information and solely operating using egocentric perception. Each episode is specified by the 3D coordinate location of the object and goal at the beginning of the episode. As this dataset is imbalanced across easy and hard episodes, we sub-sample $11,791$ episodes from the entire dataset of $50,000$ episodes in the rearrange dataset. The episodes are obtained by selecting an equal percentage of episodes with the object spawned inside a closed cabinet, closed fridge, or open receptacles. To carry out the task, we use a Fetch robot with a mobile base and a 7-DOF arm with a gripper. We provide the following inputs to the policy:

• Depth Camera: a $256\times 256$ depth camera attached to the head of the robot.
• Coordinate Sensors: the Euclidean distance between the 3D coordinate of the robot end-effector and the object to be picked up and the location where the object is to be placed.
• Holding Sensor: indicating whether the robot is grasping any object.
• Relative Resting Position Sensor: the Euclidean distance of the 3D coordinate position of the end-effector to the resting position.
• Joint Sensor: indicating the 7-DOF joint position of the robot arm.
• Task Indicator: encoded as a one-hot vector indicating which of the tasks is currently being executed.

The agent can interact with the environment for a maximum of $1500$ steps. Further, we enforce a step threshold for each task stage based on the auxiliary tasks defined in Section B.1.1. During training, we utilize an instantaneous force threshold of $30$ kN, but do not apply the force threshold during evaluation, consistent with [16].

B.1.1 Auxiliary Task Definitions

In this section, we describe the auxiliary task definitions for each of the tasks utilized in our training and ablation experiments.

Pick: This task involves the agent being spawned randomly in the house at least $3m$ from the object of interest without the object in hand. The task is considered successful if the agent navigates to the object, picks it up by calling a grip action within $0.15m$ of the object, and rearranges its arm to within $0.15m$ of the resting position. The horizon length of this task is $700$ steps. The reward function for this task is:

$$R(s_{t})=10\mathbb{I}_{success}+2\Delta^{o}_{arm}\mathbb{I}_{!holding}+2\Delta^{r}_{arm}\mathbb{I}_{holding}+2\mathbb{I}_{pick}$$

Here, $\mathbb{I}_{pick}$ represents the condition of the pick skill successfully picking up the object, and $\mathbb{I}_{success}$ represents the agent being able to pick up the object successfully and rearrange its arm to the resting position. $\Delta^{o}_{arm}$ represents the Euclidean distance of the robot arm to the object, applied while the agent is not holding the object, and $\Delta^{r}_{arm}$ represents the deviation from the resting position, applied once the agent is holding the object.

Place: This task involves the agent being spawned randomly in the house at least $3m$ from the target location with the object in hand. The agent has to navigate to the target receptacle, place the object within $0.15m$ of the goal location, and rearrange its arm to its resting position. The horizon length of this task is $700$ steps. The reward function for this task is:

$$R(s_{t})=10\mathbb{I}_{success}+2\Delta^{t}_{arm}\mathbb{I}_{holding}+2\Delta^{r}_{arm}\mathbb{I}_{!holding}+5\mathbb{I}_{place}$$

Here, $\mathbb{I}_{success}$ and $\mathbb{I}_{place}$ represent sparse rewards for successful task completion and placing the object, respectively. $\Delta^{t}_{arm}$ represents the per-time-step deviation of the robot arm to the target location when the agent is holding the object, and $\Delta^{r}_{arm}$ represents the deviation of the robot arm towards the resting position after the object has been placed successfully.

Open-Cabinet: This skill involves the robot being spawned randomly in the house, with the task of navigating to the cabinet and opening the drawer by calling the grasp action within $0.15m$ of the drawer handle marker. The drawer is then opened to a joint position of $0.45$ . Further, the agent must successfully rearrange its arm to its resting position. The task horizon length for this task is $600$ steps. The reward structure for this task is given by,

$$R(s_{t})=10\mathbb{I}_{success}+\Delta^{m}_{arm}\mathbb{I}_{!open}+10\Delta^{r}_{arm}\mathbb{I}_{open}+5\mathbb{I}_{open}+5\mathbb{I}_{grasp}$$

Here, $\mathbb{I}_{success}$ is an indicator for successful opening followed by arm rearrangement, $\mathbb{I}_{open}$ is the indicator for the drawer being successfully opened, and $\mathbb{I}_{grasp}$ is an indicator for the drawer handle being successfully grasped. $\Delta^{m}_{arm},\Delta^{r}_{arm}$ are used to encode the dense time-step reward based on the change in arm position to the target marker location and the resting position, respectively.

Open Fridge: This task involves the robot being spawned randomly in the house, with the task of navigating to the fridge in the scene and grasping the fridge handle marker by calling the grasp action within $0.15m$ of the fridge door handle. The fridge door must be opened to a joint position of $1.22$ , and the arm must be rearranged to its resting position. The task horizon length is $600$ steps. The per-time-step reward is modeled as:

$$R(s_{t})=10\mathbb{I}_{success}+\Delta^{m}_{arm}\mathbb{I}_{!open}+\Delta^{r}_{arm}\mathbb{I}_{open}+5\mathbb{I}_{open}+5\mathbb{I}_{grasp}$$

Here, $\mathbb{I}_{success}$ , $\mathbb{I}_{open}$ and $\mathbb{I}_{grasp}$ are similar to the ones defined for the Open-Cabinet skill.
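The four reward functions defined above can be transcribed directly into code. The sketch below is illustrative only: the function and argument names are hypothetical, the dense terms $\Delta$ are passed in by the caller because their exact computation and sign convention are not fully specified here, and indicators are plain booleans evaluated at the current step.

```python
def pick_reward(success: bool, holding: bool, picked: bool,
                d_obj: float, d_rest: float) -> float:
    """Pick: 10*I_success + 2*D_obj*I_!holding + 2*D_rest*I_holding + 2*I_pick."""
    return (10.0 * success + 2.0 * d_obj * (not holding)
            + 2.0 * d_rest * holding + 2.0 * picked)


def place_reward(success: bool, holding: bool, placed: bool,
                 d_target: float, d_rest: float) -> float:
    """Place: 10*I_success + 2*D_target*I_holding + 2*D_rest*I_!holding + 5*I_place."""
    return (10.0 * success + 2.0 * d_target * holding
            + 2.0 * d_rest * (not holding) + 5.0 * placed)


def open_cabinet_reward(success: bool, opened: bool, grasped: bool,
                        d_marker: float, d_rest: float) -> float:
    """Open-Cabinet: note the weight of 10 on the resting-position term."""
    return (10.0 * success + d_marker * (not opened)
            + 10.0 * d_rest * opened + 5.0 * opened + 5.0 * grasped)


def open_fridge_reward(success: bool, opened: bool, grasped: bool,
                       d_marker: float, d_rest: float) -> float:
    """Open Fridge: same structure, with unit weight on the resting term."""
    return (10.0 * success + d_marker * (not opened)
            + d_rest * opened + 5.0 * opened + 5.0 * grasped)
```

Written side by side, the only structural difference between the two opening tasks is the weight on the resting-position term after the receptacle is open ( $10$ for the cabinet, $1$ for the fridge).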

Pick from Fridge: This task is similar in structure to the Pick skill, except that the object has to be picked up from an open refrigerator, with the agent being spawned $<2m$ from the target object.

Which of these auxiliary tasks are utilized for distillation is determined by whether each task is relevant to the agent’s current state in the rearrangement task. We list the relevance conditions for each task in Table 5.

	Pre-Conditions for Distillation
Auxiliary Task	Object Receptacle	Did Pick Object?	Is Receptacle Open?
Pick	Open	$\times$	✓
Place	Open,Fridge,Cabinet	✓	✓, $\times$
Open-Fridge	Fridge	$\times$	$\times$
Open-Cabinet	Cabinet	$\times$	$\times$
Pick from Fridge	Fridge	$\times$	✓

Table 5: A table representing the relevance of each of the auxiliary tasks based on the stage of the task the robot is in. Did Pick Object? represents the success of picking up the correct object and Is Receptacle Open? represents whether the robot has successfully opened the receptacle once during the episode. The Object-Receptacle encodes oracle information about the category of episodes we’re operating on. The open fridge and open cabinet tasks represent cases when the object is in an open receptacle.
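The pre-conditions in Table 5 can be read as a binary relevance function $w_{i}(s_{t})$ . The following sketch encodes the table rows as predicates over a small stage-state record. The record fields, the task keys, and the string labels for receptacle categories are hypothetical names chosen for illustration, and the sketch assumes the three table columns are the only inputs to the relevance function.

```python
from dataclasses import dataclass
from typing import Callable, Dict

RECEPTACLES = ("open", "fridge", "cabinet")


@dataclass(frozen=True)
class StageState:
    receptacle: str        # oracle episode category, one of RECEPTACLES
    did_pick: bool         # has the correct object been picked?
    receptacle_open: bool  # has the receptacle been opened this episode?


# One predicate per row of Table 5.
PRECONDITIONS: Dict[str, Callable[[StageState], bool]] = {
    "pick": lambda s: (s.receptacle == "open" and not s.did_pick
                       and s.receptacle_open),
    "place": lambda s: s.receptacle in RECEPTACLES and s.did_pick,
    "open_fridge": lambda s: (s.receptacle == "fridge" and not s.did_pick
                              and not s.receptacle_open),
    "open_cabinet": lambda s: (s.receptacle == "cabinet" and not s.did_pick
                               and not s.receptacle_open),
    "pick_from_fridge": lambda s: (s.receptacle == "fridge" and not s.did_pick
                                   and s.receptacle_open),
}


def relevance(task: str, state: StageState) -> int:
    """Binary relevance w_i(s_t) in {0, 1} for one auxiliary task."""
    return int(PRECONDITIONS[task](state))
```

Under this reading, a fridge episode moves through Open-Fridge, then Pick from Fridge, then Place as the two boolean flags flip, with exactly one auxiliary task relevant at each stage.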

B.2 Additional Baseline Details

B.2.1 M3 $\&$ M3-Oracle

This baseline sequences mobile manipulation policies from [17], which include Navigation, Pick, Place, Open-Fridge, and Open-Cabinet, each of which is trained for $100M$ steps using PPO [34]. M3 (Oracle) is an oracle version of M3 that has access to privileged information about the oracle sequence of skills based on whether the object begins inside a closed or open receptacle at the start of the episode. Each policy is similar to the ones reported in [16].

B.2.2 RL-Curriculum

This baseline is a two-stage variant of our method: all the auxiliary tasks $\{\mathcal{M}_{i}\}_{i=1}^{N}$ are first trained at once, followed by learning on the main task $\mathcal{M}_{0}$. The intuition behind this strategy is to first learn a policy capable of performing the simpler auxiliary tasks, which can then be leveraged in the second stage to learn the challenging main task of interest. We allocate a budget of $500M$ steps of agent experience, divided into two stages: (i) $200M$ steps across all auxiliary tasks, followed by (ii) $300M$ steps to learn the main task. In the first stage, we encode each of the tasks using a separate identifier $T$. During the second stage, we pass in a task identifier for the main rearrangement task, which is unseen during the first stage. We begin each stage with a learning rate of $3\times10^{-4}$ and apply linear learning rate decay over $300M$ steps.
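The budget split and learning rate schedule of this baseline can be sketched as follows. This is an illustration only: the names are hypothetical, and the sketch assumes that the linear decay runs toward zero and restarts at the beginning of each stage, which the description above does not state explicitly.

```python
TOTAL_BUDGET = 500_000_000                   # total agent experience (steps)
STAGE1_STEPS = 200_000_000                   # stage (i): all auxiliary tasks
STAGE2_STEPS = TOTAL_BUDGET - STAGE1_STEPS   # stage (ii): main task, 300M steps
BASE_LR = 3e-4                               # learning rate at the start of each stage
DECAY_STEPS = 300_000_000                    # horizon of the linear decay


def stage_and_lr(global_step: int):
    """Map a global step to (stage, learning rate) under the stated assumptions."""
    if global_step < STAGE1_STEPS:
        stage, step_in_stage = 1, global_step
    else:
        stage, step_in_stage = 2, global_step - STAGE1_STEPS
    # Assumed: decay linearly from BASE_LR toward zero over DECAY_STEPS.
    frac = min(step_in_stage / DECAY_STEPS, 1.0)
    return stage, BASE_LR * (1.0 - frac)
```

Under these assumptions, stage 1 ends before its decay horizon is reached, so the learning rate is still one third of its initial value when training switches to the main task.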

B.3 Auxiliary Task Learning

In Figure 4, we show the results of auxiliary task learning for the main results reported in Table 1. We report up to $250M$ steps of learning to demonstrate the results of all baselines, including stage 1 of the curriculum baseline.

Among all the skills, the open and close cabinet skills are the easiest to learn and show the lowest variance during training. The remaining skills, in increasing order of difficulty, are Pick < Pick from Fridge < Place. The Place skill is the hardest because of its distinct starting state distribution: the robot is spawned with the object in hand, which is not the case for any of the other skills.

B.4 Category Pick

As described in Section 4.2, the Category Pick task demonstrates the merits of AuxDistill on a challenging observation space in which the target is specified by its object category rather than its 3D coordinate location. We encode the object category as a one-hot sensor across a total of $20$ objects. The agent receives an RGB sensor observation instead of the depth sensor used in our rearrangement experiments, and the robot is spawned $<2m$ from the target receptacle. All other RL training parameters, including the reward structure, episode horizon length, and success condition, are similar to those of coordinate pick, as described in B.1.1.
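The one-hot category sensor described above can be illustrated with a short sketch. The function name and the use of an integer category index are assumptions made for illustration; only the vector length of $20$ comes from the text.

```python
NUM_OBJECTS = 20  # size of the one-hot category sensor


def one_hot_category(category_id: int, num_objects: int = NUM_OBJECTS) -> list:
    """Encode an object category index as a one-hot vector (illustrative)."""
    if not 0 <= category_id < num_objects:
        raise ValueError(f"category_id must be in [0, {num_objects})")
    vec = [0.0] * num_objects
    vec[category_id] = 1.0   # single active entry marks the target category
    return vec
```

Unlike a 3D coordinate, this vector carries no spatial information, so the agent has to locate the target from its RGB observation.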

We train Category Pick with $16$ environment workers, $8$ environments each for coordinate pick and Category Pick, parallelized across $8$ GPUs. We train for $140M$ steps collected across both Category Pick and coordinate pick, with a starting learning rate of $2\times10^{-4}$ and linear learning rate decay across $200M$ steps of training. We compare this with the monolithic baseline, in which all $16$ environments are devoted to Category Pick. For training, we use the standard rearrange-easy dataset with $50,000$ episodes, and we evaluate on $1,000$ episodes from the standard validation split.

Appendix C Emergent Curriculum During Training

In this section, we discuss how training AuxDistill results in a curriculum. Note that AuxDistill does not enforce this curriculum explicitly; it arises as a consequence of the multi-task RL training regime, which optimizes for maximum cumulative return.

In Figure 5, we show two such curricula. Figure 5(a) shows that the easy distribution is learned first during training, followed by the hard distribution. Recall that in Table 2 we show that rearrangement fails to learn on hard episodes without including the easier Pick auxiliary task during training. Building on this observation, Figure 5(b) shows another curriculum, between the auxiliary tasks and the easy main task. Tasks improve in the order Pick < Place < Rearrange-Easy. This trend does not hold throughout training, however: the success rate for the Place skill saturates sooner, at about $300M$ steps. This difference could be due to the stricter success condition for Place, which requires arm rearrangement that is not required by the easy main task.