\method: Efficient Policy Learning by Extracting Transferrable Robot Skills from Offline Data

Jesse Zhang1, Minho Heo2, Zuxin Liu3,
Erdem Bıyık1, Joseph J. Lim2, Yao Liu4, Rasool Fakoor4
1University of Southern California, 2KAIST, 3CMU, 4Amazon Web Services
Abstract

Most reinforcement learning (RL) methods focus on learning optimal policies over low-level action spaces. While these methods can perform well in their training environments, they lack the flexibility to transfer to new tasks. Instead, RL agents that can act over useful, temporally extended skills rather than low-level actions can learn new tasks more easily. Prior work in skill-based RL either requires expert supervision to define useful skills, which is hard to scale, or learns a skill-space from offline data with heuristics that limit the adaptability of the skills, making them difficult to transfer during downstream RL. Our approach, \method, instead utilizes pre-trained vision language models to extract a discrete set of semantically meaningful skills from offline data, each of which is parameterized by continuous arguments, without human supervision. This skill parameterization allows robots to learn new tasks by only needing to learn when to select a specific skill and how to modify its arguments for the specific task. We demonstrate through experiments in sparse-reward, image-based, robot manipulation environments that \method can more quickly learn new tasks than prior works, with major gains in sample efficiency and performance over prior skill-based RL. Website at https://jessezhang.net/projects/extract.

1 Introduction

Imagine learning to play racquetball as a complete novice. Without prior experience in racket sports, this poses a daunting task that requires learning not only the (1) complex, high-level strategies to control when to serve, smash, and return the ball but also (2) how to actualize these moves in terms of fine-grained motor control. However, a squash player should have a considerably easier time adjusting to racquetball as they already know how to serve, take shots, and return; they simply need to learn when to use these skills and how to adjust them for larger racquetball balls. Our paper aims to make use of this intuition to enable efficient learning of new robotics tasks.

In general, humans can learn new tasks quickly, given prior experience, by adjusting existing skills for the new task [1, 2]. Skill-based reinforcement learning (RL) aims to emulate this transfer [3, 4, 5, 6, 7, 8, 9, 10, 11, 12] in learned agents by equipping them with a wide range of skills (i.e., temporally extended action sequences) that they can call upon for efficient downstream learning. Transferring to new tasks in standard RL, based on low-level environment actions, is challenging because the learned policy becomes more task-specific as it learns to solve its training tasks [13, 14, 15, 16, 17]. In contrast, skill-based RL leverages temporally extended skills that can both be transferred across tasks and yield more informed exploration [6, 18, 19], thereby leading to more effective transfer and learning. However, existing skill-based RL approaches rely on costly human supervision [20, 21, 9, 10] or restrictive skill definitions [22, 6, 8] that limit the expressiveness and adaptability of the skills. Therefore, we ask: how can robots discover adaptable skills for efficient transfer learning without costly human supervision?

Figure 1: \method extracts, without supervision, a discrete set of skills from offline data that can be used for efficient learning of new tasks. (1) \method first uses VLMs to extract a discrete set of aligned skills from image-action data. (2) \method then trains a skill decoder to output low-level actions given discrete skill IDs and learned continuous arguments. (3) This decoder helps a skill-based policy efficiently learn new tasks with a simplified action space over skill IDs and arguments.

Calling back to the squash to racquetball transfer example, we humans categorize different racket movements into discrete skills—for example, a “forehand swing” is distinct from a “backhand return.” These discrete skills can be directly transferred by making minor modifications for racquetball’s larger balls and different rackets. This process is akin to that of calling a programmatic API, e.g., def forehand(x, y), where learning to transfer reduces to learning when to call discrete functions (e.g., forehand() vs backhand()) and how to execute them (i.e., what their arguments should be). In this paper, we propose a method to accelerate transfer learning by enabling robots to learn, without expert supervision, a discrete set of skills parameterized by input arguments that are useful for downstream tasks (see Figure 1). We assume access to an offline dataset of image-action pairs of trajectories from tasks that are different from the downstream target tasks. Our key insight is aligning skills by extracting high-level behaviors, i.e., discrete skills like “forehand swing,” from images in the dataset. However, two challenges preclude realizing this insight: (1) how to extract these input-parameterized skills, and (2) how to guide online learning of new tasks with these skills.

To this end, we propose \method (Extraction of Transferrable Robot Action Skills), a framework for extracting discrete, parameterized skills from offline data to guide online learning of new tasks. We first use pre-trained vision-language models (VLMs), trained to align images with language descriptions [23] so that images of similar high-level behaviors are embedded to similar latent embeddings [24], to extract—from our offline data—image embedding differences representing changes in high-level behaviors. Next, we cluster the embeddings in an unsupervised manner to form discrete skill clusters that represent high-level skills. To parameterize these skills, we train a skill decoder on these clusters, conditioned on the skill ID (e.g., representing a “backhand return”) and a learned argument (e.g., indicating velocity), to produce a skill consisting of a temporally extended, variable-length action sequence. Finally, to train a robot for new tasks, we train a skill-based RL policy to act over this skill-space while being guided by skill prior networks, learned from our offline skill data, guiding the policy for (1) when to select skills and (2) what their arguments should be.

In summary, \method enables sample-efficient transfer learning for robotic tasks by extracting a meaningful set of skills from offline data for an agent to use for learning new tasks. We first validate that \method learns a well-clustered set of skills. We then perform experiments across challenging, long-horizon, sparse-reward, image-based robotic manipulation tasks, demonstrating that \method agents can more quickly transfer skills to new tasks than prior work.

2 Related Work

Defining Skills Manually. Many works require manual definition of skills, e.g., as pre-defined primitives [4, 25, 26], subskill policies [27, 20, 28], or task sketches [29, 21], making them challenging to scale to arbitrary environments. Closest to ours, Dalal et al. [9] and Nasiriany et al. [10] hand-define a set of skills parameterized by continuous arguments. However, this hand-definition requires expensive human supervision and task-specific, environment-specific, or robot-specific fine-tuning. In contrast, \method automatically learns skills from offline data, which scales much better to learning multiple downstream tasks. We demonstrate in Section 5 that, given sufficient data coverage, skills extracted from data can transfer as effectively as hand-defined skills.

Unsupervised Skill Learning. A large body of prior work discovers skills in an unsupervised manner to accelerate learning new tasks. Some approaches use heuristics to extract skills from offline data, like defining skills as randomly sampled trajectories [30, 31, 32, 33, 34, 6, 8, 7, 35]. While these approaches have demonstrated that randomly sampled skill sequences can accelerate downstream learning, \method instead uses visual embeddings from VLMs to combine sequences performing similar behaviors into the same skill while allowing for intra-skill variation through their arguments. We show in Section 5 that our skill parameterization allows for more efficient online learning than randomly assigned skills. Moreover, Wan et al. [36] also learn skills via clustering visual features; however, in addition to major differences in methodology, they focus on imitation learning, which requires significant algorithmic changes to facilitate learning new tasks online [37, 38, 39]. Instead, we directly focus on online reinforcement learning of new tasks.

Another line of work aims to discover skills for tasks without offline data. Some learn skills while simultaneously attempting to solve the task [3, 40, 5, 41, 19, 11]. However, learning the skills and using them simultaneously is challenging, especially without dense reward supervision. Finally, some prior works construct unsupervised objectives, typically based on entropy maximization, to learn task-agnostic behaviors [42, 43, 44, 45, 46]. However, these entropy maximization objectives lead to learning a large set of skills, most of which form random behaviors unsuitable for any meaningful downstream task. Thus, using them to learn long-horizon, sparse-reward tasks is difficult. We focus on first extracting skills from demonstration data, assumed to have meaningful behaviors to learn from, for online learning of unseen, sparse-reward tasks.

3 Preliminaries

Problem Formulation. We assume access to an offline dataset of trajectories $\mathcal{D}=\{\tau_{1},\tau_{2},\ldots\}$, where each trajectory consists of ordered image observation and action tuples, $\tau_{i}=\left[(s_{1},a_{1}),(s_{2},a_{2}),\ldots\right]$. The downstream transfer learning problem is formulated as a Markov Decision Process (MDP) in which we want to learn a policy $\pi$ to maximize downstream rewards. Note that the offline dataset $\mathcal{D}$ does not contain trajectories from downstream task(s), although we assume that the state space $\mathcal{S}$ is shared and that actions in $\mathcal{D}$ can be used to solve downstream tasks.

SPiRL. In order to extract skills from offline data and use these skills for a new policy, we build on top of a previous skill-based RL method, namely SPiRL [6]. SPiRL focused on learning skills defined by randomly sampled, fixed-length action sequences. We briefly summarize SPiRL here: Given $H$-length sequences of consecutive actions from $\mathcal{D}$: $\bar{a}=a_{1},\ldots,a_{H}$, SPiRL learns (1) a generative skill decoder model, $p_{a}(\bar{a}\mid z)$, which decodes learned, latent skills $z$ encoded by a skill encoder $q(z\mid\bar{a})$ into environment action sequences $\bar{a}$, and (2) a state-conditioned skill prior $p_{z}(z\mid s)$ that predicts which latent skills $z$ are likely to be useful at state $s$. To learn a new task, SPiRL trains a skill-based policy $\pi(z\mid s)$, whose outputs $z$ are skills decoded by $p_{a}(\bar{a}\mid z)$ into low-level environment actions. The objective of policy learning is to maximize returns under $\pi(z\mid s)$ with a KL divergence constraint to regularize $\pi$ against the prior $p_{z}(z\mid s)$.

Figure 2: \method consists of three phases. (1) Skill Extraction: We extract a discrete set of skills from offline data by clustering together visual VLM difference embeddings representing high-level behaviors. (2) Skill Learning: We train a skill decoder model, $p_{a}(\bar{a}\mid z,d)$, to output variable-length action sequences conditioned on a skill ID $d$ and a learned continuous argument $z$. The argument $z$ is learned by training $p_{a}(\bar{a}\mid z,d)$ with a VAE reconstruction objective from action sequences encoded by a skill encoder, $q(z\mid\bar{a},d)$. We additionally train a skill selection prior $p_{d}(d\mid s)$ and a skill argument prior $p_{z}(z\mid s,d)$ to predict which skills $d$ and their arguments $z$ are useful for a given state $s$. Colorful arrows indicate gradients from reconstruction, argument prior, selection prior, and VAE losses. (3) Online RL: To learn a new task, we train a skill selection and skill argument policy with RL while regularizing them with the skill selection and skill argument priors.

4 Method

\method aims to discover a discrete skill library from an offline dataset that can be modulated through input arguments for learning new tasks efficiently. \method operates in three stages: (1) an offline skill extraction stage, (2) an offline skill learning stage in which we train a decoder model to reproduce action sequences given a skill choice and its arguments, and finally (3) an online RL stage for training an agent to utilize these skills for new tasks. See Figure 2 for a detailed overview.

4.1 Offline Skill Extraction

Feature extraction. We leverage vision-language models (VLMs), trained to align large corpora of images with natural language descriptions [23, 47, 48, 49], to extract high-level features used to label skills. Although our approach does not require language, we use VLMs because they were trained to align images with language, so their image embeddings form a semantically aligned embedding space. However, one main issue precludes the naïve application of VLMs in robotics: VLMs do not inherently account for object variations or robot arm starting positions across images [50, 24, 51, 52]. In robot manipulation, high-level behaviors should instead be characterized by changes in arm and object positions across a trajectory; picking up a cup should be considered the same skill regardless of whether the cup is to the robot’s left or right. Our initial experiments using the embeddings directly resulted in skills specific to one type of environment layout or object. Therefore, to capture high-level behaviors, we use trajectory-level embedding differences by taking the difference of each VLM image embedding with the first one in the trajectory:¹

$$e_{t}=\text{VLM}(s_{t})-\text{VLM}(s_{1}). \tag{1}$$

¹ To ensure that each timestep has an embedding, we assign embedding $e_{1}$ to be identical to $e_{2}$.
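As a minimal sketch of Eq. (1), assume the per-frame VLM embeddings of one trajectory have already been computed and are available as plain lists of floats. The function name `embedding_differences` and the toy 2-D vectors are illustrative assumptions, not part of the method description.

```python
from typing import List, Sequence

Vector = List[float]


def embedding_differences(embeddings: Sequence[Sequence[float]]) -> List[Vector]:
    """Return e_t = emb[t] - emb[0] for every timestep of one trajectory.

    The first difference would be all zeros, so (following the footnote)
    it is overwritten with a copy of the second difference.
    """
    if len(embeddings) < 2:
        raise ValueError("need at least two timesteps to form a difference")
    first = embeddings[0]
    diffs = [[x - x0 for x, x0 in zip(emb, first)] for emb in embeddings]
    diffs[0] = list(diffs[1])  # e_1 := e_2
    return diffs


# Toy 2-D vectors standing in for VLM(s_t); real features would come from the VLM.
toy = [[1.0, 1.0], [1.5, 1.0], [2.0, 3.0]]
print(embedding_differences(toy))
```

Because every embedding is measured relative to the first frame, adding the same offset to all embeddings of a trajectory leaves the output unchanged, which is one way to see why the differences are less tied to a particular starting configuration.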
Figure 3: Skill label assignment consists of (1) using the VLM embedding differences for clustering, then (2) applying a median filter over the labels to smooth out noisy assignments.

Skill label assignment. After creating embeddings $e_{t}$ for each image $s_{t}$, we assign skill labels in an unsupervised manner based on these features. Inspired by classical algorithms from speaker diarization, a long-studied problem in speech processing where the objective is to assign a “speaker label” to each speech timestep [53], we first perform unsupervised clustering with K-means on the entire dataset of embedding differences $e_{i}$ to assign per-timestep skill labels (the label is the cluster ID). We then smooth out the label assignments with a simple median filter run along the trajectory sequence to reduce the frequency of single or few-timestep label assignments. See Figure 3 for a visual demonstration of this process.
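The two steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the K-means here is a basic Lloyd's iteration with deterministic farthest-point seeding (the text does not specify an initialization), the filter window of 3 is arbitrary, and the names `kmeans_labels` and `median_filter` are hypothetical.

```python
from statistics import median_low
from typing import List, Sequence


def _sq_dist(p: Sequence[float], q: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(p, q))


def kmeans_labels(points: Sequence[Sequence[float]], k: int, iters: int = 50) -> List[int]:
    """Basic Lloyd's K-means; returns one cluster ID per point.

    Seeding is deterministic farthest-point selection, an assumption
    made here purely for reproducibility.
    """
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda p: min(_sq_dist(p, c) for c in centers))
        centers.append(list(far))
    labels: List[int] = []
    for _ in range(iters):
        new_labels = [min(range(k), key=lambda c: _sq_dist(p, centers[c])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def median_filter(labels: Sequence[int], window: int = 3) -> List[int]:
    """Smooth per-timestep labels with a sliding median (window clipped at the ends)."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    half = window // 2
    out = []
    for t in range(len(labels)):
        lo, hi = max(0, t - half), min(len(labels), t + half + 1)
        out.append(median_low(labels[lo:hi]))  # median_low always returns an existing label
    return out


noisy = [0, 0, 0, 2, 0, 0, 1, 1, 1, 1]
print(median_filter(noisy))  # the isolated label 2 at index 3 is smoothed away
```

Note that a median over cluster IDs treats the IDs as ordered; the sketch follows the text in using a median, although other smoothing rules are conceivable.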

In summary, we first extract observation embedding difference features with a VLM and then perform unsupervised K-means clustering to obtain skill labels for each trajectory timestep. This forms the skill-labeled dataset $\mathcal{D}_{d}=\{\tau_{d}^{1},\tau_{d}^{2},\ldots\}$, where each $\tau_{d}$ is a trajectory of sequential $(s,a)$ tuples that all belong to one skill $d$. Next, we perform skill learning on $\mathcal{D}_{d}$.
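One plausible way to build such a skill-labeled dataset from smoothed per-timestep labels is to cut each trajectory wherever the label changes. The sketch below assumes this contiguous-run segmentation; the helper name `split_into_skill_segments` and the string placeholders for states and actions are illustrative.

```python
from itertools import groupby
from typing import Any, List, Sequence, Tuple

Step = Tuple[Any, Any]  # one (s, a) tuple


def split_into_skill_segments(
    trajectory: Sequence[Step], labels: Sequence[int]
) -> List[Tuple[int, List[Step]]]:
    """Split one trajectory into (skill_id, segment) pairs of contiguous equal labels."""
    if len(trajectory) != len(labels):
        raise ValueError("need exactly one label per timestep")
    segments = []
    start = 0
    for skill_id, run in groupby(labels):
        length = len(list(run))
        segments.append((skill_id, list(trajectory[start:start + length])))
        start += length
    return segments


traj = [("s1", "a1"), ("s2", "a2"), ("s3", "a3"), ("s4", "a4"), ("s5", "a5")]
for skill_id, segment in split_into_skill_segments(traj, [3, 3, 1, 1, 1]):
    print(skill_id, segment)
```

Each resulting segment has a single skill ID and a variable length, which matches the kind of variable-length sequence the skill decoder is later trained on.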

4.2 Offline Skill Learning

We aim to learn a discrete set of skills, parameterized by continuous arguments, similar to a functional API over skills (see Figure 2 middle). Therefore, we train a generative skill decoder $p_{a}(\bar{a}\mid z,d)$ to convert a discrete skill choice $d$ and a continuous argument for that skill, $z$, into an action sequence. As alluded to in Section 3, we build upon SPiRL by Pertsch et al. [6]. However, they train their decoder to decode fixed-length action sequences from a single continuous latent $z$. In contrast, we automatically extract a set of variable-length skill trajectories with labels denoted $d$ and parameterize each skill by a learned, continuous latent argument $z$.²

² To simplify notation, we use $z$ for both our method and SPiRL. However, it is important to note that $z$ uniquely determines the skill in SPiRL, while $z$ denotes a continuous latent argument in our method.

We train an autoregressive VAE [54] consisting of the following learned neural network components: a skill argument encoder $q(z\mid\bar{a},d)$ mapping to a continuous latent $z$ conditioned on a discrete skill choice $d$ and an action sequence $\bar{a}$, and an autoregressive skill decoder $p_{a}(\bar{a}\mid z,d)$ conditioned on the latent $z$ and the discrete skill choice $d$. Because the action sequence $\bar{a}$ can be of various lengths, the decoder also learns to produce a continuous value $l$ at each autoregressive timestep representing the proportion of the skill completed at the current action. This variable is used during online RL to stop the execution of the skill when $l$ equals 1 (see Section B.1 for further details).
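To make the role of the progress value concrete, the following sketch shows one way progress targets and the stopping rule could look. It assumes the target at step $t$ of a length-$n$ sequence is $t/n$ and that execution stops once the predicted progress reaches a threshold; both choices, the `max_len` safety cap, and the toy decoder are assumptions for illustration only.

```python
from typing import Callable, List, Optional, Tuple

DecoderStep = Callable[[Optional[str]], Tuple[str, float]]


def progress_targets(length: int) -> List[float]:
    """Fraction of the skill completed after each action: 1/n, 2/n, ..., 1."""
    if length < 1:
        raise ValueError("length must be positive")
    return [(t + 1) / length for t in range(length)]


def run_skill(decoder_step: DecoderStep, max_len: int = 30, done_threshold: float = 1.0) -> List[str]:
    """Roll out one skill autoregressively until the predicted progress says it is done."""
    actions: List[str] = []
    prev_action: Optional[str] = None
    for _ in range(max_len):  # safety cap in case progress never reaches the threshold
        action, progress = decoder_step(prev_action)
        actions.append(action)
        prev_action = action
        if progress >= done_threshold:
            break
    return actions


def make_toy_decoder(length: int) -> DecoderStep:
    """Stand-in for a trained decoder: emits dummy actions with exact progress values."""
    targets = progress_targets(length)
    step_index = 0

    def step(prev_action: Optional[str]) -> Tuple[str, float]:
        nonlocal step_index
        t = min(step_index, length - 1)
        step_index += 1
        return f"a{t + 1}", targets[t]

    return step


print(progress_targets(4))
print(run_skill(make_toy_decoder(3)))
```

A learned decoder would rarely output exactly 1, so in practice a threshold slightly below 1 may be a reasonable choice; the text defers such details to its appendix.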

Recall that SPiRL also trains a skill prior network $p_{z}(z\mid s)$ that predicts which $z$ is useful for an observation $s$; this prior is used to guide a high-level policy toward selecting reasonable $z$ while performing RL. In contrast with SPiRL where $z$ uniquely represents a skill, we train two prior networks, one to guide the selection of the skill $d$, $p_{d}(d\mid s)$, and one to guide the selection of its argument $z$ given $d$, $p_{z}(z\mid s,d)$. These are trained with the observation from the first timestep of the sampled trajectory, $s_{1}$, to be able to guide a skill-based policy during online RL in choosing $d$ and $z$. Our full objective for training this VAE is to maximize the following:

$$\underset{\begin{subarray}{c}\bar{a},d,s_{1}\sim\mathcal{D}_{d}\\ z\sim q(\cdot\mid\bar{a},d)\end{subarray}}{\mathbb{E}}\Biggl[\biggl[\sum_{t=1}^{|\bar{a}|}\underbrace{\log p_{a}(a_{t},l\mid z,d)}_{\text{action rec. + progress pred.}}\biggr]-\underbrace{\beta\,\text{KL}\left(q(z\mid\bar{a},d)\parallel N(0,I)\right)}_{\text{VAE encoder KL regularization}}+\underbrace{\log p_{d}(d\mid s_{1})}_{\text{discrete skill prior}}+\underbrace{\log p_{z}(\mathbf{sg}(z)\mid s_{1},d)}_{\text{continuous arg. prior}}\Biggr], \tag{2}$$

where the stop-gradient $\mathbf{sg}(\cdot)$ prevents prior losses from influencing the encoder. The first two terms are the $\beta$-VAE objective [55]; the last two train priors to predict the correct skill $d$ and continuous argument $z$ given $s_{1}$.
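For a diagonal-Gaussian encoder, the KL term to $N(0,I)$ has a well-known closed form, $\tfrac{1}{2}\sum_i(\sigma_i^2+\mu_i^2-1-\log\sigma_i^2)$. The sketch below evaluates it and combines the four terms for a single sample, assuming the usual $\beta$-VAE convention in which the KL is subtracted from a maximized objective. The log-likelihood inputs are taken as given numbers, and the function names are hypothetical.

```python
import math
from typing import Sequence


def kl_to_standard_normal(mu: Sequence[float], log_var: Sequence[float]) -> float:
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, log_var))


def skill_objective(
    recon_log_prob: float,    # sum over t of log p_a(a_t, l | z, d)
    mu: Sequence[float],      # encoder mean for q(z | a_bar, d)
    log_var: Sequence[float], # encoder log-variance
    log_p_d: float,           # log p_d(d | s_1)
    log_p_z: float,           # log p_z(sg(z) | s_1, d); z would be detached in an autodiff framework
    beta: float = 1.0,
) -> float:
    """Per-sample value of the objective to maximize (assumed beta-VAE sign convention)."""
    return recon_log_prob - beta * kl_to_standard_normal(mu, log_var) + log_p_d + log_p_z


print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # posterior equals the prior
print(skill_objective(-2.0, [1.0], [0.0], log_p_d=-0.5, log_p_z=-1.0, beta=2.0))
```

The stop-gradient itself cannot be shown in plain Python; in an autodiff library it would correspond to detaching $z$ before it is passed to the argument prior.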

Additional fine-tuning. In extremely challenging transfer scenarios, demonstrations may still be needed to warm-start reinforcement learning [56]. \method can also flexibly be applied to this setting by using the same K-means clustering model from Section 4.1, which was trained to cluster $\mathcal{D}_{d}$, to assign skill labels to an additional, smaller demonstration dataset. After pre-training on $\mathcal{D}_{d}$, we then fine-tune the entire model on that labeled demonstration dataset before performing RL.
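Reusing a fitted K-means model on new data amounts to assigning each new embedding difference to its nearest existing centroid, without refitting. A minimal sketch, assuming the centroids are available as lists of floats (the name `assign_to_centroids` and the toy values are illustrative):

```python
from typing import List, Sequence


def assign_to_centroids(
    points: Sequence[Sequence[float]], centroids: Sequence[Sequence[float]]
) -> List[int]:
    """Label each point with the index of its nearest centroid (squared Euclidean distance)."""

    def sq_dist(p: Sequence[float], q: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(p, q))

    return [
        min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
        for p in points
    ]


# Centroids from the pre-training clustering (toy values) applied to new demo features.
centroids = [[0.0, 0.0], [10.0, 10.0]]
demo_features = [[1.0, 1.0], [9.0, 8.0], [0.0, -1.0]]
print(assign_to_centroids(demo_features, centroids))
```

Keeping the centroids fixed means the skill IDs of the demonstration data refer to the same skills the decoder and priors were pre-trained on.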

4.3 Online Skill-Based Reinforcement Learning

Finally, we describe how we perform RL for new tasks by training a skill-based policy to select skills and their arguments to solve new tasks. See Figure 2, right, for an overview of online RL.

Policy parameterization. After pre-training the decoder $p_{a}(\bar{a}\mid z,d)$, we treat it as a frozen lower-level policy that a learned skill-based policy can use to interact with a new task. Specifically, we train a skill-based policy $\pi(d,z\mid s)$ to output a $(d,z)$ tuple representing a discrete skill choice and its continuous argument. We parameterize this policy as a product of two policies: $\pi(d,z\mid s)=\pi_{d}(d\mid s)\,\pi_{z}(z\mid s,d)$ so that each component of $\pi(d,z\mid s)$ can be regularized with our pre-trained priors $p_{d}(d\mid s)$ and $p_{z}(z\mid s,d)$. Intuitively, this parameterization separates decision-making into what skill to use and how to use it. The complete factorization of the skill-based policy follows:

$$\pi(a\mid s)=\underbrace{p_{a}(\bar{a}\mid z,d)}_{\text{skill decoder}}\cdot\pi(d,z\mid s)=\underbrace{p_{a}(\bar{a}\mid z,d)}_{\text{skill decoder}}\cdot\underbrace{\pi_{d}(d\mid s)\cdot\pi_{z}(z\mid s,d)}_{\text{learned skill-based policy}}. \tag{3}$$
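The factorization can be illustrated by sampling in two steps, first a skill ID and then its argument, with the joint log-probability being the sum of the two parts. The sketch below assumes a categorical $\pi_d$ and a diagonal-Gaussian $\pi_z$ whose parameters depend on $d$; in a real policy these would be network outputs given $s$, and all names and numbers here are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple

GaussianParams = Tuple[Sequence[float], Sequence[float]]  # (mean, std) per dimension


def sample_skill(
    skill_probs: Sequence[float], arg_params: Sequence[GaussianParams], rng: random.Random
) -> Tuple[int, List[float]]:
    """Sample d ~ pi_d(. | s), then z ~ pi_z(. | s, d)."""
    d = rng.choices(range(len(skill_probs)), weights=skill_probs, k=1)[0]
    mu, sigma = arg_params[d]
    z = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    return d, z


def joint_log_prob(
    d: int, z: Sequence[float], skill_probs: Sequence[float], arg_params: Sequence[GaussianParams]
) -> float:
    """log pi(d, z | s) = log pi_d(d | s) + log pi_z(z | s, d)."""
    mu, sigma = arg_params[d]
    log_pd = math.log(skill_probs[d])
    log_pz = sum(
        -0.5 * math.log(2.0 * math.pi * s * s) - (x - m) ** 2 / (2.0 * s * s)
        for x, m, s in zip(z, mu, sigma)
    )
    return log_pd + log_pz


# Two skills with a 1-D argument each (toy values standing in for network outputs).
probs = [0.5, 0.5]
params = [([0.0], [1.0]), ([1.0], [1.0])]
d, z = sample_skill(probs, params, random.Random(0))
print(d, z, joint_log_prob(d, z, probs, params))
```

The sampled pair $(d, z)$ would then be handed to the frozen skill decoder, which produces the low-level actions.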

Policy learning. We can train the skill-based policy with online data collection using any entropy-regularized RL algorithm, such as SAC [57], where we regularize against the skill priors instead of against a max-entropy uniform prior. Because we have factorized $\pi(d,z\mid s)$ into two separate policies, we can easily regularize each with the priors trained in Section 4.2. The training objective for the policy with SAC is to maximize over $\pi_{d},\pi_{z}$:

$$\underset{\begin{subarray}{c}s,\,d\sim\pi_{d}(\cdot\mid s)\\ z\sim\pi_{z}(\cdot\mid s,d)\end{subarray}}{\mathbb{E}}\Big[Q(s,z,d)-\alpha_{z}\underbrace{\text{KL}(\pi_{z}(z\mid s,d)\parallel p_{z}(\cdot\mid s,d))}_{\text{skill argument guidance}}-\alpha_{d}\underbrace{\text{KL}(\pi_{d}(d\mid s)\parallel p_{d}(\cdot\mid s))}_{\text{skill choice guidance}}\Big], \tag{4}$$

where $\alpha_{z}$ and $\alpha_{d}$ control the prior regularization weights. The critic objective is also correspondingly modified (see Appendix Algorithm 4).
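The two guidance terms in Eq. (4) can be evaluated in closed form when $\pi_d$ and $p_d$ are categorical and $\pi_z$ and $p_z$ are diagonal Gaussians. The sketch below assumes exactly that and computes the bracketed quantity for one state; the Q-value is treated as a given number, and the function names and toy values are illustrative.

```python
import math
from typing import Sequence


def categorical_kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two categorical distributions over skill IDs."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def gaussian_kl(
    mu_p: Sequence[float], sigma_p: Sequence[float],
    mu_q: Sequence[float], sigma_q: Sequence[float],
) -> float:
    """KL( N(mu_p, diag(sigma_p^2)) || N(mu_q, diag(sigma_q^2)) )."""
    return sum(
        math.log(sq / sp) + (sp * sp + (mp - mq) ** 2) / (2.0 * sq * sq) - 0.5
        for mp, sp, mq, sq in zip(mu_p, sigma_p, mu_q, sigma_q)
    )


def regularized_value(q_value: float, kl_z: float, kl_d: float, alpha_z: float, alpha_d: float) -> float:
    """Q(s, z, d) - alpha_z * KL_z - alpha_d * KL_d for one sampled (s, d, z)."""
    return q_value - alpha_z * kl_z - alpha_d * kl_d


kl_d = categorical_kl([0.7, 0.3], [0.5, 0.5])    # policy vs. skill selection prior
kl_z = gaussian_kl([1.0], [1.0], [0.0], [1.0])   # policy vs. skill argument prior
print(kl_d, kl_z, regularized_value(2.0, kl_z, kl_d, alpha_z=1.0, alpha_d=1.0))
```

Both KL terms vanish when the policy matches the corresponding prior, so the penalty only grows as the policy departs from behaviors seen in the offline data; larger $\alpha_z$ or $\alpha_d$ keep it closer to the priors.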

In summary, \method first extracts a set of discrete skills from offline image-action data (Section 4.1), then trains an action decoder to take low-level actions in the environment conditioned on a discrete skill and continuous latent (Section 4.2), and finally performs prior-guided reinforcement learning over these skills online in the target environment to learn new tasks (Section 4.3). See Algorithm 1 (appendix) for the pseudocode and Section B.1 for additional implementation details.

5 Experiments

Our experiments investigate the following questions: (1) Does \method discover meaningful, well-aligned skills from offline data? (2) Do \method-acquired skills help robots learn new tasks? (3) What components of \method are important in enabling transfer?

5.1 Experimental Setup

We evaluate \method on two long-horizon, continuous-control, robotic manipulation domains: Franka Kitchen [58] and LIBERO [59]. All environments use image observations and sparse rewards. For both Franka Kitchen and LIBERO, our method \method uses the R3M VLM [47] and K-means with $K=8$ for offline skill extraction (Section 4.1). We list specific details below; see Section B.3 for more.

Franka Kitchen: This environment, originally from Gupta et al. [22] and Fu et al. [58], contains a Franka Panda arm operating in a kitchen environment. Similarly to Pertsch et al. [6], we test transfer learning on a sequence of 4 subtasks that are never performed in that sequence in the dataset. Agents are given a reward of 1 for completing each subtask.

LIBERO: LIBERO [59] consists of a Franka Panda arm interacting with many objects and drawers. We test transfer to four task suites, LIBERO-{Object, Spatial, Goal, 10}, consisting of 10 unseen environments/tasks each and spanning various transfer scenarios (40 total tasks). LIBERO tasks are language conditioned (e.g., “turn on the stove and put the moka pot on it”); for pre-training and RL, we condition all methods on the language instruction. Due to LIBERO’s difficulty [60], for all pre-trained methods, we first fine-tune on a provided additional target task dataset with 50 demos per task before performing RL. During RL, we fine-tune on all tasks within each suite simultaneously. To the best of our knowledge, we are the first to report successful RL results on LIBERO tasks.

Baselines and Comparisons. We compare against: (1) an oracle (RAPS [9]), which is given ground truth discrete skills with continuous input arguments designed by humans specifically for the Franka Kitchen environment; (2) methods that pre-train with the same data—namely SPiRL [6] which extracts sequences of fixed-length random action trajectories as skills, and BC, behavior cloning using the same offline data but no temporally extended skills; and (3) SAC [57], i.e., RL without any offline data. See Appendix B for further implementation details. Unless otherwise stated, all reported experimental results are means and standard deviations over 5 seeds.

Figure 4: 100 randomly sampled trajectories from the Franka Kitchen dataset after being clustered into skills and visualized in 2D (originally 2048) with PCA. Even in 2 dimensions, clusters can be clearly distinguished. We visualize 2 randomly sampled skills in each cluster, demonstrating that our skill assignment mechanism successfully aligns trajectories performing similar high-level behaviors.

5.2 Offline Skill Extraction

We first test \method’s ability to discover meaningful, well-aligned skills during skill extraction. In Figure 4, we plot K-means ($K=8$) skill assignments in Franka Kitchen, projecting VLM embedding differences down to 2-D with PCA for visualization. These skill assignments demonstrate that unsupervised clustering of VLM embedding differences yields clearly separable clusters. For example, skill 4 (Figure 4, top left) demonstrates a cabinet opening behavior. See additional visualizations for all environments in Section C.1 and quantitative clustering statistics in Section C.2. Next, we examine how these skills help with learning new tasks.

5.3 Online Reinforcement Learning of New Tasks

Figure 5: \method outperforms SPiRL in online RL across all comparisons, demonstrating the advantages of our semantically aligned skill-space for RL. SAC and BC struggle, demonstrating the need for skill-based RL. In LIBERO-{Object, Spatial, Goal}, return is equal to success rate.

We investigate the ability of all methods to transfer to new tasks in Figure 5. In Kitchen, \method matches the oracle performance while being 10x more sample-efficient than SPiRL: SPiRL needs 3M timesteps to reach the performance that \method attains at ~300k. While SPiRL performed well in Franka Kitchen in its original paper using ground truth environment states instead of RGB images, it struggles in our much more challenging image-based experiments. In all LIBERO task suites, \method performs best; its margin over SPiRL is largest in LIBERO-10, the task suite with the longest-horizon tasks. Meanwhile, SAC and BC perform poorly, indicating that our tasks are difficult to solve with just standard RL or with offline data but no skills.

This improvement of our method over SPiRL comes from the semantically aligned, discrete skill-space that \method learns instead of SPiRL’s continuous, randomly assigned skill-space. For example, to open a drawer, \method can learn to select a single discrete drawer-opening skill with minor argument modifications when its gripper is near that drawer. With SPiRL, the robot must memorize the continuous skill representing each type of drawer-opening behavior for each way to open a drawer; these continuous skills must also be distinguished from others for completely different behaviors. \method also enables easier exploration later in the task; the policy can easily try the same discrete skill if the robot hand is near a different drawer that needs to be opened. For further analysis, see Appendix E. Next, we perform an ablation study on \method components.
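The two-level selection described above (first pick a discrete skill, then pick its continuous argument) can be sketched as follows. This is a minimal illustration that assumes a categorical distribution over skills and a per-skill diagonal Gaussian over arguments; in \method both distributions are state-conditioned networks, whereas here they are fixed hypothetical tables for a single state.

```python
import random

def sample_skill(skill_probs, rng):
    """Sample a discrete skill index d from categorical probabilities pi_d(. | s)."""
    return rng.choices(range(len(skill_probs)), weights=skill_probs, k=1)[0]

def sample_argument(means, stds, d, rng):
    """Sample a continuous argument z from a per-skill diagonal Gaussian pi_z(. | s, d)."""
    return [rng.gauss(m, s) for m, s in zip(means[d], stds[d])]

rng = random.Random(0)

# Hypothetical policy outputs for one state s (not from the paper).
skill_probs = [0.1, 0.8, 0.1]                   # pi_d(. | s) over 3 skills
means = [[0.0, 0.0], [0.5, -0.5], [1.0, 1.0]]   # per-skill argument means
stds = [[0.1, 0.1], [0.1, 0.1], [0.1, 0.1]]     # per-skill argument std devs

d = sample_skill(skill_probs, rng)        # which skill to execute
z = sample_argument(means, stds, d, rng)  # how to modulate that skill
```

In this sketch, reusing a skill near a different drawer only requires sampling the same index `d` with a slightly different `z`, which mirrors the exploration argument made above.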

5.4 \method RL Ablation Studies

Figure 6: Embedding ablations.

VLMs. We first ablate the use of VLMs for selecting the features used in clustering. We compare against Action, where skill labels are generated by clustering robot action differences, and against State, where skills are labeled by clustering ground truth state differences (e.g., robot joints, states of all objects). State represents an oracle scenario, as ground truth states of all relevant objects are difficult to obtain in the real world. We plot results in Franka Kitchen in Figure 6. \method with VLM-extracted skills performs best, as high-level, semantically meaningful skills can be difficult to obtain directly from either ground truth state differences or raw environment action differences.

Figure 7: Kitchen $K$ ablations.

Number of Clusters. Finally, we ablate the number of K-means clusters. Intuitively, too few or too many clusters can make downstream RL more difficult, since the number of clusters trades off the ease of selecting the correct discrete skill against the difficulty of deciding the correct continuous argument for that skill. In Figure 7, we plot average returns at 1M timesteps of \method in Kitchen with $K=3,5,8,15$. Final returns are relatively constant, with performance dropping only at $K=15$. This indicates that \method is robust to the number of discrete skills discovered without supervision.

6 Discussion and Limitations

We presented \method, a method for enabling efficient agent transfer learning by extracting a discrete set of input-argument parameterized skills from offline data for a robot to use in new tasks. Compared to standard RL, our method operates over temporally extended skills rather than low-level environment actions, providing greater flexibility and transferability to new tasks. Our experiments demonstrated that \method performs well on 41 tasks across 2 robot manipulation domains.

Limitations. While \method enables efficient transfer learning, it still requires an initial dataset, collected in environments similar to the target environments, from which to learn skills. It would be useful to extend \method to data from other robots or from environments significantly different from the target environment to ease the data collection burden, possibly with sim-to-real techniques [61]. Furthermore, in future work, we plan to combine our method with offline RL [62, 63, 64, 65, 66, 67] to learn skills from suboptimal data without the need to interact with an environment, targeting even greater sample efficiency. Finally, \method requires image observations for the VLMs; skill learning from more input modalities would be interesting future work.

Acknowledgments

The majority of this work was performed while Jesse Zhang and Zuxin Liu were interns at Amazon Web Services. After the internships, this work was supported by a USC Viterbi Fellowship, compute infrastructure from AWS, Institute of Information & Communications Technology Planning & Evaluation (IITP) grants (No.RS-2019-II190075, Artificial Intelligence Graduate School Program, KAIST; No.RS-2022-II220984, Development of Artificial Intelligence Technology for Personalized Plug-and-Play Explanation and Verification of Explanation), a National Research Foundation of Korea (NRF) grant (NRF-2021H1D3A2A03103683, Brain Pool Research Program) funded by the Korean government (MSIT), Electronics and Telecommunications Research Institute (ETRI) grant funded by the Korean government foundation [24ZB1200, Research of Human-centered autonomous intelligence system original technology], and Samsung Electronics Co., Ltd (IO220816-02015-01). Finally, we thank Laura Smith and Sidhant Kaushik for their valuable feedback on early versions of the paper draft.

References

  • Fitts and Posner [1967] P. Fitts and M. Posner. Human Performance. Basic concepts in psychology series. Brooks/Cole Publishing Company, 1967. ISBN 9780134452470.
  • Anderson [1982] J. R. Anderson. Acquisition of cognitive skill. Psychological Review, 89:369–406, 1982.
  • Sutton et al. [1999] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1):181–211, 1999. ISSN 0004-3702.
  • Schaal [2006] S. Schaal. Dynamic movement primitives–a framework for motor control in humans and humanoid robotics. Adaptive Motion of Animals and Machines, 01 2006.
  • Hausman et al. [2018] K. Hausman, J. T. Springenberg, Z. Wang, N. Heess, and M. Riedmiller. Learning an embedding space for transferable robot skills. In ICLR, 2018.
  • Pertsch et al. [2020] K. Pertsch, Y. Lee, and J. J. Lim. Accelerating reinforcement learning with learned skill priors. In Conference on Robot Learning (CoRL), 2020.
  • Zhang et al. [2021] J. Zhang, K. Pertsch, J. Yang, and J. J. Lim. Minimum description length skills for accelerated reinforcement learning. In Self-Supervision for Reinforcement Learning Workshop - ICLR 2021, 2021.
  • Ajay et al. [2021] A. Ajay, A. Kumar, P. Agrawal, S. Levine, and O. Nachum. OPAL: Offline primitive discovery for accelerating offline reinforcement learning. In International Conference on Learning Representations, 2021.
  • Dalal et al. [2021] M. Dalal, D. Pathak, and R. Salakhutdinov. Accelerating robotic reinforcement learning via parameterized action primitives. In NeurIPS, 2021.
  • Nasiriany et al. [2022] S. Nasiriany, H. Liu, and Y. Zhu. Augmenting reinforcement learning with behavior primitives for diverse manipulation tasks. In IEEE International Conference on Robotics and Automation (ICRA), 2022.
  • Zhang et al. [2023] G. Zhang, A. Jain, I. Hwang, S.-H. Sun, and J. J. Lim. Every policy has something to share: Efficient multi-task reinforcement learning via selective behavior sharing, 2023.
  • Zhang et al. [2024] J. Zhang, K. Pertsch, J. Zhang, and J. J. Lim. Sprint: Scalable policy pre-training via language instruction relabeling. In International Conference on Robotics and Automation, 2024.
  • Schmidhuber et al. [1997] J. Schmidhuber, J. Zhao, and M. Wiering. Shifting inductive bias with success-story algorithm, adaptive levin search, and incremental self-improvement. Machine Learning, 28(1):105–130, Jul 1997. ISSN 1573-0565.
  • Thrun [1996] S. Thrun. Is learning the n-th thing any easier than learning the first? In Advances in neural information processing systems, pages 640–646, 1996.
  • Taylor and Stone [2009] M. E. Taylor and P. Stone. Transfer learning for reinforcement learning domains: A survey. Journal of Machine Learning Research, 10(7), 2009.
  • Fakoor et al. [2020] R. Fakoor, P. Chaudhari, S. Soatto, and A. J. Smola. Meta-q-learning. In International Conference on Learning Representations, 2020.
  • Caccia et al. [2023] M. Caccia, J. Mueller, T. Kim, L. Charlin, and R. Fakoor. Task-agnostic continual reinforcement learning: Gaining insights and overcoming challenges. In Conference on Lifelong Learning Agents, 2023.
  • Singh et al. [2021] A. Singh, H. Liu, G. Zhou, A. Yu, N. Rhinehart, and S. Levine. Parrot: Data-driven behavioral priors for reinforcement learning. ICLR, 2021.
  • Zhang et al. [2021] J. Zhang, H. Yu, and W. Xu. Hierarchical reinforcement learning by discovering intrinsic options. In International Conference on Learning Representations, 2021.
  • Lee et al. [2018] Y. Lee, S.-H. Sun, S. Somasundaram, E. S. Hu, and J. J. Lim. Composing complex skills by learning transition policies. In International Conference on Learning Representations, 2018.
  • Shiarlis et al. [2018] K. Shiarlis, M. Wulfmeier, S. Salter, S. Whiteson, and I. Posner. Taco: Learning task decomposition via temporal alignment for control. ICML, 2018.
  • Gupta et al. [2019] A. Gupta, V. Kumar, C. Lynch, S. Levine, and K. Hausman. Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning. CoRL, 2019.
  • Radford et al. [2021] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision, 2021.
  • Sontakke et al. [2023] S. A. Sontakke, J. Zhang, S. Arnold, K. Pertsch, E. Biyik, D. Sadigh, C. Finn, and L. Itti. RoboCLIP: One demonstration is enough to learn robot policies. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • Pastor et al. [2009] P. Pastor, H. Hoffmann, T. Asfour, and S. Schaal. Learning and generalization of motor skills by learning from demonstration. In 2009 IEEE International Conference on Robotics and Automation, pages 763–768. IEEE, 2009.
  • Lin et al. [2024] H. Lin, R. Corcodel, and D. Zhao. Generalize by touching: Tactile ensemble skill transfer for robotic furniture assembly, 2024.
  • Oh et al. [2017] J. Oh, S. Singh, H. Lee, and P. Kohli. Zero-shot task generalization with multi-task deep reinforcement learning. In International Conference on Machine Learning, 2017.
  • Xu et al. [2018] D. Xu, S. Nair, Y. Zhu, J. Gao, A. Garg, L. Fei-Fei, and S. Savarese. Neural task programming: Learning to generalize across hierarchical tasks. In International Conference on Robotics and Automation, 2018.
  • Andreas et al. [2017] J. Andreas, D. Klein, and S. Levine. Modular multitask reinforcement learning with policy sketches. In International Conference on Machine Learning, pages 166–175. PMLR, 2017.
  • Kipf et al. [2019] T. Kipf, Y. Li, H. Dai, V. Zambaldi, E. Grefenstette, P. Kohli, and P. Battaglia. Compositional imitation learning: Explaining and executing one task at a time. ICML, 2019.
  • Shankar et al. [2019] T. Shankar, S. Tulsiani, L. Pinto, and A. Gupta. Discovering motor programs by recomposing demonstrations. In ICLR, 2019.
  • Merel et al. [2020] J. Merel, S. Tunyasuvunakool, A. Ahuja, Y. Tassa, L. Hasenclever, V. Pham, T. Erez, G. Wayne, and N. Heess. Catch & carry: Reusable neural controllers for vision-guided whole-body tasks. ACM Trans. Graph., 2020.
  • Shankar and Gupta [2020] T. Shankar and A. Gupta. Learning robot skills with temporal variational inference. In International Conference on Machine Learning, pages 8624–8633. PMLR, 2020.
  • Lynch et al. [2020] C. Lynch, M. Khansari, T. Xiao, V. Kumar, J. Tompson, S. Levine, and P. Sermanet. Learning latent plans from play. In Conference on Robot Learning, pages 1113–1132, 2020.
  • Shi et al. [2022] L. X. Shi, J. J. Lim, and Y. Lee. Skill-based model-based reinforcement learning. In Conference on Robot Learning, 2022.
  • Wan et al. [2021] W. Wan, Y. Zhu, R. Shah, and Y. Zhu. Lotus: Continual imitation learning for robot manipulation through unsupervised skill discovery. In 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021.
  • Nair et al. [2021] A. Nair, A. Gupta, M. Dalal, and S. Levine. Awac: Accelerating online reinforcement learning with offline datasets, 2021.
  • Kumar et al. [2022] A. Kumar, J. Hong, A. Singh, and S. Levine. When should we prefer offline reinforcement learning over behavioral cloning? arXiv preprint arXiv:2204.05618, 2022.
  • Zheng et al. [2022] Q. Zheng, A. Zhang, and A. Grover. Online decision transformer, 2022.
  • Bacon et al. [2017] P.-L. Bacon, J. Harb, and D. Precup. The option-critic architecture. In Association for the Advancement of Artificial Intelligence, 2017.
  • Nachum et al. [2018] O. Nachum, S. S. Gu, H. Lee, and S. Levine. Data-efficient hierarchical reinforcement learning. In Neural Information Processing Systems, 2018.
  • Eysenbach et al. [2019] B. Eysenbach, A. Gupta, J. Ibarz, and S. Levine. Diversity is all you need: Learning skills without a reward function. In ICLR, 2019.
  • Warde-Farley et al. [2019] D. Warde-Farley, T. V. de Wiele, T. Kulkarni, C. Ionescu, S. Hansen, and V. Mnih. Unsupervised control through non-parametric discriminative rewards. In ICLR, 2019.
  • Gregor et al. [2019] K. Gregor, G. Papamakarios, F. Besse, L. Buesing, and T. Weber. Temporal difference variational auto-encoder. In International Conference on Learning Representations, 2019.
  • Sharma et al. [2020] A. Sharma, S. Gu, S. Levine, V. Kumar, and K. Hausman. Dynamics-aware unsupervised discovery of skills. In ICLR, 2020.
  • Laskin et al. [2022] M. Laskin, H. Liu, X. B. Peng, D. Yarats, A. Rajeswaran, and P. Abbeel. Cic: Contrastive intrinsic control for unsupervised skill discovery, 2022.
  • Nair et al. [2022] S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta. R3m: A universal visual representation for robot manipulation. In CoRL, 2022.
  • Xiao et al. [2022] T. Xiao, I. Radosavovic, T. Darrell, and J. Malik. Masked visual pre-training for motor control. arXiv preprint arXiv:2203.06173, 2022.
  • Ma et al. [2023] Y. J. Ma, W. Liang, V. Som, V. Kumar, A. Zhang, O. Bastani, and D. Jayaraman. Liv: Language-image representations and rewards for robotic control, 2023.
  • Cui et al. [2022] Y. Cui, S. Niekum, A. Gupta, V. Kumar, and A. Rajeswaran. Can foundation models perform zero-shot task specification for robot manipulation? In R. Firoozi, N. Mehr, E. Yel, R. Antonova, J. Bohg, M. Schwager, and M. Kochenderfer, editors, Proceedings of The 4th Annual Learning for Dynamics and Control Conference, volume 168 of Proceedings of Machine Learning Research, pages 893–905. PMLR, 23–24 Jun 2022.
  • Rocamonde et al. [2023] J. Rocamonde, V. Montesinos, E. Nava, E. Perez, and D. Lindner. Vision-language models are zero-shot reward models for reinforcement learning. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023.
  • Wang et al. [2024] Y. Wang, Z. Sun, J. Zhang, Z. Xian, E. Biyik, D. Held, and Z. Erickson. Rl-vlm-f: Reinforcement learning from vision language foundation model feedback. In International conference on machine learning, 2024.
  • Anguera et al. [2012] X. Anguera, S. Bozonnet, N. Evans, C. Fredouille, G. Friedland, and O. Vinyals. Speaker diarization: A review of recent research. IEEE Transactions on audio, speech, and language processing, 20(2):356–370, 2012.
  • Kingma and Welling [2014] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations, 2014.
  • Higgins et al. [2016] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2016.
  • Uchendu et al. [2023] I. Uchendu, T. Xiao, Y. Lu, B. Zhu, M. Yan, J. Simon, M. Bennice, C. Fu, C. Ma, J. Jiao, S. Levine, and K. Hausman. Jump-start reinforcement learning, 2023.
  • Haarnoja et al. [2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. ICML, 2018.
  • Fu et al. [2020] J. Fu, A. Kumar, O. Nachum, G. Tucker, and S. Levine. D4rl: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
  • Liu et al. [2023] B. Liu, Y. Zhu, C. Gao, Y. Feng, Q. Liu, Y. Zhu, and P. Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning, 2023.
  • Liu et al. [2024] Z. Liu, J. Zhang, K. Asadi, Y. Liu, D. Zhao, S. Sabach, and R. Fakoor. TAIL: Task-specific adapters for imitation learning with large pretrained models. In ICLR, 2024.
  • Zhang et al. [2021] G. Zhang, L. Zhong, Y. Lee, and J. J. Lim. Policy transfer across visual and dynamics domain gaps via iterative grounding. In Robotics: Science and Systems, 2021.
  • Fujimoto et al. [2019] S. Fujimoto, D. Meger, and D. Precup. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning, pages 2052–2062, 2019.
  • Peng et al. [2019] X. B. Peng, A. Kumar, G. Zhang, and S. Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning, 2019.
  • Levine et al. [2020] S. Levine, A. Kumar, G. Tucker, and J. Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020.
  • Singh et al. [2020] A. Singh, A. Yu, J. Yang, J. Zhang, A. Kumar, and S. Levine. Cog: Connecting new skills to past experience with offline reinforcement learning. CoRL, 2020.
  • Fakoor et al. [2021] R. Fakoor, J. W. Mueller, K. Asadi, P. Chaudhari, and A. J. Smola. Continuous doubly constrained batch reinforcement learning. Advances in Neural Information Processing Systems, 34:11260–11273, 2021.
  • Liu et al. [2023] Y. Liu, P. Chaudhari, and R. Fakoor. Budgeting counterfactual for offline rl. In Advances in Neural Information Processing Systems, volume 36, pages 5729–5751, 2023.
  • Virtanen et al. [2020] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020.
  • Christodoulou [2019] P. Christodoulou. Soft actor-critic for discrete action settings, 2019.
  • Zhu et al. [2020] Y. Zhu, J. Wong, A. Mandlekar, R. Martín-Martín, A. Joshi, S. Nasiriany, and Y. Zhu. robosuite: A modular simulation framework and benchmark for robot learning. arXiv preprint arXiv:2009.12293, 2020.
  • Reimers and Gurevych [2019] N. Reimers and I. Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 11 2019.
  • Sutton and Barto [2018] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.

Appendix A Full Algorithm

Algorithm 1 \method Algorithm, Section 4.
1: Require: Dataset $\mathcal{D}$, VLM, target MDP $\mathcal{M}$, optional target task fine-tuning dataset $\mathcal{D}_{\mathcal{M}}$
2: $\mathcal{D}_d, CM \leftarrow$ OfflineSkillExtraction($\mathcal{D}$, VLM) ▷ Get discrete skill labels and clustering model, Algorithm 2
3: Init $q(z \mid \bar{a}, d)$, $p_a(\bar{a} \mid z, d)$, $p_d(d \mid s)$, $p_z(z \mid s, d)$ ▷ Skill argument encoder, skill decoder, discrete skill prior, continuous argument prior
4: $q, p_a, p_d, p_z \leftarrow$ OfflineSkillLearning($\mathcal{D}_d$, $q$, $p_a$, $p_d$, $p_z$) ▷ Learn skills offline, Algorithm 3
5: if $\mathcal{D}_{\mathcal{M}}$ exists then
6:     $\mathcal{D}_{\mathcal{M},d} \leftarrow$ Assign skills to $\mathcal{D}_{\mathcal{M}}$ with existing clustering model $CM$
7:     $q, p_a, p_d, p_z \leftarrow$ OfflineSkillLearning($\mathcal{D}_{\mathcal{M},d}$, $q$, $p_a$, $p_d$, $p_z$) ▷ Optionally fine-tune on target task $\mathcal{M}$
8: end if
9: SkillBasedOnlineRL($\mathcal{M}$, $p_a$, $p_d$, $p_z$) ▷ RL on target task $\mathcal{M}$, Algorithm 4
Algorithm 2 Offline Skill Extraction, Section 4.1.
1: procedure OfflineSkillExtraction($\mathcal{D}$, VLM)
2:     Embeds $\leftarrow []$ ▷ Init VLM embedding differences
3:     for trajectory $\tau = [(s_1, a_1), \ldots, (s_T, a_T)]$ in $\mathcal{D}$ do
4:         for $(s_i, a_i)$ in $\tau$ do
5:             $e_i = \text{VLM}(s_i) - \text{VLM}(s_1)$ ▷ Embedding differences, Equation 1
6:             Embeds.append($e_i$)
7:         end for
8:     end for
9:     $CM \leftarrow$ Init (K-Means) clustering model
10:     Labels $\leftarrow CM$(Embeds) ▷ Run unsupervised clustering to get cluster labels
11:     $\mathcal{D}_d \leftarrow \{\}$ ▷ Init skill labeled dataset
12:     for trajectory $\tau = [(s_1, a_1), \ldots, (s_T, a_T)]$ in $\mathcal{D}$ do
13:         $d_1, \ldots, d_T \leftarrow$ Get labels from Labels
14:         $d_1, \ldots, d_T \leftarrow$ MedianFilter($d_1, \ldots, d_T$) ▷ Smooth out labels, see Section B.1
15:         $\mathcal{D}_d \leftarrow \mathcal{D}_d \cup [(s_1, a_1, d_1), \ldots, (s_T, a_T, d_T)]$
16:     end for
17:     return $\mathcal{D}_d$, $CM$
18: end procedure
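
The steps of Algorithm 2 can be sketched in plain Python as below. This is a simplified illustration under several assumptions: per-frame VLM embeddings are taken as precomputed lists of floats (no actual VLM is called), K-means is a bare-bones implementation with random initialization, and the median filter uses a hypothetical window of 3 with truncated edges. For brevity the usage example clusters a single trajectory, whereas Algorithm 2 pools embedding differences from all trajectories before clustering.

```python
import random
from statistics import median_low

def embedding_differences(embeddings):
    """e_i = emb(s_i) - emb(s_1) for one trajectory of per-frame embeddings."""
    first = embeddings[0]
    return [[x - f for x, f in zip(e, first)] for e in embeddings]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda k: sq_dist(p, centroids[k]))
            for p in points]

def kmeans(points, k, iters=20, seed=0):
    """Minimal K-means: returns k centroids after a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids

def median_filter(labels, window=3):
    """Smooth a label sequence with a sliding median (window truncated at the edges)."""
    half = window // 2
    out = []
    for i in range(len(labels)):
        lo, hi = max(0, i - half), min(len(labels), i + half + 1)
        out.append(median_low(labels[lo:hi]))
    return out

# Hypothetical 2-D "embeddings" for one 4-frame trajectory (not real VLM outputs).
trajectory = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]]
diffs = embedding_differences(trajectory)
centroids = kmeans(diffs, k=2)
skill_labels = median_filter(assign(diffs, centroids))
```

The median filter removes isolated label flips, so a single misassigned frame in the middle of a skill segment does not split that segment in two.
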
Algorithm 3 Offline Skill Learning, Section 4.2.
1: procedure OfflineSkillLearning($\mathcal{D}_d$, $q$, $p_a$, $p_d$, $p_z$)
2:     while not converged do
3:         Sample $\tau_d$ from $\mathcal{D}_d$
4:         Train $q, p_a, p_d, p_z$ with Equation 2
5:     end while
6:     return $q, p_a, p_d, p_z$
7: end procedure
Algorithm 4 Skill-Based Online RL (with SAC [57]), Section 4.3. Red marks policy and critic loss differences against SAC.
1:procedure SkillBasedOnlineRL($\mathcal{M}$, $p_a(\bar{a} \mid z, d)$, $p_d(d \mid s)$, $p_z(z \mid s, d)$) $\triangleright$ Section 4.3
2:     Freeze $p_a(\bar{a} \mid z, d)$, $p_d$, $p_z$ weights
3:     $\pi_d(d \mid s) \leftarrow p_d(d \mid s)$ $\triangleright$ Init $\pi_d$ as discrete skill prior $p_d$
4:     $\pi_z(z \mid s, d) \leftarrow p_z(z \mid s, d)$ $\triangleright$ Init $\pi_z$ as cont. argument prior $p_z$
5:     $B \leftarrow \{\}$ $\triangleright$ Init buffer $B$
6:     for each rollout do
7:         $l \leftarrow 0$
8:         $d_t \sim \pi_d(d \mid s_t)$ $\triangleright$ Sample discrete skill
9:         $z_t \sim \pi_z(z \mid s_t, d_t)$ $\triangleright$ Sample continuous argument for skill
10:         $a_1, \dots, a_L, l_1, \dots, l_L \leftarrow \bar{a} \sim p_a(\bar{a} \mid z_t, d_t)$ $\triangleright$ Sample action sequence $a_1, \dots, a_L$ and progress predictions $l_1, \dots, l_L$ up to max sequence length $L$
11:         for $a$ in $a_1, \dots, a_L$ or until $l \geq 1$ do
12:              Execute actions in $\mathcal{M}$, accumulating reward sum $\tilde{r}_t$
13:         end for
14:         $B \leftarrow B \cup \{s_t, z_t, \tilde{r}_t, s_{t'}\}$ $\triangleright$ Add sample to buffer
15:         $(s, z, \tilde{r}, s') \sim B$ $\triangleright$ Sample from $B$
16:         $\pi_d, \pi_z \leftarrow \underset{\pi_d, \pi_z}{\max}\; Q(s, z, d)$
17:             $-{\color{red}\alpha_z \text{KL}(\pi_z(z \mid s, d) \parallel p_z(\cdot \mid s, d))}$
18:             $-{\color{red}\alpha_d \text{KL}(\pi_d(d \mid s) \parallel p_d(\cdot \mid s))}$ $\triangleright$ Update policies, Equation 4
19:         $Q \leftarrow \min_Q\; Q(s, z, d) = r(s, z, d) + \gamma Q(s', z', d')$
20:             $-{\color{red}\alpha_z D_{KL}(\pi(z \mid s, d) \parallel p_z(\cdot \mid s, d))}$
21:             $-{\color{red}\alpha_d D_{KL}(\pi(d \mid s) \parallel p_d(\cdot \mid s))}$ $\triangleright$ Update critic
22:     end for
23:end procedure

We present the full \method pseudocode in Algorithm 1. Algorithm 2 details offline skill extraction using a VLM, Algorithm 3 details the offline skill learning procedure, and Algorithm 4 details how to perform online skill-based RL on downstream tasks using Soft Actor-Critic (SAC). Note that any entropy-regularized algorithm can be used here with similar modifications, not just SAC. Differences from SAC during online RL are highlighted in red. For further implementation details and hyperparameters of \method, see Section B.1.
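To make the rollout portion of Algorithm 4 concrete, the following is a minimal Python sketch of how skill-level transitions could be collected. `ToyEnv`, `select_skill`, and `decode` are hypothetical stand-ins (they are not part of \method) for the environment $\mathcal{M}$, the policies $\pi_d, \pi_z$, and the frozen decoder $p_a$. The stored tuple additionally keeps $d$, which is an assumption made here so that the critic input $Q(s, z, d)$ is fully specified.

```python
class ToyEnv:
    """Hypothetical 1-D task: sparse reward of 1 when the position reaches the goal."""

    def __init__(self, goal=5):
        self.goal = goal
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        self.pos += action
        done = self.pos >= self.goal
        return self.pos, (1.0 if done else 0.0), done


def execute_skill(env, state, actions, progress):
    """Run one decoded skill, stopping once the predicted progress l reaches 1."""
    reward_sum, done = 0.0, False
    for a, l in zip(actions, progress):
        state, reward, done = env.step(a)
        reward_sum += reward  # accumulate reward over the whole skill
        if done or l >= 1.0:
            break
    return state, reward_sum, done


def rollout(env, select_skill, decode, max_skills=10):
    """Collect skill-level transitions (s_t, d_t, z_t, reward sum, s_t')."""
    buffer = []
    state = env.reset()
    for _ in range(max_skills):
        d, z = select_skill(state)        # stand-in for pi_d and pi_z
        actions, progress = decode(z, d)  # stand-in for the frozen decoder p_a
        next_state, reward_sum, done = execute_skill(env, state, actions, progress)
        buffer.append((state, d, z, reward_sum, next_state))
        state = next_state
        if done:
            break
    return buffer


# Stub policy and decoder: every skill proposes 4 unit steps but predicts
# completion (l >= 1) after the second one.
transitions = rollout(
    ToyEnv(goal=5),
    select_skill=lambda s: (0, [0.0]),
    decode=lambda z, d: ([1, 1, 1, 1], [0.5, 1.0, 1.0, 1.0]),
)
```

In this toy run, three skill-level transitions cover five low-level environment steps, and only the last transition carries the sparse reward.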

Appendix B Experiment and Implementation Details

In this section, we list implementation details for \method (Section B.1), details for how we implemented baselines (Section B.2), and the specific environment setups (Section B.3).

B.1 \method Implementation Details

\method implementation details follow in the same order as each method subsection was presented in the main paper in Section 4.

B.1.1 Offline Skill Extraction

We first extract skills from a dataset $\mathcal{D}$ using a VLM by clustering VLM embedding differences of image observations in $\mathcal{D}$ (see pseudocode in Algorithm 2).

Clustering.

We use K-means as the clustering algorithm because it is performant, time-efficient, and easy to run in a batched manner if the embeddings are too large to fit in memory at once. When extracting skills from the offline dataset $\mathcal{D}$, we run K-means on VLM embedding differences with $K=8$ in both Franka Kitchen and LIBERO. We found $K=8$ to produce the most visually pleasing clustering assignments in Franka Kitchen, and we directly adopted the Franka Kitchen hyperparameters for LIBERO to avoid too much environment-specific tuning.
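As an illustration of this clustering step, here is a small standard-library Python sketch of Lloyd's K-means applied to embedding differences. It assumes that differences are taken relative to the first frame of each trajectory (Section 4.1 gives the exact definition), uses toy 2-D vectors in place of real VLM embeddings, and uses $K=2$ rather than $K=8$ to keep the example readable. All function names are hypothetical, and a practical pipeline would more likely rely on an optimized library implementation.

```python
import random


def embedding_differences(embeddings):
    """Subtract the first frame's embedding from every frame (assumed convention)."""
    first = embeddings[0]
    return [[x - f for x, f in zip(e, first)] for e in embeddings]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda k: squared_distance(point, centroids[k]))


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        labels = [nearest(p, centroids) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old centroid if a cluster received no points.
            updated.append(
                [sum(c) / len(members) for c in zip(*members)] if members else centroids[j]
            )
        if updated == centroids:
            break
        centroids = updated
    return centroids, [nearest(p, centroids) for p in points]


# Toy 2-D "embeddings" of one trajectory: three frames of one behavior,
# followed by three frames of a clearly different behavior.
frames = [[1.0, 1.0], [1.1, 1.0], [1.2, 1.1], [6.0, 6.0], [6.1, 6.0], [6.2, 6.1]]
diffs = embedding_differences(frames)
centroids, labels = kmeans(diffs, k=2)
```

The first three frames end up sharing one label and the last three share the other; which cluster receives which integer ID depends on the random initialization.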

Median Filtering.

After performing K-means, we apply a standard median filter, as is commonly done in classical speaker diarization [53], to smooth out any possibly noisy assignments (see Figure 3). Specifically, we use the SciPy scipy.signal.medfilt(kernel_size=7) [68] filter for all environments. This corresponds to a median filter with window size 7 that slides over each trajectory’s labels and replaces each label with the median of the labels in the window centered on it. Empirically, we found that this increased the average length of skills as it reduced the occurrence of short, noisy assignments.
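The effect of the filter can be seen in a small standard-library sketch. This is not the SciPy routine itself: `median_filter_labels` is a hypothetical helper that truncates the window at trajectory boundaries, whereas scipy.signal.medfilt zero-pads the input, so the two can differ in the first and last few timesteps.

```python
from statistics import median_low


def median_filter_labels(labels, kernel_size=7):
    """Replace each label with the median of the window centered on it.

    Windows are truncated at the sequence boundaries (an assumption of this
    sketch); median_low keeps the output an actual label for even-sized windows.
    """
    if kernel_size % 2 == 0:
        raise ValueError("kernel_size must be odd")
    half = kernel_size // 2
    return [
        median_low(labels[max(0, t - half): t + half + 1])
        for t in range(len(labels))
    ]


# A trajectory labeled as skill 2, then skill 4, with one spurious "5".
noisy = [2, 2, 2, 5, 2, 2, 2, 2, 4, 4, 4, 4, 4, 4, 4]
smoothed = median_filter_labels(noisy)
```

The isolated label 5 is absorbed into the surrounding skill, leaving two contiguous segments, which is the mechanism by which filtering lengthens the extracted skills.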

B.1.2 Offline Skill Learning

Here, we train a VAE consisting of skill argument encoder $q(z \mid \bar{a}, d)$, skill decoder $p_a(\bar{a} \mid z, d)$, discrete skill prior $p_d(d \mid s)$, and continuous skill argument prior $p_z(z \mid s, d)$ (see pseudocode in Algorithm 3).

Model architectures.

We closely follow SPiRL’s model architecture implementations [6] as we build upon SPiRL. The encoder $q(z \mid \bar{a}, d)$ and decoder $p_a(\bar{a} \mid z, d)$ are implemented with recurrent neural networks. The skill priors are both standard multi-layer perceptrons. The skill argument space $z$ has 5 dimensions. In Kitchen and LIBERO, our $\beta$ for the $\beta$-VAE KL regularization term in Equation 2 is $0.001$.

Skill progress predictor.

During training, for GPU memory reasons, we sample skill trajectories with a maximum length, as is common when training autoregressive models. In Franka Kitchen, this is heuristically set to $30$ based on reconstruction losses, and in LIBERO, this is set to $40$. If a skill trajectory is longer than this maximum length, we simply sample a random contiguous sequence of the maximum length within the trajectory. To ensure that predicted action sequences stay in-distribution with what was seen during training, we also use these maximum lengths as maximum skill lengths during online RL; e.g., if a skill runs for $30$ timesteps in Franka Kitchen without stopping, we simply resample the next skill (see line 10 of Algorithm 4).
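The cropping rule described above can be sketched as follows. The helper name `crop_skill` and the dictionary `MAX_SKILL_LEN` are hypothetical; only the maximum lengths (30 and 40) come from the text.

```python
import random

# Maximum skill lengths stated above.
MAX_SKILL_LEN = {"kitchen": 30, "libero": 40}


def crop_skill(actions, max_len, rng=None):
    """Return the sequence unchanged if short enough, else a random contiguous crop."""
    rng = rng or random.Random()
    if len(actions) <= max_len:
        return list(actions)
    start = rng.randrange(len(actions) - max_len + 1)  # uniform over valid start indices
    return list(actions[start:start + max_len])


long_skill = list(range(100))  # stand-in for a 100-step action sequence
crop = crop_skill(long_skill, MAX_SKILL_LEN["kitchen"], random.Random(0))
short = crop_skill([0, 1, 2], MAX_SKILL_LEN["kitchen"])
```

Every crop has exactly the maximum length and preserves the temporal order of the original actions.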

As discussed in Section 4.2, given the variable lengths of action sequences $\bar{a}$, the decoder $p_a(\bar{a} \mid z, d)$ is trained to generate a continuous skill progress prediction value $l$ at each timestep. This value represents the proportion of the skill completed at the current time. During online policy rollouts, the execution of the skill is halted when $l$ reaches 1. To learn this progress prediction value, we assign a label $y_t$ to each time step of a sequence based on its position in the sequence. Specifically, $y_t$ is set to $\frac{t}{N}$ for each time step $t$, where $N$ represents the sequence length. We train this prediction with the standard mean-squared-error loss. This ensures that the model learns to predict the end of an action sequence while also receiving dense, per-timestep supervision during training.
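A minimal sketch of the progress labels, the regression loss, and the resulting stopping rule is given below. It assumes that $t$ is indexed from 1 so that the final label equals 1; the function names are hypothetical.

```python
def progress_labels(n):
    """Targets y_t = t / N for t = 1, ..., N (1-based indexing is an assumption)."""
    return [t / n for t in range(1, n + 1)]


def mse(predictions, targets):
    """Mean-squared error between predicted and target progress values."""
    return sum((p - y) ** 2 for p, y in zip(predictions, targets)) / len(targets)


def steps_executed(progress_predictions, max_len):
    """Number of low-level actions run before the skill is halted.

    Execution stops at the first step whose predicted progress reaches 1,
    or after max_len steps if that never happens.
    """
    for step, l in enumerate(progress_predictions[:max_len], start=1):
        if l >= 1.0:
            return step
    return min(len(progress_predictions), max_len)


labels = progress_labels(4)
loss = mse([0.2, 0.5, 0.8, 1.0], labels)
halted_early = steps_executed([0.3, 0.7, 1.02, 1.0], max_len=30)
hit_cap = steps_executed([0.1] * 40, max_len=30)
```

Because every timestep has its own target, the decoder is supervised along the whole sequence rather than only at its final step.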

Additional target task fine-tuning.

Optionally, for very difficult tasks, some target-task demonstrations may be needed [56, 60, 59]. We perform additional target task fine-tuning in LIBERO [59]. We use the learned clustering model that was trained to cluster the original dataset $\mathcal{D}$ to directly assign labels to the task-specific dataset $\mathcal{D}_{\mathcal{M}}$ without updating the clustering algorithm parameters (see line 6 of Algorithm 1). Then, we fine-tune the entire model, $q, p_a, p_d, p_z$, with the same objective in Equation 2 on the labeled target-task dataset $\mathcal{D}_{\mathcal{M},d}$.
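Labeling the target-task data with a frozen clustering model amounts to a nearest-centroid lookup. The sketch below assumes K-means centroids are available as plain lists and uses toy 2-D values; `assign_frozen` is a hypothetical helper name.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_frozen(points, centroids):
    """Label each point with its nearest centroid; the centroids are never updated."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(p, centroids[k]))
        for p in points
    ]


# Toy centroids standing in for those learned on the pre-training dataset.
centroids = [[0.0, 0.0], [5.0, 5.0]]
# Toy embedding differences standing in for the target-task dataset.
target_diffs = [[0.3, -0.1], [4.6, 5.2], [5.1, 4.9]]
target_labels = assign_frozen(target_diffs, centroids)
```

Since the centroids are fixed, skill IDs in the target-task labels refer to the same clusters as in the pre-training data.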

B.1.3 Skill-Based Online RL

For online RL, we utilize the pre-trained skill decoder $p_a(\bar{a} \mid z, d)$ and the skill priors $p_d(d \mid s)$, $p_z(z \mid s, d)$ for skill-based policy learning (see Algorithm 4).

Policy learning.

Our skill-based policy $\pi(d, z \mid s)$ is parameterized as a product of a discrete skill selection policy $\pi_d(d \mid s)$ and a continuous argument selection policy $\pi_z(z \mid s, d)$ (see Equation 4). To train with actor-critic RL, we sum over the policy losses in each discrete skill dimension weighted by the probability of that skill, similar to the discrete SAC loss proposed by Christodoulou [69]:

$$\sum_{d} \pi_d(d \mid s) \Big( Q(s, z, d) - \alpha_z \text{KL}(\pi_z(z \mid s, d) \parallel p_z(\cdot \mid s, d)) - \alpha_d \text{KL}(\pi_d(d \mid s) \parallel p_d(\cdot \mid s)) \Big). \tag{5}$$

Meanwhile, critic losses are computed with the skill $d$ that the policy actually took. Our critic networks $Q(s, z, d)$ take the image $s$ and argument $z$ as input and have one output head for each of the discrete skills $d$.

We do not use automatic KL tuning (standard in SAC implementations [57]) as we found it to be unstable; instead, we manually set entropy coefficients $\alpha_d$ and $\alpha_z$ for the policy (Equation 4) and critic losses. In Kitchen, $\alpha_d = 0.1$, $\alpha_z = 0.01$; in LIBERO, $\alpha_d = 0.1$, $\alpha_z = 0.1$. These values are obtained by performing a search over $\alpha_d \in \{0.1, 0.01\}$ and $\alpha_z \in \{0.1, 0.01\}$.
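The weighted sum in Equation 5 can be evaluated numerically with a few lines of standard-library Python. The sketch below is illustrative only: it assumes diagonal Gaussian argument policies and priors (as in SPiRL-style implementations), treats `q_values[d]` as $Q(s, z_d, d)$ for a $z_d$ sampled from $\pi_z(\cdot \mid s, d)$, and uses the Kitchen coefficients as defaults. All function names are hypothetical.

```python
import math


def kl_categorical(pi, p):
    """KL(pi || p) for two discrete distributions given as probability lists."""
    return sum(a * math.log(a / b) for a, b in zip(pi, p) if a > 0)


def kl_diag_gaussian(mu_q, std_q, mu_p, std_p):
    """KL(N(mu_q, std_q^2) || N(mu_p, std_p^2)) for diagonal Gaussians."""
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p)
    )


def policy_objective(pi_d, p_d, q_values, kl_z, alpha_d=0.1, alpha_z=0.01):
    """Equation 5: per-skill terms weighted by the skill probabilities pi_d."""
    kl_d = kl_categorical(pi_d, p_d)
    return sum(
        prob * (q - alpha_z * kz - alpha_d * kl_d)
        for prob, q, kz in zip(pi_d, q_values, kl_z)
    )


# Policy equal to the priors: both KL terms vanish, leaving the expected Q-value.
no_penalty = policy_objective(
    pi_d=[0.5, 0.5], p_d=[0.5, 0.5], q_values=[1.0, 3.0], kl_z=[0.0, 0.0]
)
# Same Q-values for both skills, but the skill policy has drifted from the prior.
penalized = policy_objective(
    pi_d=[0.9, 0.1], p_d=[0.5, 0.5], q_values=[1.0, 1.0], kl_z=[0.0, 0.0]
)
unit_shift_kl = kl_diag_gaussian([1.0], [1.0], [0.0], [1.0])
```

When the Q-values do not favor any skill, moving $\pi_d$ away from $p_d$ only lowers the objective, which is how the prior regularizes exploration early in training.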

B.2 Baseline Implementation Details

Oracle.

Our oracle baseline is RAPS [9]. We run RAPS to convergence and report final performance numbers because its expert-designed skills operate on a different control frequency; it takes hundreds of times more low-level actions per environment rollout. We only evaluated this method on Franka Kitchen because the authors did not evaluate on our other environments, and we found implementing and tuning their hand-designed primitives for other environments to be non-trivial.

SPiRL.

We adapt SPiRL, implemented on top of SAC [57], to our image-based settings and environments using their existing code to ensure the best performance. For each environment, we tuned SPiRL parameters (entropy coefficient, automatic entropy tuning, network architecture, etc.) first and then built our method upon the final SPiRL network architecture to ensure the fairest comparison. SPiRL uses the exact same datasets as ours but without skill labels. We also experimented with changing the length of SPiRL action sequences, and similar to what was reported in Pertsch et al. [6], we found that a fixed length of $10$ worked best. We also found fixed KL divergence coefficients to perform better with SPiRL for our environments than automatic KL tuning.

BC.

We implement behavior cloning with network architectures similar to ours and using the same datasets. Our BC baseline learns an image-conditioned policy $\pi(a \mid s)$ that directly imitates single-step environment actions. We fine-tune pre-trained BC models for online RL with SAC [57].

SAC.

We implement Soft Actor-Critic [57] directly operating on low-level environment actions with an identical architecture to the BC baseline. It does not pre-train on any data.

B.3 Environment Implementation Details

(a) Franka Kitchen
(b) LIBERO
Figure 8: Our two image-based, continuous control robotic manipulation evaluation domains. (a) Franka Kitchen: The robot must learn to execute an unseen sequence of 4 sub-tasks in a row. (b) LIBERO: We evaluate 4 task suites of 10 tasks, each consisting of long-horizon, unseen tasks with new object, spatial, and goal transfer scenarios.

Franka Kitchen.

We use the Franka Kitchen environment from the D4RL benchmark [58] originally published by Gupta et al. [22] (see Figure 8(a)). The pre-training dataset comes from the “mixed” dataset in D4RL consisting of 601 human teleoperation trajectories each performing 4 subtasks in sequence in the environment (e.g., open the microwave). Our evaluation task comes from Pertsch et al. [6], where the agent has to perform an unseen sequence of 4 subtasks. The original dataset contains ground truth environment states and actions; we create an image-action dataset by resetting to ground truth states in the dataset and rendering the corresponding images. For all methods, we perform pre-training and RL with 64x64x3 RGB images and a framestack of 4. A sparse reward of 1 is given for each subtask, for a maximum return of 4. The agent outputs 7-dimensional joint velocity actions along with a 2-dimensional continuous gripper opening/closing action. Episodes have a maximum length of 280 timesteps.

LIBERO.

LIBERO [59] is a continual learning benchmark built upon Robosuite [70] (see Figure 8(b)). For skill extraction and policy learning, we use the agentview_rgb 3rd-person camera view images provided by the LIBERO datasets and environment. For pre-training, we use the LIBERO-90 pre-training dataset consisting of 4500 demonstrations collected from 90 different environment-task combinations each with 50 demonstrations. We condition all methods on 84x84x3 RGB images with a framestack of 2 along with language instructions provided by LIBERO. We condition methods on language by embedding instructions with a pre-trained, frozen sentence embedding model [71], all-MiniLM-L6-v2, to a single $384$-dimensional embedding and then feeding it to the policy. For \method, $q, p_z, p_a, p_d$ are all additionally conditioned on the language embedding, and thus the skill-based policy is also conditioned on language. We also condition all networks in all baselines on this language embedding in addition to their original inputs.

When performing additional fine-tuning to LIBERO-{10, Goal, Spatial, Object}, for all methods (except SAC) we use the given task-specific datasets each containing 50 demonstrations per task before then performing online RL. In LIBERO-{Goal, Spatial, Object}, sparse reward is provided upon successfully completing the task, so the maximum return is 1.0. In LIBERO-10, tasks are longer-horizon and consist of two subtasks, so we provide rewards at the end of each subtask for a maximum return of 2.0. Episodes have a max length of 300 timesteps.

Appendix C Additional Experiments and Qualitative Visualizations

In this section, we perform additional experiments and ablation studies. In Section C.1, we visualize 2D PCA plots of clusters generated by \method in all environments. In Section C.2, we analyze statistics of the skill distributions generated by \method.

C.1 Additional PCA Cluster Visualizations

(a) Franka Kitchen.
(b) LIBERO.
Figure 9: 100 randomly sampled trajectories from all environment pre-training datasets after being clustered into skills and visualized in 2D with PCA. Clusters are well-separated, even in just 2 dimensions with a linear transformation.

Here we display PCA skill cluster visualizations in all environments in Figure 9. Franka Kitchen clusterings are very distinguishable, even in 2 dimensions (this is the same embedding plot as in Figure 4 in the main paper). LIBERO-90 clusters still demonstrate clear separation, but are not as separable after being projected down to 2 dimensions (from 2048 original dimensions). However, in Figure 12 we clearly see distinguishable behaviors among different skills in LIBERO.

C.2 Visualizing Cluster Statistics

(a) Franka Kitchen, $K=8$.
(b) LIBERO, $K=8$.
Figure 10: Skill/clustering statistics in all environments. We use the R3M VLM [47] and $K=8$ for K-means. The top plots are skill length histograms for all skill trajectories combined, middle plots correspond to box-and-whisker plots with skill ID on the x-axis and lengths on the y-axis, and the bottom plots represent distributions of skill lengths separated by color for each skill ID.

We visualize skill clustering statistics in all pre-training environments in Figure 10. The plots demonstrate that average skill lengths are about 30 timesteps for all environments and that there is clear separation among the different skills just in terms of the distributions of skill lengths that they cover. For a qualitative look at the skills, see Appendix D.

Appendix D Visualizing skill trajectories

Here, we visualize skill trajectories in all environments. In Figure 11, we visualize purely randomly sampled clusters (i.e., without any cherry-picking) in Franka Kitchen, where we see skills are generally semantically aligned. For example, skill 3 trajectories correspond to manipulating knobs, skill 5 trajectories reach for the microwave door, and skill 7 trajectories reach for the cabinet handle.

We visualize LIBERO skills in Figure 12, where we can also see that skills are generally aligned.

[Image grid: two sampled skill trajectories for each of Cluster 0 through Cluster 7 in Franka Kitchen]
Figure 11: Kitchen skill visualizations. We randomly sample 2 labeled skill trajectories (no cherry-picking) and visualize the trajectory’s images in sequence after labeling with \method’s skill extraction phase. Clusters are generally semantically aligned.

[Image grid: two sampled skill trajectories for each of Cluster 0 through Cluster 7 in LIBERO]
Figure 12: LIBERO. We randomly sample 2 labeled skill trajectories (no cherry-picking) and visualize the trajectory’s images in sequence after labeling with \method’s skill extraction phase. Clusters are generally semantically aligned.

Appendix E \method RL Performance Analysis

Figure 13: Skill lengths histogram of actually used \method skills in Franka Kitchen at training convergence. As explained in Section B.1, we limit skill execution lengths to 30 in Franka Kitchen.

Our method’s performance improvement over SPiRL is likely due to two reasons: longer average skills and a semantically structured skill-space instead of the random latent skills that SPiRL learns. In Section 5.3 we analyze the semantically structured skill-space. Here, we additionally analyze the longer average skills.

As plotted in Appendix Figure 10, \method extracts skills of various lengths, many of which are quite long. This translates into longer-executed skills: we plot a histogram of the lengths of the skills the skill-based policy actually learns to use at convergence in Franka Kitchen in Figure 13. \method-executed skills average 25 timesteps in length as compared to 10 for SPiRL. We experimented with longer skill lengths for SPiRL, but online RL performance suffered, a finding consistent with results presented in their paper [6].

Longer skills shorten the effective time horizon of the task by a factor of the average skill length, because the skill-based agent operates on an MDP in which transitions are defined by the end of execution of a skill, which can be composed of many low-level environment actions. Shortening the task time horizon can improve the learning efficiency of temporal-difference learning RL algorithms [72] by, for example, reducing the accumulation of value function bootstrapping error, as there are fewer timesteps between a sparse reward signal and the starting state.
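As a rough worked example using numbers reported in this appendix (Franka Kitchen episodes last at most 280 timesteps, and executed skills average 25 timesteps for \method versus a fixed 10 for SPiRL), the number of skill-level decisions per episode is approximately the episode length divided by the average skill length. The symbols $H$ (episode length), $\bar{L}$ (average skill length), and $H_{\text{skill}}$ (effective horizon) are introduced here only for illustration, and the figures are a back-of-the-envelope estimate rather than a measured quantity:

```latex
\[
H_{\text{skill}} \approx \frac{H}{\bar{L}}, \qquad
\frac{280}{25} \approx 11 \ \text{(\method)}, \qquad
\frac{280}{10} = 28 \ \text{(SPiRL)}.
\]
```

Under these assumptions, a sparse reward at the end of an episode has to be propagated through roughly 11 bootstrapped value updates instead of 28.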