When Do Skills Help Reinforcement Learning?
A Theoretical Analysis of Temporal Abstractions
Abstract
Skills are temporal abstractions that are intended to improve reinforcement learning (RL) performance through hierarchical RL. Despite our intuition about the properties of an environment that make skills useful, a precise characterization has been absent. We provide the first such characterization, focusing on the utility of deterministic skills in deterministic sparse-reward environments with finite action spaces. We show theoretically and empirically that RL performance gain from skills is worse in environments where solutions to states are less compressible. Additional theoretical results suggest that skills benefit exploration more than they benefit learning from existing experience, and that using unexpressive skills such as macroactions may worsen RL performance. We hope our findings can guide research on automatic skill discovery and help RL practitioners better decide when and how to use skills.
1 Introduction
In most real-world sequential decision making problems, agents are only given sparse rewards for their actions. This makes reinforcement learning (RL) challenging, as agents can only recognize good behavior after long sequences of good decisions. This issue can be mitigated by leveraging temporal abstractions (Sutton et al., 1999), also known as skills. A skill is a high-level action — such as a fixed sequence of actions (macroaction) or a sub-policy with a termination condition (option) — that is expected to be useful in a large number of states. Skills can be hand-engineered to perform subtasks (Pedersen et al., 2016; He et al., 2011) or learned from experience (Machado et al., 2017; Bacon et al., 2017; Barreto et al., 2019; Kipf et al., 2019; Jiang et al., 2022; Li et al., 2022). Incorporating skills into the agent’s action space (hierarchical RL) allows it to act at a higher level and reach goals in fewer steps, which may improve exploration and thus RL performance.
Despite their appeal, skills have not seen widespread use. In fact, they were not involved in most major breakthroughs and applications of RL, such as surpassing human-level performance in all Atari games (Badia et al., 2020), RLHF for aligning LLMs with human preferences (Ouyang et al., 2022), AlphaTensor for faster matrix multiplication (Fawzi et al., 2022), and AlphaDev for faster sorting (Mankowitz et al., 2023). A reason skills have not been widely adopted is that they sometimes do not improve RL performance and it is unclear how to determine beforehand whether they would. While several methods have been developed to automatically discover skills, most of them require the practitioner to decide whether to use skills at all. To our knowledge, LEMMA (Li et al., 2022) is the only algorithm that automatically decides whether skills are useful by learning the optimal number of skills — zero would mean that skills do not help. However, this is accomplished by optimizing a heuristic objective that does not necessarily reflect the benefits to RL. Other skill discovery algorithms such as Option-Critic (Bacon et al., 2017), eigenoptions (Machado et al., 2017), deep skill chaining (Bagaria & Konidaris, 2019), LOVE (Jiang et al., 2022) and COPlanLearn (Nayyar et al., 2023) determine the number of skills using a hyperparameter. A better understanding of how exactly skills benefit RL may guide research in automatically determining whether skills would be useful in an environment and the optimal number to learn if they are. Such an understanding can also provide insight into why skills do not work in certain environments as well as help practitioners better decide whether to use skills for a given RL task.
Our work provides a theoretical analysis of when and how skills and hierarchical RL benefit RL performance in deterministic sparse-reward environments. We hope our insights will serve to guide research in automatic skill discovery including the automatic determination of whether to use skills, and allow practitioners to better understand the kinds of environments where skills are helpful. In summary, we make the following contributions:
• We define two metrics, $p$-exploration difficulty and $p$-learning difficulty, that quantify the hardness of exploration and of learning from experience in a deterministic sparse-reward environment with a finite action space. We show empirically that these metrics correlate strongly with the sample complexity of several RL algorithms (Section 3).
• We define two closely related metrics that measure the incompressibility of solutions to states generated by the environment. Under mild assumptions, we prove lower bounds on the change in $p$-learning difficulty and $p$-exploration difficulty due to deterministic skills in terms of the incompressibility measures. We show that skills are better suited to decreasing $p$-exploration difficulty than $p$-learning difficulty, and that less expressive skills are less apt at decreasing the difficulty metrics. In particular, for each difficulty metric, we demonstrate the existence of environments where incorporating macroactions provably increases it (Sections 4 and 5).
• We show empirically that macroactions and deep neural options are less beneficial in environments with higher incompressibility (Section 6).
• We describe how to derive skill learning objectives from our incompressibility metrics (Section 7).
All proofs are found in Appendix E. Code for the experiments is publicly available at https://github.com/uranium11010/rl-skill-theory.
2 Preliminary Definitions
We first introduce basic definitions related to deterministic sparse-reward Markov decision processes (MDPs), which are the focus of this paper. We choose to focus on sparse-reward environments since skills are purported to alleviate the sparse-reward problem. Despite our focus on deterministic environments, a large number of environments both in the standard RL literature (e.g., the original Atari game environments (Bellemare et al., 2013) and MuJoCo (Todorov et al., 2012)) and in applications of RL (e.g., program synthesis (Ellis et al., 2019; Mankowitz et al., 2023) and mathematical reasoning (Kaliszyk et al., 2018; Poesia et al., 2021; Wu et al., 2021)) are deterministic. Furthermore, by focusing on a special case of MDPs, our hardness results — lower bounds on the change in difficulty due to skills — suggest that improving RL using skills in the general case of stochastic environments can be at least as hard. Finally, Section F.1 provides preliminary results on generalizing to stochastic environments, suggesting that many insights obtained from studying deterministic environments apply to stochastic ones as well.
Definition 2.1.
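A deterministic sparse-reward MDP (DSMDP) is defined by a 4-tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, g)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $T$ is the deterministic transition function, and $g \in \mathcal{S}$ is the goal state.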
Note that environments that have multiple goal states can also be formulated as DSMDPs by merging these goal states into a single goal state. The CompILE2 environment introduced in Section 3.3 is one such example — see Appendix B for more details.
Borrowing terminology commonly used in symbolic reasoning domains, we say “solve a state” as a shorthand for “finding a sequence of actions that lead to the goal state,” and we call such a sequence of actions a solution. This is formalized below.
Definition 2.2.
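A solution to a state $s$ of a DSMDP $\mathcal{M}$ is a sequence of actions $a_1 a_2 \cdots a_n$ such that applying the sequence of actions starting in $s$ results in the goal state $g$:
$$T(s, a_1 a_2 \cdots a_n) = g, \tag{1}$$
where $T(s, \sigma)$ denotes the result of applying action sequence $\sigma$ to state $s$. Here, $n$ is called the length of the solution. We will denote by $\mathrm{Sol}_{\mathcal{M}}(s)$ the set of solutions to $s$ and by $d_{\mathcal{M}}(s)$ the length of a shortest solution to $s$.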
Note that a state can have no solutions. For example, in domains where we’d like to formalize the notion of “death,” one could transition to a “dead state” that goes to itself for all actions taken, and that dead state has no solutions. In contrast, states that have at least one solution are called solvable states.
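As a concrete illustration of Definitions 2.1 and 2.2, the following minimal sketch computes shortest solution lengths by breadth-first search backward from the goal. It assumes a finite state space; the corridor environment, the `step` function, and the name `shortest_solution_lengths` are hypothetical and chosen only for illustration.

```python
from collections import deque

def shortest_solution_lengths(states, actions, step, goal):
    """Map each solvable non-goal state to the length of a shortest solution."""
    # predecessors[t] holds every state s with step(s, a) == t for some action a.
    predecessors = {s: set() for s in states}
    for s in states:
        if s == goal:
            continue  # episodes end at the goal, so it has no outgoing transitions
        for a in actions:
            predecessors[step(s, a)].add(s)
    dist = {goal: 0}
    queue = deque([goal])
    while queue:  # breadth-first search backward from the goal
        t = queue.popleft()
        for s in predecessors[t]:
            if s not in dist:
                dist[s] = dist[t] + 1
                queue.append(s)
    del dist[goal]
    return dist  # unsolvable states (e.g. a dead state) are simply absent

# Hypothetical corridor: states 0..3 with the goal at 3, plus an absorbing dead state.
states = [0, 1, 2, 3, "dead"]
actions = ["L", "R"]
goal = 3

def step(state, action):
    if state == "dead":
        return "dead"  # the dead state goes to itself under every action
    if action == "R":
        return state + 1
    return "dead" if state == 0 else state - 1  # walking off the left edge is fatal

lengths = shortest_solution_lengths(states, actions, step, goal)
print(lengths)
```

The dead state never appears in the result, which matches the remark that such a state has no solutions.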
Some results in this paper assume that no two states share a solution, a property we call solution separability.
Definition 2.3.
A DSMDP is solution-separable if no sequence of actions is a solution to more than one state.
Any DSMDP with invertible transitions is solution-separable. Here, we say a DSMDP has invertible transitions if $s = s'$ whenever $T(s, a) = T(s', a)$ and $T(s, a)$ is either solvable or the goal. Examples include (a) all twisty puzzles such as the Rubik’s cube; (b) grid world domains where taking a vacuous action (e.g., walking into a wall or picking up a non-existent object) leads to instant death; (c) sliding puzzles where taking a vacuous action leads to instant death.
The following definition formalizes RL in the episodic setting as applied to a DSMDP.
Definition 2.4.
In reinforcement learning (RL) in the episodic setting, an agent interacts with an environment (MDP) in episodes to learn a policy that optimizes the expected cumulative reward from one episode. For a DSMDP, the optimal policy is
(2) |
Here, $s$ is drawn from the initial state distribution and $\gamma$ is the discount factor. The rollout of policy $\pi$ starting in state $s$ stops when either the goal state is reached or $H$ actions have been taken, where $H$ is called the horizon and is sometimes considered part of the definition of an MDP. Note that when $\gamma = 1$, Equation 2 becomes maximizing the probability that the policy solves $s$.
Now, we introduce skills. Whereas skills need not be deterministic in general, we are studying deterministic environments and will thus focus on deterministic skills.
Definition 2.5.
A deterministic skill in a DSMDP is a function from states to finite action sequences. In other words, for each state, we specify the sequence of actions to be taken if the agent initiates the skill in that state. Note that this sequence is allowed to be empty.
We will refer to deterministic skills as simply “skills.”
The prototypical example of an unexpressive class of skills is macroactions.
Definition 2.6.
A macroaction is a skill that produces the same sequence of actions of length greater than 1 regardless of the state in which the skill is initiated.
Incorporating skills into a DSMDP is called a skill augmentation, which is more precisely defined below.
Definition 2.7.
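A DSMDP $\mathcal{M} = (\mathcal{S}, \mathcal{A}, T, g)$ augmented with a finite set of skills $\mathcal{Z}$ is the DSMDP $\mathcal{M}_+ = (\mathcal{S}, \mathcal{A}_+, T_+, g)$ where $\mathcal{A}_+ = \mathcal{A} \cup \mathcal{Z}$, $T_+(s, a) = T(s, a)$ for $a \in \mathcal{A}$, and $T_+(s, z) = T(s, z(s))$ for $z \in \mathcal{Z}$. We say $\mathcal{M}_+$ is the $\mathcal{A}_+$-skill augmentation of $\mathcal{M}$. We call $\mathcal{A}$ the base action space and $\mathcal{A}_+$ the skill-augmented action space. Furthermore, if $\mathcal{A}$ is a proper subset of $\mathcal{A}_+$, then we say the skill augmentation is strict.
(Footnote 1: Technically, $T_+$ is a partial function, as $T_+(s, z)$ is undefined if unrolling the skill reaches the goal state before the unrolling finishes. Thus, in this case, the agent is considered not to have reached the goal state. However, our HRL implementation in our experiments follows the more common convention that the agent is considered successful in this situation.)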
For simplicity, when discussing a base environment and its skill augmentation , we will abuse notation by writing subscripts “” or “” in places where they should really be “” or “”, such as and for and . We allow repetition of skills and skills are also allowed to overlap with base actions. In such cases, and should be interpreted as multisets.
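The skill augmentation of Definition 2.7 can be sketched in a few lines. The snippet below is only an illustration: skills are represented as Python functions from states to tuples of base actions, macroactions are the special case that ignores the state, and an undefined transition (the skill passes through the goal before it finishes, as in the footnote to Definition 2.7) is represented by `None`. The names `augment`, `macroaction`, and the corridor environment are hypothetical.

```python
def macroaction(*seq):
    """A macroaction returns the same base-action sequence in every state."""
    return lambda state: seq

def augment(actions, step, goal, skills):
    """Return the augmented action list and the augmented transition function.

    `skills` maps a skill name to a function from states to base-action sequences.
    """
    aug_actions = list(actions) + list(skills)

    def aug_step(state, action):
        if action not in skills:
            return step(state, action)  # base actions behave as before
        for base_action in skills[action](state):
            if state == goal:
                return None  # goal reached before the skill finished: undefined
            state = step(state, base_action)
        return state

    return aug_actions, aug_step

# Hypothetical corridor with states 0..4 and the goal at 4.
goal = 4

def step(state, action):
    return min(state + 1, goal) if action == "R" else max(state - 1, 0)

aug_actions, aug_step = augment(["L", "R"], step, goal, {"RR": macroaction("R", "R")})
print(aug_actions)        # ['L', 'R', 'RR']
print(aug_step(0, "RR"))  # 2
print(aug_step(2, "RR"))  # 4
print(aug_step(3, "RR"))  # None
```

Returning `None` follows the convention used in the theory; an implementation that treats passing through the goal as success would return `goal` at that point instead.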
3 Quantifying RL Difficulty in a Deterministic Sparse-Reward Environment
To study how much skills can reduce the difficulty of applying RL to a DSMDP, we need to first quantify this difficulty. Unfortunately, existing MDP difficulty metrics fail to capture RL difficulty in DSMDPs since they were not designed to directly estimate sample efficiency or regret, but instead appear in loose asymptotic performance bounds of RL algorithms (see Appendix A for a brief survey). As a result, they correlate poorly with actual performance measures like total regret (Conserva & Rauber, 2022). We therefore aim to develop difficulty metrics for DSMDPs by directly estimating an RL performance measure — in our case, sample efficiency — and to verify them empirically.
Below, we introduce two metrics quantifying the difficulty of applying RL to a deterministic sparse-reward environment, assuming that the environments compared have the same state space (e.g., they are different skill augmentations of the same base environment). We motivate these metrics using heuristic arguments that estimate the sample efficiency of an RL agent in the episodic setting without assuming any particular RL algorithm. We then experimentally test how well the metrics correlate with the sample efficiency of 4 popular RL algorithms in 32 macroaction augmentations of each of 4 base environments.
3.1 Quantifying Difficulty in Learning from Experience
To quantify the complexity of learning a DSMDP from existing experience, suppose that the agent has gathered enough experience to effectively reduce the remaining learning problem to a planning problem. Then Lemma 3.1 shows that the number of iterations through the entire state space needed to learn the value of a state is linear in the minimum length of a solution to that state.
Lemma 3.1.
Suppose we apply value iteration with discount rate and learning rate to a DSMDP with a finite action space. In particular, we initialize for and , and at time , we update the entire table using
(3) |
If , then the number of time steps until the value of a solvable state becomes its true value (i.e., ) is . If , then the number of time steps until the value of a solvable state is within of its true value (i.e., ) is
Since each iteration has a complexity of , the total complexity for learning the value of a state is for constant . If we apply the same intuition to the RL setting, then we would expect that learning the optimal policy at a state requires “iterations,” where one “iteration” involves the agent sampling experiences that effectively cover the entire space of state-action pairs. Thus, as a rough estimation, approximately samples are needed to learn the policy at state . Here, is some effective size of the state space, counting only those states that we “care about,” i.e., those with positive or that are part of (short) solutions to states with positive . For constant , this estimation of the sample complexity motivates using a weighted average of over states to measure the complexity of learning from experience.
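The learning-rate-1 case of Lemma 3.1 is easy to check numerically. The sketch below runs synchronous value iteration on a hypothetical corridor and records, for each state, the first sweep at which its value matches the converged value. It assumes the convention that the goal has value 1 and all other values start at 0; the exact form of Equation 3 may differ, and all names are illustrative.

```python
import math

def sweeps_until_exact(states, actions, step, goal, gamma=0.9, sweeps=50):
    """First sweep (1-indexed) at which each non-goal state's value is exact."""
    V = {s: 0.0 for s in states}
    V[goal] = 1.0
    history = []
    for _ in range(sweeps):
        # Synchronous update: every entry is computed from the previous table.
        V = {s: 1.0 if s == goal else gamma * max(V[step(s, a)] for a in actions)
             for s in states}
        history.append(V)
    true_V = history[-1]  # assumed converged; holds here because the corridor is tiny
    return {s: next(t for t, Vt in enumerate(history, start=1)
                    if math.isclose(Vt[s], true_V[s], rel_tol=1e-12))
            for s in states if s != goal}

# Hypothetical corridor: states 0..3, goal at 3, "L" is a no-op at the left wall.
states = [0, 1, 2, 3]
actions = ["L", "R"]
goal = 3

def step(state, action):
    return state + 1 if action == "R" else max(state - 1, 0)

first_exact = sweeps_until_exact(states, actions, step, goal)
print(first_exact)  # {0: 3, 1: 2, 2: 1}
```

In this example the number of sweeps equals the shortest solution length of each state, which is the linear dependence the lemma describes.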
Definition 3.2.
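Let $\mathcal{M}$ be a DSMDP with finite action space $\mathcal{A}$. For a probability distribution $p$ on solvable states, the $p$-learning difficulty of $\mathcal{M}$ is defined as
$$J_{\text{learn}}(\mathcal{M}; p) = |\mathcal{A}| \, \mathbb{E}_{s \sim p}\left[d_{\mathcal{M}}(s)\right], \tag{4}$$
where $d_{\mathcal{M}}(s)$ is the length of a shortest solution to $s$.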
The distribution $p$ assigns higher importance to states that we care more about learning to solve: $p(s)$ should be higher for states that are more likely under the initial state distribution of the MDP. For simplicity, we can just take $p$ to be the initial state distribution itself.
The $p$-learning difficulty can be viewed as a generalization of diameter (Auer et al., 2008). While the diameter of an MDP is originally defined for the continuous learning setting, a natural extension to the episodic setting for a DSMDP is the maximum over states of the shortest solution length, $\max_s d_{\mathcal{M}}(s)$. Ignoring the $|\mathcal{A}|$ factor, this is the $p$-learning difficulty when $p$ is zero for all but the state(s) with the largest $d_{\mathcal{M}}(s)$.
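To make the tradeoff in Definition 3.2 tangible, the sketch below evaluates the learning difficulty as the action-space size times the $p$-weighted mean shortest solution length, which is how the surrounding text motivates the metric. The shortest solution lengths are entered by hand for a hypothetical corridor with states 0 to 3, a goal one step to the right of state 3, and base actions left and right; macroactions that would overshoot the goal are treated as unusable.

```python
def learning_difficulty(num_actions, p, d):
    """Action-space size times the p-weighted mean shortest solution length."""
    return num_actions * sum(weight * d[s] for s, weight in p.items())

# Uniform weighting over the four non-goal states of the hypothetical corridor.
p = {s: 0.25 for s in range(4)}

d_base = {0: 4, 1: 3, 2: 2, 3: 1}    # base actions only: "L", "R"
d_rr = {0: 2, 1: 2, 2: 1, 3: 1}      # with the macroaction "RR" added
d_rr_rrr = {0: 2, 1: 1, 2: 1, 3: 1}  # with both "RR" and "RRR" added

print(learning_difficulty(2, p, d_base))    # 5.0
print(learning_difficulty(3, p, d_rr))      # 4.5
print(learning_difficulty(4, p, d_rr_rrr))  # 5.0
```

Adding one macroaction lowers the metric, while adding a second one shortens solutions further but is cancelled out by the larger action space.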
3.2 Quantifying Difficulty in Exploration
$p$-learning difficulty does not take into account the complexity of gathering the needed experience: learning a state $s$ starts to take place only after the agent has seen state-action pairs that form a chain leading from $s$ to the goal state. Thus, as a simplification, an agent’s learning process in the episodic setting can be roughly divided into two stages: the first stage is dominated by exploration, where the agent tries to find reward signal and gather experience; the second stage is dominated by learning, where the agent learns from the experience. The sample efficiency of the learning stage is captured by the $p$-learning difficulty. Let us now motivate the definition of $p$-exploration difficulty by estimating the sample efficiency of the exploration stage.
Suppose that the initial exploration policy is a uniformly random policy, and let $q(s)$ denote the probability that such a policy solves $s$ in one episode. Assuming that the policy remains roughly uniform until the agent finally solves $s$ for the first time, the expected number of episodes until this happens is $1/q(s)$, and the number of environment steps taken is roughly $H/q(s)$, where $H$ is the horizon. To obtain an upper bound on the expected total number of steps taken to find a solution to every state, we simply sum this expression over all states to arrive at $\sum_s H/q(s)$. Note that this can be a significant overestimate of the true sample complexity: solving a state $s$ often updates the agent in a way that helps it solve states whose solutions pass through $s$. We will address this issue later.
For a constant horizon and state space size, $\sum_s H/q(s) \propto \mathbb{E}_{s \sim p}[1/q(s)]$, where $p$ is a uniform distribution over all states. As with the $p$-learning difficulty, we generalize this to allow different weights to be assigned to different states. For example, if a state $s$ has small $q(s)$ but the MDP’s initial state distribution assigns almost zero probability to $s$, then we can afford not to learn to solve $s$, and this can be reflected by having $p(s) \approx 0$. For simplicity, we can set $p$ to the initial state distribution, as with the $p$-learning difficulty.
We now address the issue of overestimating the sample complexity. In practice, this overestimation is more significant when the values of $q(s)$ for different $s$ are more disparate. In DSMDPs where states vary in difficulty (vary in $q(s)$), solving easy states (states with large $q(s)$) generally updates the agent in a way that helps it find solutions to harder states (states with small $q(s)$). For this reason, we find empirically (Section D.2) that the arithmetic mean $\mathbb{E}_{s \sim p}[1/q(s)]$ is outperformed by the geometric mean $\exp \mathbb{E}_{s \sim p}[\log(1/q(s))]$, which is lower than the arithmetic mean when there is variety in $q(s)$. Although this estimation of exploration sample complexity is quite rough, it is difficult to make better estimates without knowing details of the MDP structure and RL algorithm. Also, the resultant definition of $p$-exploration difficulty already performs well empirically on several environments for several RL algorithms (Section 3.3).
Finally, we take the logarithm of the geometric mean, as that simplifies notation in our theoretical results. We also replace the fixed horizon $H$ with a random horizon sampled from a geometric distribution to simplify theoretical analysis.
Definition 3.3.
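Let $\mathcal{M}$ be a DSMDP with finite action space $\mathcal{A}$. For a probability distribution $p$ on solvable states and $\delta \in (0, 1)$, the $\delta$-discounted $p$-exploration difficulty of $\mathcal{M}$ is defined as
$$J_{\text{explore}}(\mathcal{M}; p, \delta) = -\mathbb{E}_{s \sim p}\left[\log q_{\delta}(s)\right], \tag{5}$$
where
$$q_{\delta}(s) = \sum_{\sigma} \left(\frac{1 - \delta}{|\mathcal{A}|}\right)^{|\sigma|}, \tag{6}$$
with the sum ranging over all solutions $\sigma$ to $s$ and $|\sigma|$ denoting the length of $\sigma$,
is the probability that the following policy solves $s$: at every time step, terminate with probability $\delta$ and choose an action uniformly at random with probability $1 - \delta$. $q_{\delta}(s)$ is also the probability that the uniformly random policy solves $s$ within a horizon of length $H$, where $H$ is sampled from the geometric distribution with parameter $\delta$.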
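The probability in Definition 3.3 can be computed without enumerating solutions. The sketch below is one possible approach under stated assumptions: it reads the definition as a fixed-point equation in which the goal has solve probability 1 and every other state averages the solve probabilities of its successors, scaled by the survival probability $1 - \delta$. It then takes the exploration difficulty to be the $p$-weighted mean of $-\log q_\delta(s)$, following the geometric-mean argument above. The chain environment and function names are hypothetical.

```python
import math

def solve_probabilities(states, actions, step, goal, delta, iters=200):
    """Fixed-point iteration for the probability that the random policy solves each state."""
    q = {s: 0.0 for s in states}
    q[goal] = 1.0
    for _ in range(iters):
        # Survive termination with probability 1 - delta, then act uniformly at random.
        q = {s: 1.0 if s == goal else
                (1 - delta) * sum(q[step(s, a)] for a in actions) / len(actions)
             for s in states}
    return q

def exploration_difficulty(p, q):
    """p-weighted mean of -log q(s), i.e. the log of the geometric mean of 1/q(s)."""
    return -sum(weight * math.log(q[s]) for s, weight in p.items() if weight > 0)

# Hypothetical chain: "R" advances toward the goal at 3, "X" leads to an absorbing dead state.
states = [0, 1, 2, 3, "dead"]
actions = ["R", "X"]
goal = 3

def step(state, action):
    if state == "dead" or action == "X":
        return "dead"
    return state + 1

delta = 0.1
q = solve_probabilities(states, actions, step, goal, delta)
p = {0: 1 / 3, 1: 1 / 3, 2: 1 / 3}
J = exploration_difficulty(p, q)
print(q[2], J)
```

In this chain each state has exactly one solution, so the solve probability is $((1 - \delta)/2)^{d(s)}$ and the difficulty reduces to the mean solution length times $\log(2/(1 - \delta))$, which gives a simple sanity check.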
3.3 Experiments
Algorithm | CliffWalking | CompILE2 | 8Puzzle | RubiksCube222 |
---|---|---|---|---|
Q-learning | 0.947 ± 0.006 | 0.792 ± 0.025 | 0.403 ± 0.036 | 0.857 ± 0.023 |
 | 0.953 ± 0.008 | 0.786 ± 0.023 | 0.671 ± 0.056 | 0.937 ± 0.003 |
Value iteration | 0.933 ± 0.009 | 0.825 ± 0.018 | 0.693 ± 0.051 | 0.785 ± 0.031 |
 | 0.951 ± 0.015 | 0.849 ± 0.013 | 0.885 ± 0.011 | 0.748 ± 0.029 |
REINFORCE | 0.949 ± 0.006 | 0.869 ± 0.013 | 0.678 ± 0.020 | 0.892 ± 0.029 |
DQN | 0.789 ± 0.028 | 0.758 ± 0.076 | 0.583 ± 0.039 | 0.753 ± 0.019 |
In motivating $p$-learning difficulty and $p$-exploration difficulty, we made significant approximations to estimate the sample complexity without assuming a particular environment or RL algorithm. Despite this, we show empirically that a combination of the two difficulty metrics predicts sample complexity well across a variety of environments and RL algorithms.
We study four deterministic sparse-reward environments: (a) CliffWalking, a simple grid world (Sutton & Barto, 2018); (b) CompILE2, the CompILE grid world with visit length 2 (Kipf et al., 2019); (c) 8Puzzle, the 8-puzzle; (d) RubiksCube222, the 2x2 Rubik’s cube. For the computation of -learning difficulty and -exploration difficulty to be feasible, needs to have finite support over a sufficiently small number of states ( or less). To mitigate this limitation, we chose environments for which there exist larger versions with a similar MDP structure. For example, the 2x2 Rubik’s cube should behave similarly to the 3x3 cube, 4x4 cube, etc., and the 8-puzzle should behave similarly to the 15-puzzle, 24-puzzle, etc.
Each environment has 32 action space variants, with one being the base environment (the trivial skill augmentation) and 31 with different sets of macroactions. One macroaction augmentation is calculated using LEMMA (Li et al., 2022) on offline data derived from breadth-first search; 5 are variations of that macroaction augmentation; and 25 are generated randomly. More details are given in Appendix B.
We evaluate how well a combination of -learning difficulty and -exploration difficulty captures the sample complexity of 4 RL algorithms on the different variants of each environment. The algorithms are: (a) Q-learning (Watkins, 1989); (b) Value iteration (Bellman, 1957), modified to the RL setting, similar to (Agostinelli et al., 2019); (c) REINFORCE (Williams, 1992), made tabular by parameterizing the policy directly with the logits of the actions; (d) Deep Q-networks (DQN) (Mnih et al., 2015).
According to Sections 3.1 and 3.2, we expect to scale roughly linearly with the sample complexity of learning from experience and to scale roughly linearly with the sample complexity of exploration. We thus choose a weighted average () to represent the combined difficulty. The discount used in the -exploration difficulty is set to , where is the environment’s horizon. The sample complexity and the combined difficulty spanned several orders of magnitude in CliffWalking and CompILE2, so we took the logarithm of both before computing their Pearson correlation coefficient. The value of was chosen to maximize this correlation. The results are summarized in Table 1. Most correlation values are at least around 0.7, demonstrating that combining -learning difficulty and -exploration difficulty allows us to capture a significant portion of the variation in RL sample efficiency on different action space variants of the same environment.
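The evaluation procedure described above can be sketched as a small grid search. The snippet assumes that the combined difficulty is a convex combination of the learning difficulty and the exponential of the exploration difficulty (the exponential undoes the logarithm taken in Definition 3.3); the weight is called `lam` here, and the exact combination used in the experiments may differ. All numbers are synthetic placeholders, not results from the paper.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def best_weight(j_learn, j_explore, sample_complexity, grid=101):
    """Grid-search the weight that maximizes the log-log Pearson correlation."""
    log_samples = [math.log(n) for n in sample_complexity]
    best_r, best_lam = -2.0, None
    for i in range(grid):
        lam = i / (grid - 1)
        combined = [lam * l + (1 - lam) * math.exp(e)
                    for l, e in zip(j_learn, j_explore)]
        r = pearson([math.log(c) for c in combined], log_samples)
        if r > best_r:
            best_r, best_lam = r, lam
    return best_r, best_lam

# Synthetic placeholder values for four action-space variants of one environment.
j_learn = [10.0, 8.0, 12.0, 9.0]
j_explore = [5.0, 4.0, 6.5, 4.5]
sample_complexity = [1000 * math.exp(e) for e in j_explore]  # exploration-dominated by construction

r, lam = best_weight(j_learn, j_explore, sample_complexity)
print(r, lam)
```

Because the weight is tuned on the same data used to report the correlation, the resulting coefficient is an optimistic estimate, which is worth keeping in mind when reading Table 1.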
We also conducted experiments to directly test Lemma 3.1 by computing the correlation between the number of iterations it takes value iteration to converge and the -weighted average solution length (Section D.1). In addition to state value iteration, we also considered Q-value iteration to simulate Q-learning. With two exceptions, all correlations are above 0.9, thus empirically corroborating Lemma 3.1.
4 Effect of Skills on Learning from Experience
Part of our goal is to understand what makes a particular set of skills helpful for an RL agent. One intuition articulated in prior work (Jiang et al., 2022; Kipf et al., 2019) is that skills help compress optimal trajectories, making them shorter and thus more likely to be found during exploration. But, conversely, data distributions can be provably incompressible when their entropy is too high (Cover, 1994). As a result, we expect that skills are less likely to be helpful when the distribution of optimal trajectories in the environment is incompressible. This intuition is made precise by Theorem 4.2, which states that the ratio between the new and old -learning difficulties after an -skill augmentation is lower-bounded by the product of an incompressibility measure and a factor penalizing large . Before stating the theorem, let’s first define this incompressibility measure.
Definition 4.1.
Let be a DSMDP with finite and its -skill augmentation. Let be a distribution over solvable states. The -merged -incompressibility is defined as
(7) |
Here, is the distribution of canonical shortest solutions in to states sampled from , where the canonical shortest solutions are chosen such that is maximized. Note that is the entropy of the state distribution after states with the same canonical solution in have been merged into one state. Thus, it has the property , where equality holds iff all states in the support of have different canonical solutions.
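The merging operation described here is straightforward to compute once canonical solutions are fixed. The sketch below takes a state distribution and a hypothetical assignment of canonical solutions, merges states that share a solution, and compares the merged entropy with the entropy of the original distribution. Entropies are in bits; the state names, skill names, and solutions are made up for illustration.

```python
import math
from collections import defaultdict

def entropy(dist):
    """Shannon entropy in bits of a dict mapping outcomes to probabilities."""
    return -sum(w * math.log2(w) for w in dist.values() if w > 0)

def merged_entropy(p, canonical_solution):
    """Entropy of p after merging states that share a canonical solution."""
    merged = defaultdict(float)
    for state, weight in p.items():
        merged[canonical_solution[state]] += weight
    return entropy(merged)

p = {"s0": 0.25, "s1": 0.25, "s2": 0.25, "s3": 0.25}

# With macroactions only, distinct states keep distinct canonical solutions.
with_macros = {"s0": ("RR", "RR"), "s1": ("R", "RR"), "s2": ("RR",), "s3": ("R",)}
# A hypothetical loop skill lets s0 and s1 share one canonical solution.
with_loop = {"s0": ("loopR",), "s1": ("loopR",), "s2": ("RR",), "s3": ("R",)}

print(entropy(p))                      # 2.0
print(merged_entropy(p, with_macros))  # 2.0
print(merged_entropy(p, with_loop))    # 1.5
```

The merged entropy never exceeds the entropy of the state distribution, and the two agree exactly when no two states in the support share a canonical solution, as in the macroaction case.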
-merged -incompressibility can be understood as the coding efficiency of using base actions to write solutions to states sampled from as opposed to using a code optimized for the distribution of shortest solutions with skills. More precisely, we can write
(8) |
where is the optimal expected number of bits needed to encode a (canonical) shortest solution in to a state , and denotes the cross entropy between and . is the distribution of shortest solutions to states sampled from containing only base actions. is a uniform prior over base action sequences. is thus the expected number of bits required to encode a shortest solution using a fixed-length code over base actions , optimized for a termination symbol that appears at the end of each time step with probability .
We now introduce the theorem, which shows how -merged -incompressibility can be used to bound how much skills in can improve -learning difficulty.
Theorem 4.2.
Let be the -skill augmentation of the DSMDP with finite , and a probability distribution over solvable states. Then
(9) |
We can use Theorem 4.2 to understand the effect that the expressivity of skills has on their ability to improve $p$-learning difficulty. (Footnote 2: See Section F.2 for a more formal treatment where the incompressibility measure in Theorem 4.2 is replaced with one defined explicitly in terms of a quantitative measure of expressivity.) More expressive skills can encode more diverse behavior and thus allow a larger number of action sequences to be encoded as the same skill. This allows states to share solutions more often, which decreases the merged entropy and hence the $\mathcal{A}_+$-merged $p$-incompressibility. As a result, the lower bound on the $p$-learning difficulty ratio decreases. As concrete examples, if we place no restriction on what kinds of skills are allowed, then we can simply include a single skill that solves all solvable states, resulting in and . This is less than whenever , which is true for all RL environments of practical interest. If a skill is allowed to be a concrete sequence of actions and loops of actions, then states whose solutions involve different numbers of repetitions of the same component will have the same solution containing a skill with a loop whose body is that component. Thus, but is larger than the value of zero obtained when no restriction is placed on skills. Finally, if skills are restricted to macroactions, then distinct solutions remain distinct after rewriting with macroactions, and so the $\mathcal{A}_+$-merged $p$-incompressibility achieves its maximum value. In solution-separable environments, this maximum value is equal to the unmerged $p$-incompressibility (Definition 4.3), in which case Theorem 4.2 can be restated in terms of it (Corollary 4.4).
Definition 4.3.
Let be a DSMDP with finite and a distribution over solvable states. The unmerged -incompressibility is defined as
(10) |
where the -discounted unmerged -incompressibility
(11) |
It measures incompressibility on a scale from 0 to 1 if is solution-separable. Furthermore, unlike the -merged -incompressibility, it is a function of only and and is thus a general measure of the incompressibility of .
Corollary 4.4 (Corollary to Theorem 4.2).
In the setup to Theorem 4.2, suppose $\mathcal{M}$ is solution-separable (Footnote 3: See Section F.3 for the version of this corollary that does not assume solution-separability.) and $\mathcal{M}_+$ is a macroaction augmentation. Then
(12) |
A direct consequence of the above corollary is that there exist environments where incorporating macroactions will always worsen $p$-learning difficulty, no matter how many there are or what they are.
Corollary 4.5 (Corollary to Corollary 4.4).
In the setup to Theorem 4.2, suppose is solution-separable and is a strict macroaction augmentation. If
then .
5 Effect of Skills on Exploration
To study the properties of a DSMDP that make exploration difficult, we have derived a tight lower bound on the -exploration difficulty of a DSMDP in terms of the entropy of and a term representing how dense solutions to states are in the space of all solutions (Theorem 5.2).
Definition 5.1.
Let be a DSMDP with finite action space . For , the -discounted solution density of is defined as
(13) |
where
(14) |
is the probability that a uniformly random action sequence with length sampled from solves .
Theorem 5.2.
Let be the -skill augmentation of the DSMDP with a finite action space, and a probability distribution over solvable states. Then for ,
(15) |
Furthermore, if the state space is finite and , then for any , there exists an -skill augmentation of such that
(16) |
thus showing that the lower bound given above is tight for all finite DSMDPs and a large range of .
The fact that the lower bound grows with is intuitive: when there are many states that we care about learning to solve ( is large), it is hard for the agent to gather the experience needed to learn to solve all these states ( is large). However, incorporating skills only changes the action space and cannot affect . Skills thus improve exploration by increasing the -discounted solution density, which is interpreted as the density of solutions to states within the space of all action sequences. Action sequences of length equally divide a total density of , so that the combined density of all possible action sequences is 1. If is solution-separable, then , whereas if every action sequence solves some state, then . Skills improve exploration by increasing this density, similar to how skills reduce -merged -incompressibility by allowing more states to share solutions. More expressive skills are more apt at increasing solution density. For example, introducing macroactions in a solution-separable environment results in a solution-separable environment, so the density remains at most 1. If we introduce the logic of loops, then states whose solutions involve different repetitions of the same component can be solved by the same action sequence containing a loop skill, hence increasing the density. In the extreme case where no restriction is placed on the kind of skills allowed, we can introduce many skills, each of which automatically solves all solvable states. The resultant density is approximately , which is usually much larger than 1.
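The density argument in this paragraph can be checked on a toy example. The sketch below is a rough illustration under stated assumptions: it assumes a sequence of length $n$ is drawn with probability $\delta(1 - \delta)^n$ and uniformly over actions, counts a sequence as solving a state only if it reaches the goal exactly at its last step, and truncates the sum at `max_len`, so it slightly underestimates the density. The corridor, the unrestricted `solve` skill, and all function names are hypothetical.

```python
from itertools import product

def run(state, seq, step, goal):
    """Apply `seq` from `state`; return None if the goal is reached before the end."""
    for action in seq:
        if state == goal:
            return None
        state = step(state, action)
    return state

def solution_density(states, actions, step, goal, delta, max_len=8):
    """Truncated sum over states of P(a random geometric-length sequence solves the state)."""
    total = 0.0
    for n in range(1, max_len + 1):
        weight = delta * (1 - delta) ** n / len(actions) ** n  # mass of one length-n sequence
        for seq in product(actions, repeat=n):
            total += weight * sum(1 for s in states
                                  if s != goal and run(s, seq, step, goal) == goal)
    return total

# Hypothetical corridor with fatal vacuous moves, which makes transitions invertible.
states = [0, 1, 2, 3, "dead"]
goal = 3

def step(state, action):
    if state == "dead":
        return "dead"
    if action == "solve":  # unrestricted skill: every solvable state jumps to the goal
        return goal
    if action == "R":
        return state + 1
    return "dead" if state == 0 else state - 1

base = solution_density(states, ["L", "R"], step, goal, delta=0.1)
plus = solution_density(states, ["L", "R", "solve"], step, goal, delta=0.1)
print(base, plus)
```

In the base corridor each action sequence solves at most one state, so the density stays below 1; the `solve` skill lets a single sequence solve several states at once, which is what raises the density.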
As a corollary to Theorem 5.2, increase in -exploration difficulty due to macroactions is lower-bounded by the -discounted unmerged -incompressibility (Equation 11) in solution-separable environments, thus providing the -exploration difficulty counterpart to Corollary 4.4.
Corollary 5.3 (Corollary to Theorem 5.2).
In the setup to Theorem 5.2, suppose is solution-separable, , and is a macroaction augmentation. Then
(17) |
Compared to Corollary 4.4, the factor penalizing large is absent, and the in has been removed. The resultant weaker bound suggests that skills are better suited to improving exploration than learning from experience. This is made more precise in Theorems 5.4 and 5.5 below, but before stating these results, we shall first give an intuitive explanation for why this is the case.
In discussing the effects of skills on learning from existing experience, there was a tradeoff between action space size and reducing solution lengths. Intuitively, while skills allow reward information to propagate to states faster, a larger action space means a larger number of experiences to iterate through to effectively cover the space of all state-action pairs $(s, a)$. Such a tradeoff is not so clear in the effects of skills on exploration. To improve exploration, skills are chosen so that a uniformly random policy in the augmented action space is more likely to reach the goal. If skills are expressive enough, this should always be possible, unless the base action space is already close to optimal. Of course, the most general skills trivially improve $p$-exploration difficulty by simply mapping every solvable state to the goal, which gives . But there can be skills that achieve the maximum possible $\mathcal{A}_+$-merged $p$-incompressibility (which appears in the lower bound for $p$-learning difficulty increase in Theorem 4.2) but still decrease $p$-exploration difficulty. This is made precise by the following theorem.
Theorem 5.4.
Let be a solution-separable DSMDP with finite as well as finite . Let be a probability distribution over solvable states. For all for which , there exists an -skill augmentation of such that:
-
•
There exist distinct shortest solutions in to all states in the support of (namely, achieves its maximum possible value and thus achieves its maximum possible value );
-
•
.
Corollary 5.5 (Corollary to Theorem 5.4).
Corollary 5.5 shows that there are environments where skills can benefit exploration but harm learning from experience. This again suggests that skills are more apt at improving exploration than learning.
As a final discussion on the effect that skills have on exploration, we answer the question: are there environments where unexpressive skills like macroactions always harm exploration? Unlike Corollary 4.4, there is no penalty factor in the lower bound given in Corollary 5.3. As a result, there is no environment where the lower bound is above 1, which would have implied that all macroaction augmentations increase -exploration difficulty. Nevertheless, the answer to the question is still affirmative. The following two theorems construct environments where incorporating macroactions always increases -exploration difficulty, no matter how many there are or what they are.
Theorem 5.6.
Let be a solution-separable DSMDP with a finite action space such that any state that has a length-1 solution only has length-1 solutions. Let be a probability distribution over solvable states. Suppose that and
Then for any strict macroaction augmentation of .
Theorem 5.7.
Let be a solution-separable DSMDP with a finite action space such that: 1) every action sequence is the solution to some state; 2) for every solvable state, all solutions to that state have the same length. Let be a probability distribution over solvable states such that for any whose solutions have the same length. Then
(18) |
for any strict -macroaction augmentation of .
A stronger version of this theorem (Section F.4) relaxes the conditions on and and the modified bound involves subtracting a corresponding KL-divergence term.
Stated in words, Theorem 5.6 says that macroactions harm exploration when most action sequences are solutions to some state and that a state’s assigned importance is close to the probability that a uniformly random action sequence solves it. Theorem 5.7 suggests that it suffices for to be roughly proportional to this probability across states whose solutions have the same length. These results make more precise our intuition that it is more difficult to use skills to improve exploration in environments where solutions to states look uniformly randomly distributed.
6 Experiments
Corollaries 4.4 and 5.3 suggest that solution-separable DSMDPs with lower unmerged -incompressibility can benefit more from macroactions. We test this prediction on the four environments studied in Section 3.3, which include both solution-separable (RubiksCube222) and non-solution-separable (CliffWalking, CompILE2, 8Puzzle) DSMDPs. For different complexity measures (-learning difficulty, -exploration difficulty, and sample complexity of four RL algorithms), Figure 1 shows the best complexity improvement ratio across the 31 (strict) macroaction augmentations of each base environment against the unmerged -incompressibility of the base environment. We observe a positive correlation regardless of the choice of and RL algorithm, thus corroborating our theoretical predictions: macroactions are more helpful in environments with lower unmerged -incompressibility.
While the definition of unmerged -incompressibility is motivated in the context of macroactions (Corollaries 4.4 and 5.3), experiments with general stochastic options discovered by LOVE (Jiang et al., 2022) show that it successfully captures the difficulty of applying HRL with general options in an environment. Table 2 shows the unmerged -incompressibility values of our four environments, along with the sample complexity improvement ratio from optionally applying HRL with options discovered by LOVE. The improvement from HRL decreases as the unmerged -incompressibility increases.
7 -Incompressibility for Skill Learning
Appendix G demonstrates two ways to use our incompressibility measures to derive objectives for skill learning. We show that, under mild approximations, these two objectives are equivalent to two minimum description length (MDL) objectives previously used in the skill learning literature. In particular, finding the that minimizes -merged -incompressibility corresponds to the objective used by LOVE (Jiang et al., 2022), and finding the skills such that the resultant skill-augmented environment has the highest unmerged -incompressibility corresponds to the objective used by LEMMA (Li et al., 2022).
Environment | HRL sample complexity improvement ratio | Unmerged incompressibility |
---|---|---|
CliffWalking | 0.000007 ± 0.000007 | 0.0000 |
CompILE2 | 0.00023 ± 0.00011 | 0.1475 |
8Puzzle | 0.64 ± 0.19 | 0.5157 |
RubiksCube222 | 0.73 ± 0.17 | 0.8072 |
8 Conclusion
We introduce the first theoretical analysis of the utility of RL skills, focusing on deterministic sparse-reward MDPs. With both theoretical motivation and empirical verification, we introduce metrics that quantify two aspects of RL complexity: exploration and learning from experience. We show both theoretically and experimentally that these metrics can be improved more in environments where solutions to states are more compressible. Further theoretical results suggest that skills benefit exploration more than learning from experience, and that less expressive skills are less beneficial to improving RL sample efficiency. Our work is a first step towards characterizing the properties of an environment that make skills helpful for RL, and we expect future theoretical work to generalize beyond deterministic sparse-reward MDPs with finite action spaces.
Acknowledgements
We thank anonymous referees for useful suggestions and discussions, as well as instructors of the MIT Advanced Undergraduate Research Opportunities Program (SuperUROP) for suggestions on presentation.
This work was funded by U.S. National Science Foundation (NSF) awards #1918771 and #1918839. In addition, ZL was supported by the MIT Advanced Undergraduate Research Opportunities Program (SuperUROP) and GP was supported by the Stanford Interdisciplinary Graduate Fellowship (SIGF).
Impact Statement
This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
References
- Agostinelli et al. (2019) Agostinelli, F., McAleer, S., Shmakov, A., and Baldi, P. Solving the Rubik’s cube with deep reinforcement learning and search. Nature Machine Intelligence, 1(8):356–363, 2019.
- Auer et al. (2008) Auer, P., Jaksch, T., and Ortner, R. Near-optimal regret bounds for reinforcement learning. Advances in Neural Information Processing Systems, 21, 2008.
- Bacon et al. (2017) Bacon, P.-L., Harb, J., and Precup, D. The option-critic architecture. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
- Badia et al. (2020) Badia, A. P., Piot, B., Kapturowski, S., Sprechmann, P., Vitvitskyi, A., Guo, Z. D., and Blundell, C. Agent57: Outperforming the Atari human benchmark. In International Conference on Machine Learning, pp. 507–517. PMLR, 2020.
- Bagaria & Konidaris (2019) Bagaria, A. and Konidaris, G. Option discovery using deep skill chaining. In International Conference on Learning Representations, 2019.
- Barreto et al. (2019) Barreto, A., Borsa, D., Hou, S., Comanici, G., Aygün, E., Hamel, P., Toyama, D., Mourad, S., Silver, D., Precup, D., et al. The option keyboard: Combining skills in reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.
- Bellemare et al. (2013) Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. The Arcade Learning Environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.
- Bellman (1957) Bellman, R. A Markovian decision process. Journal of Mathematics and Mechanics, pp. 679–684, 1957.
- Choi (1994) Choi, K. P. On the medians of gamma distributions and an equation of Ramanujan. Proceedings of the American Mathematical Society, 121(1):245–251, 1994.
- Conserva & Rauber (2022) Conserva, M. and Rauber, P. Hardness in Markov decision processes: Theory and practice. Advances in Neural Information Processing Systems, 35:14824–14838, 2022.
- Cover (1994) Cover, T. Information theory and statistics. In Proceedings of 1994 Workshop on Information Theory and Statistics, pp. 2. IEEE, 1994.
- Ellis et al. (2019) Ellis, K., Nye, M., Pu, Y., Sosa, F., Tenenbaum, J., and Solar-Lezama, A. Write, execute, assess: Program synthesis with a REPL. Advances in Neural Information Processing Systems, 32, 2019.
- Fawzi et al. (2022) Fawzi, A., Balog, M., Huang, A., Hubert, T., Romera-Paredes, B., Barekatain, M., Novikov, A., Ruiz, F. J. R., Schrittwieser, J., Swirszcz, G., et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47–53, 2022.
- He et al. (2011) He, R., Brunskill, E., and Roy, N. Efficient planning under uncertainty with macro-actions. Journal of Artificial Intelligence Research, 40:523–570, 2011.
- Hukmani et al. (2021) Hukmani, K., Kolekar, S., and Vobugari, S. Solving twisty puzzles using parallel Q-learning. Engineering Letters, 29(4), 2021.
- Jiang et al. (2022) Jiang, Y., Liu, E., Eysenbach, B., Kolter, J. Z., and Finn, C. Learning options via compression. Advances in Neural Information Processing Systems, 35:21184–21199, 2022.
- Kakade & Langford (2002) Kakade, S. and Langford, J. Approximately optimal approximate reinforcement learning. In Proceedings of the Nineteenth International Conference on Machine Learning, pp. 267–274, 2002.
- Kaliszyk et al. (2018) Kaliszyk, C., Urban, J., Michalewski, H., and Olšák, M. Reinforcement learning of theorem proving. Advances in Neural Information Processing Systems, 31, 2018.
- Kipf et al. (2019) Kipf, T., Li, Y., Dai, H., Zambaldi, V., Sanchez-Gonzalez, A., Grefenstette, E., Kohli, P., and Battaglia, P. CompILE: Compositional imitation learning and execution. In International Conference on Machine Learning, pp. 3418–3428. PMLR, 2019.
- Li et al. (2022) Li, Z., Poesia, G., Costilla-Reyes, O., Goodman, N., and Solar-Lezama, A. LEMMA: Bootstrapping high-level mathematical reasoning with learned symbolic abstractions. NeurIPS’22 MATH-AI Workshop, 2022.
- Machado et al. (2017) Machado, M. C., Bellemare, M. G., and Bowling, M. A Laplacian framework for option discovery in reinforcement learning. In International Conference on Machine Learning, pp. 2295–2304. PMLR, 2017.
- Maillard et al. (2014) Maillard, O.-A., Mann, T. A., and Mannor, S. “How hard is my MDP?” The distribution-norm to the rescue. Advances in Neural Information Processing Systems, 27, 2014.
- Mankowitz et al. (2023) Mankowitz, D. J., Michi, A., Zhernov, A., Gelmi, M., Selvi, M., Paduraru, C., Leurent, E., Iqbal, S., Lespiau, J.-B., Ahern, A., et al. Faster sorting algorithms discovered using deep reinforcement learning. Nature, 618(7964):257–263, 2023.
- Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Nayyar et al. (2023) Nayyar, R. K., Verma, S., and Srivastava, S. Learning generalizable symbolic options for transfer in reinforcement learning. In NeurIPS 2023 Workshop on Generalization in Planning, 2023.
- Ouyang et al. (2022) Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- Pedersen et al. (2016) Pedersen, M. R., Nalpantidis, L., Andersen, R. S., Schou, C., Bøgh, S., Krüger, V., and Madsen, O. Robot skills for manufacturing: From concept to industrial deployment. Robotics and Computer-Integrated Manufacturing, 37:282–291, 2016.
- Poesia et al. (2021) Poesia, G., Dong, W., and Goodman, N. Contrastive reinforcement learning of symbolic reasoning domains. Advances in Neural Information Processing Systems, 34:15946–15956, 2021.
- Simchowitz & Jamieson (2019) Simchowitz, M. and Jamieson, K. G. Non-asymptotic gap-dependent regret bounds for tabular MDPs. Advances in Neural Information Processing Systems, 32, 2019.
- Sutton & Barto (2018) Sutton, R. S. and Barto, A. G. Temporal difference learning. In Reinforcement Learning: An Introduction, chapter 6. MIT Press, 2018.
- Sutton et al. (1999) Sutton, R. S., Precup, D., and Singh, S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2):181–211, 1999.
- Todorov et al. (2012) Todorov, E., Erez, T., and Tassa, Y. MuJoCo: A physics engine for model-based control. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 5026–5033. IEEE, 2012.
- Watkins (1989) Watkins, C. J. C. H. Learning from delayed rewards. 1989.
- Williams (1992) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
- Wu et al. (2021) Wu, M., Norrish, M., Walder, C., and Dezfouli, A. TacticZero: Learning to prove theorems from scratch with deep reinforcement learning. Advances in Neural Information Processing Systems, 34:9330–9342, 2021.
Appendix A Survey on Existing RL Difficulty Metrics
Here, we provide a brief survey on existing RL difficulty metrics and explain why they are inadequate for our purposes. See Conserva & Rauber (2022) for a more detailed survey and benchmark. We will be using the notation for an MDP with state space , action space , transition kernel , and reward kernel .
-
•
The environmental value norm of the optimal policy (Maillard et al., 2014) is given by
(19) where is the transition kernel of the MDP and is the value function of the optimal policy with discount factor . The variation in the values of next states quantifies the difficulty in obtaining accurate sample estimates of action values. However, in deterministic MDPs, which are our focus, the environmental value norm of the optimal policy is always zero and is therefore not applicable.
-
•
The distribution mismatch coefficient (Kakade & Langford, 2002) is given by
(20) where is the stationary distribution of the Markov chain induced by policy and is the stationary distribution of the Markov chain induced by the optimal policy. It measures how much the stationary distribution of states visited by the agent can differ from the optimal distribution. It is defined only for ergodic MDPs (otherwise the stationary distribution may not be uniquely defined) in the continuous setting, whereas we focus on deterministic MDPs (which are not ergodic when ) in the episodic setting.
-
•
The sum of reciprocals of suboptimality gaps (Simchowitz & Jamieson, 2019) is given by
(21) where and are the state and action value functions of the optimal policy. Larger allows the agent to more easily distinguish suboptimal actions from the optimal action and can thus reduce average total regret in the long run. However, as Conserva & Rauber (2022) point out, smaller makes it easier to find a near-optimal policy, which contributes to decreasing the sample complexity.
-
•
The diameter (Auer et al., 2008) is defined to be
(22) where denotes the expected time to reach starting in following policy . While this is defined for the continuous setting, a natural definition for the diameter of a DSMDP in the episodic setting would be
(23) where denotes the length of a shortest solution to . However, taking the supremum is overly pessimistic, and in many cases, there may be states that are far from the goal but that we do not care about solving. Our -learning difficulty takes this into account by using a weighted average of , multiplied by to take into account the additional sample complexity due to a large action space.
Appendix B Environments
Experiments were conducted on 4 base environments of varying complexity:
-
•
CliffWalking (Sutton & Barto, 2018), a toy grid world environment of size 4×12 where the agent always begins in the bottom left corner and has to travel to the bottom right corner. The available actions are moving one step in each of the 4 cardinal directions. The agent returns to its original position whenever it touches a square in the bottom row other than the leftmost and rightmost squares.
-
•
CompILE2 is one of the CompILE grid world environments (Kipf et al., 2019). The agent navigates in a 10×10 grid world with walls both lining the edges and within the grid. The world also has several objects of different kinds, possibly with several of each kind. The agent’s goal is to pick up several specified (kinds of) objects in order. In CompILE2, the agent has to pick up 2 objects. The available actions are moving one step in each of the 4 cardinal directions in addition to attempting to pick up the object in the current cell. The positions and types of the objects are fixed but the agent’s position is randomized at every reset, following Jiang et al. (2022). We did not choose 3 or more objects for the agent to pick up because we found that the agent could not find the positive reward signal without suitable skills in these cases, consistent with previous findings on the same environment (Kipf et al., 2019; Jiang et al., 2022). Since whether the goal is reached depends on the sequence of objects the agent has picked up, the state includes both the grid and the sequence of objects that the agent has picked up thus far. Since Kipf et al. (2019) did not publish the source code for the environment, we use the implementation by Jiang et al. (2022).
Because there can be several of the same kind of object on the grid, there are different sequences of objects the agent can pick up that amount to the same sequence of kinds of objects. There are thus multiple goal states, which are merged into one to comply with the definition of a DSMDP.
-
•
8Puzzle is the 8-puzzle, the 3×3 version of the more well-known 15-puzzle. There are 8 tiles numbered 1 to 8 on a 3×3 board so that there is one tile missing. The available actions are moving the position of the missing tile in each of the four cardinal directions. The solved state has the numbers 1 to 8 in order from left to right, top to bottom. The puzzle is scrambled from the solved state by applying a random legal action a number of times drawn uniformly between 1 and 31. Here, 31 is the maximum distance from any state to the goal state. The puzzle is re-scrambled if the scramble solves the puzzle.
-
•
RubiksCube222 is the 2x2 Rubik’s cube, also called the pocket cube. The available actions are turning the front, right, or top faces clockwise by 90°. The cube is scrambled by applying a random sequence of moves whose length is uniform between 1 and 11, where each move turns the front, right, or top face 90° clockwise, 180°, or 90° counterclockwise, and no two consecutive moves turn the same face. (Note that the action space used for scrambling is larger than the action space of the agent.) Here, 11 is the maximum distance from a state to the solved state. We use the implementation provided by Hukmani et al. (2021).
For 8Puzzle and RubiksCube222, our choice of sampling the scramble length uniformly from 1 to some maximum follows Agostinelli et al. (2019).
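To make the CliffWalking dynamics described above concrete, the following is a minimal sketch of the transition function rather than the implementation used in the experiments. It assumes a 4-by-12 grid indexed as (row, column) with row 0 at the top, and it assumes that a move off the grid leaves the agent in place; the names `step`, `run`, and `solves` are hypothetical.

```python
# Minimal sketch of the CliffWalking dynamics described in the text.
# Assumptions (not from the source): (row, col) indexing with row 0 at the
# top, and moves that would leave the grid keep the agent where it is.
ROWS, COLS = 4, 12
START = (ROWS - 1, 0)          # bottom-left corner
GOAL = (ROWS - 1, COLS - 1)    # bottom-right corner
MOVES = {"U": (-1, 0), "R": (0, 1), "D": (1, 0), "L": (0, -1)}


def step(state, action):
    """Apply one base action and return the next state."""
    d_row, d_col = MOVES[action]
    row = min(max(state[0] + d_row, 0), ROWS - 1)
    col = min(max(state[1] + d_col, 0), COLS - 1)
    # Bottom-row squares other than the two corners form the cliff:
    # touching one sends the agent back to its original position.
    if row == ROWS - 1 and 0 < col < COLS - 1:
        return START
    return (row, col)


def run(actions, state=START):
    """Apply a sequence of base actions (for example a macroaction) in order."""
    for action in actions:
        state = step(state, action)
    return state


def solves(actions):
    """Check whether an action sequence takes the start state to the goal."""
    return run(actions) == GOAL
```

Under these assumptions, the sequence of one up move, eleven right moves, and one down move reaches the goal, while any move onto the cliff resets the agent to the start.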
Basic information about the 4 base environments is summarized in Table 3.
Environment | Number of actions | Number of states | Number of possible starting states |
---|---|---|---|
CliffWalking | 4 | 32 | 1 |
CompILE2 | 5 | 115,462 | 59 |
8Puzzle | 4 | 362,880 | 181,439 |
RubiksCube222 | 3 | 3,674,160 | 3,674,159 |
For each base environment, one of the 32 action space variants is just the base environment itself. The remaining 31 are (strict) macroaction augmentations generated as follows:
-
•
For CliffWalking, the LEMMA abstraction algorithm (Li et al., 2022) found one single macroaction from the offline trajectory data generated using breadth-first search (BFS). That single macroaction is just the shortest sequence of actions that solves the only possible starting state of the environment: (U = up, R = right, D = down, L = left)
-
–
URRRRRRRRRRRD
5 other sets of macroactions were derived from subsequences of near-optimal solutions to the starting state:
-
–
RR
-
–
RR, RRRR, RRRRRRRR
-
–
RRRRRRRRRRR
-
–
UUURRRR, RRR, DRDRD
-
–
URRRRRRRRRRR, RRRRRRRRRRRD
Furthermore, for each , we randomly generated 5 sets of distinct macroactions. A random macroaction with length () was generated as follows:
-
–
With probability 0.4, randomly choose between U and R with probabilities 0.3 and 0.7;
-
–
With probability 0.3, randomly choose between R and D with probabilities 0.7 and 0.3;
-
–
With probability 0.1, randomly choose between D and L with probabilities 0.7 and 0.3;
-
–
With probability 0.2, randomly choose between L and U with probabilities 0.3 and 0.7.
We did not choose probabilities uniform across all directions because this results in several sets of macroactions that cause the agent to drift leftward or downward during random exploration, so that the agent almost never receives any positive reward signal. However, it was also the presence of drift that helped us generate variety in the learnability of the macroaction-augmented environments. Variation in the direction of the drift across different sets of macroactions resulted in sample efficiencies that varied across 7 orders of magnitude.
-
•
For CompILE2, LEMMA discovered the following set of macroactions: (L = left, U = up, R = right, D = down, P = pick up)
-
–
PUURRRP, LL, UU, DD
5 other sets of macroactions were derived from subsequences of subsets of these macroactions:
-
–
LL, UU, DD
-
–
LL, UU, RRR, DD
-
–
PUU, RRRP
-
–
PUURRRP
-
–
PUURRRP, LL, UU, RRR, DD
Furthermore, for each , we randomly generated 5 sets of distinct macroactions. A random macroaction with length () was generated as follows:
-
–
With probability 1/4, randomly choose among L, U and P with probabilities 0.4, 0.4 and 0.2;
-
–
With probability 1/4, randomly choose among U, R and P with probabilities 0.4, 0.4 and 0.2;
-
–
With probability 1/4, randomly choose among R, D and P with probabilities 0.4, 0.4 and 0.2;
-
–
With probability 1/4, randomly choose among D, L and P with probabilities 0.4, 0.4 and 0.2.
-
•
For 8Puzzle, LEMMA discovered the following set of macroactions: (U = up, R = right, D = down, L = left)
-
–
RD, LDR
5 other sets of macroactions were derived from subsets of these macroactions, possibly with reflection across the diagonal (a symmetry of the puzzle):
-
–
RD
-
–
LDR
-
–
RD, DR
-
–
LDR, URD
-
–
RD, DR, LDR, URD
Furthermore, for each , we randomly generated 5 sets of distinct macroactions. A random macroaction with length () was generated by sampling from U, R, D, L with probabilities 0.2, 0.3, 0.3, 0.2. The higher probabilities for R and D are intended to encourage moving the position of the missing tile towards the bottom-right corner.
-
•
For RubiksCube222, LEMMA generated the empty set. However, the 3 top-scoring macroactions were: (F = front face , R = right face , U = top face )
-
–
FF, RR, UU
5 other sets of macroactions were derived from subsets of these macroactions, possibly with more repetition of some base action:
-
–
FF
-
–
FF, FFF
-
–
FF, RR
-
–
FF, FFF, RR, RRR
-
–
FF, FFF, RR, RRR, UUU
Note that FF, RR, UU are half-turns of faces (denoted F2, R2, U2 in standard cube notation) and FFF, RRR, UUU are counter-clockwise turns (usually denoted F′, R′, U′).
Furthermore, for each , we randomly generated 5 sets of distinct macroactions. A random macroaction with length () was generated by sampling from F, R, U each with probability 1/3.
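The random macroaction generation described for each environment above can be sketched as sampling from a mixture of action distributions. This is an illustrative sketch, not the authors' code: it assumes that one mixture component is drawn per macroaction and that every base action of that macroaction is then drawn independently from the chosen component, and it leaves the macroaction length to a caller-supplied `length_sampler`, since the length distribution is not reproduced here. The function names are hypothetical; the CliffWalking probabilities are those listed above.

```python
import random

# Mixture for CliffWalking as listed in the text:
# (component probability, {base action: probability within the component}).
CLIFFWALKING_MIXTURE = [
    (0.4, {"U": 0.3, "R": 0.7}),
    (0.3, {"R": 0.7, "D": 0.3}),
    (0.1, {"D": 0.7, "L": 0.3}),
    (0.2, {"L": 0.3, "U": 0.7}),
]


def sample_macroaction(mixture, length, rng):
    """Sample one macroaction of the given length.

    Assumption: the component is chosen once per macroaction, and each base
    action is then drawn i.i.d. from that component.
    """
    component_weights = [weight for weight, _ in mixture]
    _, dist = rng.choices(mixture, weights=component_weights, k=1)[0]
    actions = list(dist)
    probs = [dist[a] for a in actions]
    return "".join(rng.choices(actions, weights=probs, k=length))


def sample_macroaction_set(mixture, num_macroactions, length_sampler, rng):
    """Sample a set of distinct macroactions by rejecting duplicates."""
    macroactions = set()
    while len(macroactions) < num_macroactions:
        length = length_sampler(rng)  # hypothetical length distribution
        macroactions.add(sample_macroaction(mixture, length, rng))
    return sorted(macroactions)
```

The environments with a single action distribution (8Puzzle and RubiksCube222) fit the same interface as one-component mixtures, for example `[(1.0, {"F": 1/3, "R": 1/3, "U": 1/3})]`.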
Appendix C Experimental Details
C.1 Hyperparameters
-
•
The learning rate is for Q-learning, value iteration and REINFORCE, and for DQN.
-
•
For the off-policy RL algorithms (Q-learning, value iteration, and DQN), the optimal epsilon schedule for epsilon greedy can vary by orders of magnitude across different action space variants of the same base environment. We therefore adopt an adaptive epsilon-greedy exploration policy where the probability of choosing a random action starts at and is decreased by every time the agent beats its highest test reward so far by , until .
-
•
Testing was performed with episodes ( episode for CliffWalking, which only has one starting state) using the greedy policy (Q-learning, value iteration, DQN) or the current policy (REINFORCE). For the purposes of computing sample complexity, the at which a reward or value error threshold is reached is computed by averaging over all values of where the reward/value error crosses above/below the threshold.
-
•
Experiments were run with a maximum of 100M environment steps. We applied early stopping with a test reward threshold of 0.95 (0.75 for RubiksCube222) and average value error threshold of 0.025 (0.1 for RubiksCube222).
-
•
The horizon is for all environments, including skill-augmented environments. In addition, to simulate a cost of applying too many base actions, we terminate an episode whenever the number of base actions reaches .
-
•
For Q-learning, value iteration, and DQN, the replay buffer size is and updates are performed once every episodes with a batch size of .
-
•
Details on the model architecture of DQN are given in Section C.3.
No extensive hyperparameter tuning was done as the purpose of our experiments was not to compare RL algorithms, but to compare the performance of one algorithm on different action space variants of the same base environment.
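The adaptive epsilon-greedy schedule listed among the hyperparameters can be sketched as follows. This is a sketch under stated assumptions, not the configuration used in the experiments: the numeric defaults are placeholders, the decrease is assumed to be subtractive, and the best test reward is assumed to start at 0. The class name `AdaptiveEpsilon` and its parameters are hypothetical.

```python
class AdaptiveEpsilon:
    """Epsilon that shrinks each time the best test reward improves enough.

    All numeric defaults are placeholders, not the values used in the paper.
    """

    def __init__(self, start=1.0, decrement=0.1, min_improvement=0.05, floor=0.1):
        self.epsilon = start
        self.decrement = decrement              # assumed subtractive decrease
        self.min_improvement = min_improvement  # margin over the best reward
        self.floor = floor                      # epsilon never goes below this
        self.best_reward = 0.0                  # assumed initial best reward

    def update(self, test_reward):
        """Record a new test reward and return the (possibly reduced) epsilon."""
        if test_reward >= self.best_reward + self.min_improvement:
            self.epsilon = max(self.floor, self.epsilon - self.decrement)
        # The running best is tracked even when the margin is not met.
        self.best_reward = max(self.best_reward, test_reward)
        return self.epsilon
```

Tying the schedule to test reward rather than to the step count is what lets one setting serve action space variants whose sample complexities differ by orders of magnitude.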
C.2 Computational Resources
Experiments were run on 28 NVIDIA GPUs (Quadro RTX 5000, GeForce GTX 1080 Ti, Tesla V100 SXM2 32GB, RTX 6000 Ada Generation). One experiment, which usually consisted of 32 runs of some RL algorithm on different macroaction augmentations of the same base environment, took anywhere from under a minute to about a week to finish. In total, all experiments were completed within one month.
C.3 Algorithm-Specific Details
-
•
Value iteration is modified to the RL setting in a way similar to Deep Approximate Value Iteration (DAVI) (Agostinelli et al., 2019). In DAVI, a state is chosen from some initial distribution and the value network is updated by minimizing the quadratic loss between the current state value and the Bellman update, thus requiring a forward pass that computes the values of all next states. In our version of value iteration, a state is chosen from the initial distribution and we apply a rollout of the epsilon-greedy policy. For each state in the rollout, we also compute all possible next states. Similar to Q-learning, these next states are stored along with in a replay buffer. When we sample a state (along with its next states) from the replay buffer, its value is updated in the direction of the Bellman update. Note that the fact that all possible next states are computed from each state in a rollout multiplies the number of environment steps taken by .
-
•
The policy in REINFORCE is parameterized directly by the logits. In other words, the weights are an matrix and .
-
•
The implementation of the deep neural net in DQN depends on the environment. A state embedding is first constructed from the input before being passed into a linear projection head that outputs the action values of a state .
-
–
In CliffWalking, the input is a length-3 multihot vector at every location of the 4-by-12 grid (hence a binary tensor). In each multihot vector, the 3 indices represent the player, goal, and cliff. The state embedding is constructed by passing the input through a 2-layer CNN with ReLU activation followed by a 2-layer MLP with ReLU activation. The CNN has a kernel size of 3 and padding of 1. The hidden dimension is 32 and the output embedding has dimension 16.
-
–
In CompILE2, the input has two components. The grid is represented as a length-12 multihot vector at every location of the 10-by-10 grid (hence a binary tensor). The 12 indices of each multihot vector represent the 10 types of objects, wall, and agent. The next object the agent has to pick up is represented as a length-10 one-hot vector. The grid is passed through a 2-layer CNN with ReLU activation followed by a 2-layer MLP with ReLU activation. The result is concatenated with an embedding of the next object the agent has to pick up and passed through a linear projection to form the final embedding of the observation. The CNN has a kernel size of 3 and padding of 0. The hidden dimension is 32 in the CNN layers and 128 in the MLP layers; the object embedding has dimension 16; the output embedding has dimension 128.
-
–
In 8Puzzle, the input is a length-9 onehot vector at every location of the 3-by-3 grid (hence a binary tensor) denoting the tile present at each location (or the absence thereof). The state embedding is constructed by passing the input through a 2-layer CNN with ReLU activation followed by a 2-layer MLP with ReLU activation. The CNN has a kernel size of 3 and padding of 1. The hidden dimension is 32 and the output dimension is 32.
-
–
In RubiksCube222, the input is a length-6 multihot vector for each of tiles of the cube (hence a binary tensor) denoting the color of each tile. The state embedding is constructed by flattening the input and passing it through a 4-layer MLP with ReLU activation. The hidden dimension is 64 and the output dimension is 32.
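The direct logit parameterization of the REINFORCE policy mentioned in this list can be illustrated with a small tabular sketch. It assumes integer-indexed states and actions and a single undiscounted return per episode, which suits the sparse-reward setting but is an assumption here; the function names are hypothetical and this is not the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy(theta, state):
    """Action probabilities for a logit table theta[state][action]."""
    return softmax(theta[state])


def reinforce_update(theta, episode, episode_return, lr):
    """One REINFORCE step for an episode given as [(state, action), ...].

    Uses the softmax identity d log pi(a|s) / d theta[s][b] = 1[a == b] - pi(b|s).
    Assumption: the same undiscounted return weights every step.
    """
    for state, action in episode:
        probs = policy(theta, state)
        for b in range(len(probs)):
            grad = (1.0 if b == action else 0.0) - probs[b]
            theta[state][b] += lr * episode_return * grad
```

With this parameterization, a state's logits change only when that state appears in the sampled episode, so there is no generalization across states.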
Appendix D Additional Empirical Tests of -Learning and -Exploration Difficulty
D.1 Empirically Verifying Lemma 3.1 for Motivating -Learning Difficulty
To test how well -learning difficulty captures learning from experience, we study the value iteration algorithm for planning with known transitions and rewards in a DSMDP. We consider two variants of value iteration: state value iteration for learning the values of states (Bellman, 1957), and action value iteration for learning the values of state-action pairs. The latter is like Q-learning (Watkins, 1989) but modified to update the values of all state-action pairs at once. Instead of the original Bellman update, each update uses a linear interpolation between the old value and the new value given by the Bellman update with a learning rate of (see Equation 3).
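The interpolated update can be illustrated on the simplest case, a chain DSMDP in which every action moves one state closer to the goal. The sketch below assumes the undiscounted form of the update used in the proof of Lemma 3.1 (goal value fixed at 1, all other values initialized to 0); Equation 3 is not reproduced here, so the exact form, the function name `convergence_time`, and its parameters are assumptions.

```python
def convergence_time(distance, alpha, eps=0.01, max_iters=1_000_000):
    """Iterations until the state `distance` steps from the goal has value >= 1 - eps.

    Chain DSMDP: state 0 is the goal with value fixed at 1, and every action
    taken in state i >= 1 leads to state i - 1.
    """
    if distance == 0:
        return 0
    values = [1.0] + [0.0] * distance
    for t in range(1, max_iters + 1):
        # Synchronous update: every new value is computed from the old values,
        # interpolating between the old value and the Bellman target.
        values = [1.0] + [
            (1 - alpha) * values[i] + alpha * values[i - 1]
            for i in range(1, distance + 1)
        ]
        if values[distance] >= 1 - eps:
            return t
    raise RuntimeError("value did not converge within max_iters")
```

With a learning rate of 1 the convergence time equals the solution length exactly, and with smaller learning rates it still grows with the solution length, which is the relationship the correlation test measures.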
For each base environment, we test the correlation between average solution length and sample complexity on 32 macroaction augmentations of that environment. The results are summarized in Table 4. We find that the correlation between convergence time and average solution length is almost always greater than 0.9, with it occasionally being near-perfect (above 0.99).
Q-value iteration | 0.980 ± 0.001 | 0.934 ± 0.012 | 0.901 ± 0.013 | 0.942 ± 0.007 | |
---|---|---|---|---|---|
0.998 ± 0.000 | 0.977 ± 0.003 | 0.968 ± 0.006 | 0.989 ± 0.001 | ||
Value iteration | 0.977 ± 0.001 | 0.942 ± 0.005 | 0.902 ± 0.015 | 0.942 ± 0.007 | |
0.998 ± 0.000 | 0.984 ± 0.002 | 0.969 ± 0.005 | 0.985 ± 0.001 |
D.2 Arithmetic Mean Variant of -Exploration Difficulty Performs Worse Than the Geometric Mean
Table 5 shows the version of Table 1 where is redefined to be . (Up to a constant factor, estimates an upper bound on the sample complexity of the exploration stage of RL.) Comparing the results with Table 1, we find that with 3 exceptions (in 8Puzzle), all correlation values are no higher than those when the geometric mean is used. (The correlation values of CliffWalking are exactly equal across the two tables because this environment has only one possible starting state, as a result of which the arithmetic and geometric means are exactly equal.) This provides empirical validation for using the geometric mean as opposed to the arithmetic mean in our definition of -exploration difficulty.
Q-Learning | 0.947 ± 0.006 | 0.661 ± 0.049 | 0.301 ± 0.047 | 0.366 ± 0.081 | |
---|---|---|---|---|---|
0.953 ± 0.008 | 0.631 ± 0.061 | 0.442 ± 0.043 | 0.763 ± 0.019 | ||
Value iteration | 0.933 ± 0.009 | 0.724 ± 0.043 | 0.788 ± 0.042 | 0.247 ± 0.058 | |
0.951 ± 0.015 | 0.732 ± 0.035 | 0.877 ± 0.011 | 0.694 ± 0.021 | ||
REINFORCE | 0.949 ± 0.006 | 0.732 ± 0.039 | 0.715 ± 0.020 | 0.537 ± 0.139 | |
DQN | 0.789 ± 0.028 | 0.752 ± 0.075 | 0.621 ± 0.025 | 0.576 ± 0.023 |
Appendix E Proofs
Proof of Lemma 3.1.
(Note: This proof assumes log refers to the natural logarithm.)
For , simple induction on shows that, at time , the states with value are exactly those states that can be solved with actions or less, and all other states have value .
For the case, let’s first consider the case where the DSMDP is a chain of states where state 0 is the only goal state and for any action and non-goal state . Then the value iteration formula becomes for and . For , we can write this as a differential equation
for , and . (We have switched to subscript notation to make it clearer that this is a linear system of ODEs in time.) Solving the system with the initial conditions and for yields
Note that decreases in , i.e., at any time , states closer to the goal have higher value.
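As a numerical sanity check on the chain analysis, the sketch below assumes that, in a suitably rescaled time `tau`, the chain system takes the form dv_n/dtau = v_{n-1} - v_n with v_0 = 1 and v_n(0) = 0 for n ≥ 1. This form, and the omission of the constants relating `tau` to the iteration count and learning rate, are assumptions made for illustration. Under them, the solution is the Poisson tail probability P(X ≥ n) for X ~ Poisson(tau), which decreases in n. The function names are hypothetical.

```python
import math


def poisson_tail(n: int, tau: float) -> float:
    """P(X >= n) for X ~ Poisson(tau)."""
    term = math.exp(-tau)  # P(X = 0)
    cdf_below = 0.0        # accumulates P(X <= n - 1)
    for k in range(n):
        cdf_below += term
        term *= tau / (k + 1)
    return 1.0 - cdf_below


def integrate_chain(num_states: int, tau_end: float, steps: int) -> list:
    """Classical RK4 for dv_n/dtau = v_{n-1} - v_n (n >= 1), with v_0 held at 1."""

    def deriv(v):
        return [0.0] + [v[n - 1] - v[n] for n in range(1, len(v))]

    v = [1.0] + [0.0] * (num_states - 1)
    h = tau_end / steps
    for _ in range(steps):
        k1 = deriv(v)
        k2 = deriv([x + 0.5 * h * k for x, k in zip(v, k1)])
        k3 = deriv([x + 0.5 * h * k for x, k in zip(v, k2)])
        k4 = deriv([x + h * k for x, k in zip(v, k3)])
        v = [
            x + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for x, a, b, c, d in zip(v, k1, k2, k3, k4)
        ]
    return v


if __name__ == "__main__":
    numeric = integrate_chain(num_states=6, tau_end=4.0, steps=400)
    for n, value in enumerate(numeric):
        print(n, value, poisson_tail(n, 4.0))
```

The closed form follows from d/dtau P(X ≥ n) = P(X = n - 1) = P(X ≥ n - 1) - P(X ≥ n), so the numerical and analytical columns should agree up to integration error.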
If where and , then
which is less than for sufficiently large since .
Let . Then for and , we have
where the inequality marked (*) made use of the fact that the median of a Poisson distribution with positive integer rate is exactly (Choi, 1994).
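The Poisson median fact can be checked numerically. The sketch below takes the median to be the smallest k with P(X ≤ k) ≥ 1/2 and confirms that it equals the rate itself for small positive integer rates. The function name `poisson_median` is hypothetical, and the check illustrates the cited result rather than proving it.

```python
import math


def poisson_median(rate: int) -> int:
    """Smallest k with P(X <= k) >= 1/2 for X ~ Poisson(rate)."""
    term = math.exp(-rate)  # P(X = 0)
    cdf = term
    k = 0
    while cdf < 0.5:
        k += 1
        term *= rate / k    # P(X = k) from P(X = k - 1)
        cdf += term
    return k


if __name__ == "__main__":
    for rate in (1, 2, 5, 10, 20):
        print(rate, poisson_median(rate))
```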
We have thus shown that we need to obtain . In other words, the time until the value estimate is within of its true value of 1 is
(24) |
Now let’s return to the general graph setting. In this situation, the invariants are as follows:
• where is the next state on the shortest path from to any goal.
• where is the solution to the value function in the case of a simple chain, as we just derived.
These invariants are preserved by the fact that is non-increasing in . Thus, replacing with in the formula for the chain DSMDP (Equation 24) yields the result for the general DSMDP case. ∎
Proof of Theorem 4.2.
(Note: We use the version of the geometric distribution with support excluding 0.)
For , let , the probability that a random sequence of length with actions chosen uniformly from is exactly . Then is a probability distribution over , so
where denotes the cross entropy between and .
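To make the sequence distribution used in this proof concrete, the sketch below assumes that the sequence length is geometric on {1, 2, ...} with a hypothetical parameter `stop_prob`, and that each action is uniform over `num_actions` actions. It checks that the probabilities of all sequences up to a maximum length sum to the corresponding geometric mass, and that the cross entropy of an arbitrary distribution against this reference is at least its entropy (Gibbs' inequality). All names and the toy distribution `p` are illustrative.

```python
import itertools
import math


def sequence_prob(seq, num_actions: int, stop_prob: float) -> float:
    """Probability of drawing exactly `seq`: geometric length, uniform actions."""
    length = len(seq)
    length_prob = stop_prob * (1.0 - stop_prob) ** (length - 1)
    return length_prob * num_actions ** (-length)


def total_mass(num_actions: int, stop_prob: float, max_length: int) -> float:
    """Sum of sequence_prob over all sequences of length 1..max_length."""
    total = 0.0
    for length in range(1, max_length + 1):
        for seq in itertools.product(range(num_actions), repeat=length):
            total += sequence_prob(seq, num_actions, stop_prob)
    return total


def entropy(p: dict) -> float:
    """Shannon entropy (nats) of a distribution given as {sequence: probability}."""
    return -sum(w * math.log(w) for w in p.values() if w > 0)


def cross_entropy(p: dict, num_actions: int, stop_prob: float) -> float:
    """Cross entropy (nats) of p against the geometric-uniform reference."""
    return -sum(
        w * math.log(sequence_prob(seq, num_actions, stop_prob))
        for seq, w in p.items()
        if w > 0
    )


if __name__ == "__main__":
    print(total_mass(2, 0.3, 10), 1.0 - 0.7 ** 10)
    p = {(0,): 0.5, (0, 1): 0.25, (1, 1, 0): 0.25}
    print(entropy(p), cross_entropy(p, 2, 0.3))
```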
Now, fix any . Then
Thus,
The last inequality used the fact that gives
Now, we have
which completes the proof. ∎
Proof of Corollary 4.5.
Since , we have . The function is decreasing for , so
Then
as desired. ∎
Proof of Theorem 5.2.
(25) | ||||
where we have used the fact that is a normalized probability distribution.
Now, suppose the state space is finite and . According to Equation 25, we want to show that we can make arbitrarily small with a suitable choice of . Construct as follows. Let the number of skills be some large number . For each solvable state with , let skills send directly to the goal state and the remaining send back to itself, where . (For solvable states with , simply let all skills send back to itself.) Let’s now show that as for every solvable state .
is the probability that an action sequence with actions uniformly chosen from and length solves . Among all such action sequences, the total probability of those that have a base action is no more than the total probability of all action sequences that have a base action. The latter is given by
It now remains to show that the total probability of solutions to that consist only of skills approximates arbitrarily well as . For with , no such solutions exist and so their total probability is 0. For with ,
(as ) | ||||
By now, we have shown that as for every solvable state . Since is finite, this convergence is uniform, so the KL-divergence between and the normalized version of tends to zero as , as desired. ∎
Proof of Corollary 5.3.
Since is solution-separable and is a macroaction augmentation of , is also solution-separable. Thus, . By Theorem 5.2,
whereas
Thus,
as desired. ∎
Proof of Theorem 5.4.
The construction given in the proof of Theorem 5.2 allows us to make arbitrarily close to 0 and arbitrarily close to 1 with sufficiently large . Recalling Equation 25, this means that for any , the construction gives
for sufficiently large .
On the other hand, let be distributions defined on solvable states in addition to a dummy state such that and whenever , whereas and . (Note that because is solution-separable.) Then since . This gives
(26) | ||||
As a result, for sufficiently large , .
Now, let’s show that the construction in the proof of Theorem 5.2 can be made more precise to allow all states with to have distinct canonical shortest solutions in . Simply choose large enough so that, for all with , the number of skills that send directly to is at least the number of states with . Then the number of shortest solutions to every with is at least the number of such , so it is possible to choose one shortest solution for every such so that all the chosen solutions are distinct. ∎
Proof of Corollary 5.5.
Define as in Theorem 5.4, so that and Theorem 4.2 gives
(27) |
which is identical to Equation 12. Then the proof that the additional condition in the corollary implies is identical to the proof of Corollary 4.5. ∎
Proof of Theorem 5.6.
Augment the state space of with a state that is solved by every length-1 sequence that is not already the solution to any other state. (Furthermore, all actions that do not result in the goal state instead transition to a dead state.) Denote by the resultant DSMDP and for simplicity of notation we write for and for . Let denote the -macroaction augmentation of . Then the solutions to in are exactly the same as those in since macroactions always have length greater than 1. We will write to mean . Let be a distribution over the solvable states of so that on the solvable states of and .
As in the proof of Theorem 5.4, we define distributions over the solvable states in addition to a dummy state to be equal to whenever , whereas and are such that are normalized probability distributions.
First, let’s show that
(28) |
If is distance 1 away from the goal in , then
where denotes the number of solutions to in (or equivalently, ), all of which have length 1. Thus,
where the last inequality used the fact that .
We will now use Equation 28 to prove the theorem. By the triangle inequality,
Pinsker’s inequality says that for any two probability mass functions . Thus, if
then
and so
Now, by Equation 26, this is equivalent to , as desired. ∎
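Pinsker's inequality, used in the proof above, can be spot-checked numerically. The sketch below uses the standard form with the L1 distance and the KL divergence in nats, namely ||P - Q||_1 ≤ sqrt(2 KL(P || Q)). The exact normalization used in the proof (total variation versus L1, or a different logarithm base) may differ by constant factors, so this form is an assumption. The helper names are hypothetical.

```python
import math
import random


def kl_divergence(p, q) -> float:
    """KL(p || q) in nats for two probability mass functions on the same support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def l1_distance(p, q) -> float:
    """Sum of absolute differences between two probability mass functions."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


def random_pmf(rng: random.Random, size: int):
    """Random strictly positive probability mass function of the given size."""
    weights = [rng.random() + 1e-9 for _ in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def pinsker_holds(p, q, tol: float = 1e-12) -> bool:
    """Check ||p - q||_1 <= sqrt(2 * KL(p || q)) up to a small tolerance."""
    return l1_distance(p, q) <= math.sqrt(2.0 * kl_divergence(p, q)) + tol


if __name__ == "__main__":
    rng = random.Random(0)
    checks = [pinsker_holds(random_pmf(rng, 6), random_pmf(rng, 6)) for _ in range(1000)]
    print(all(checks))
```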
The proof of Theorem 5.7 is omitted as the stronger version and its proof are given in Section F.4.
Appendix F Additional Theoretical Results
F.1 Preliminary Results on Stochastic Environments
Here, we provide preliminary generalizations of our results for stochastic sparse-reward MDPs, which are DSMDPs (Definition 2.1) where the transition kernel may be stochastic (i.e., is now a distribution over ).
In a (possibly stochastic) sparse-reward MDP, let be the probability that taking actions starting in results in the goal state. For an ordering of all positive-length action sequences in non-decreasing length, define
where is the largest such that . As a result, .
Let’s redefine to be the weighted mean , so that -learning difficulty (Equation 4) and -merged -incompressibility (Equation 7) are now defined using this new notion of shortest solution length. Furthermore, in the definition of -merged -incompressibility (Equation 7), redefine to be so that . Note that the new definitions match the old definitions when the environment is deterministic. The stochasticity effectively spreads the responsibility of being a “shortest solution” over several short solutions whose success probabilities add up to 1.
Theorem F.1 (Generalization of Theorem 4.2).
Under the above redefinitions for stochastic sparse-reward MDPs, Equation 9 of Theorem 4.2 continues to hold.
Proof.
The proof is identical to that of the original Theorem 4.2. ∎
In stochastic environments, we can keep the original definition of -exploration difficulty (Equation 5) since the probabilistic definition of continues to make sense when there’s stochasticity. (As a reminder, it is the probability that a uniformly random policy that terminates with probability before each step solves .) Similarly, we keep the definition of -discounted solution density (Definition 5.1), which is also defined in terms of .
Theorem F.2 (Generalization of the first half of Theorem 5.2).
Under the above redefinitions for stochastic sparse-reward MDPs, Equation 15 of Theorem 5.2 continues to hold.
Proof.
The proof is identical to that of the original Theorem 5.2. ∎
F.2 Incorporating Skill Expressivity in Theorem 4.2
In Theorem F.5 below, we provide a version of Theorem 4.2 that eliminates the dependence of on and makes it depend explicitly on a quantitative measure of skill expressivity instead. This new measure of incompressibility (Equation 29), which we call -expressive -incompressibility, decreases in . This is expected as an environment is more compressible when the available skills are more expressive.
Definition F.3 (Quantifying skill expressivity).
With respect to a DSMDP , define the behavior variety expressivity of a skill to be , i.e., the number of distinct action sequences that can produce.
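As an illustration of this definition, the sketch below counts the distinct action sequences a deterministic skill produces across starting states in a toy five-state corridor. The environment, the modeling of a skill as a policy plus a termination test, and every name here (`step`, `rollout_skill`, `behavior_variety`, `run_right`, `macro_rr`) are hypothetical. Under this reading, a macroaction always has expressivity 1, whereas an option with a state-dependent termination condition can have more.

```python
from typing import Callable, Iterable, Tuple

NUM_STATES = 5  # toy corridor with states 0..4


def step(state: int, action: str) -> int:
    """Deterministic corridor dynamics: 'R' moves right, 'L' moves left, walls clamp."""
    if action == "R":
        return min(state + 1, NUM_STATES - 1)
    return max(state - 1, 0)


def rollout_skill(
    state: int,
    policy: Callable[[int], str],
    terminates: Callable[[int], bool],
    max_steps: int = 100,
) -> Tuple[str, ...]:
    """Action sequence produced by an option that acts at least once, then checks termination."""
    actions = []
    while True:
        action = policy(state)
        actions.append(action)
        state = step(state, action)
        if terminates(state) or len(actions) >= max_steps:
            return tuple(actions)


def behavior_variety(states: Iterable[int], skill: Callable[[int], Tuple[str, ...]]) -> int:
    """Number of distinct action sequences the skill produces over the given start states."""
    return len({skill(s) for s in states})


def run_right(state: int) -> Tuple[str, ...]:
    """Option: keep moving right until the right wall is reached."""
    return rollout_skill(state, lambda s: "R", lambda s: s == NUM_STATES - 1)


def macro_rr(state: int) -> Tuple[str, ...]:
    """Macroaction: the fixed sequence ('R', 'R') regardless of the start state."""
    return ("R", "R")


if __name__ == "__main__":
    states = range(NUM_STATES)
    print(behavior_variety(states, run_right), behavior_variety(states, macro_rr))
```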
Definition F.4 (-expressive -incompressibility).
For a DSMDP with finite , define its -expressive -incompressibility to be
(29) |
where the is taken over all choices of canonical (not necessarily shortest) solutions to all states. (Recall that, given a choice of canonical solutions to all states, is the sum over of all states that have as their canonical solution. As a result, and equality holds in solution-separable DSMDPs.) Note that expressivity occurs once in the denominator, so that larger results in smaller .
Theorem F.5 (Expressivity and -learning difficulty improvability).
Assuming the setup to Theorem 4.2, the following modified version of Equation 9 holds:
(30) |
where is the maximum behavior variety expressivity of a skill in the skill augmentation. Higher expressivity thus reduces incompressibility and allows skills to improve -learning difficulty more, as expected.
Proof.
Given any choice of canonical shortest solutions in , define the random variables and as follows. For , is the canonical solution to in , and is the same solution but with skills expanded into base actions. Then the distribution of is just , and let be the distribution of .
Note that
(31) |
Furthermore, since any can expand to at most different base action sequences,
(32) |
In addition, recall from the proof of Theorem 4.2 that, for any ,
(33) |
Thus, substituting Equations 32 and 33 into Equation 31 yields
where
since and . Thus,
which is true for all , as desired. ∎
F.3 Relaxing Solution-Separability Assumption in Corollary 4.4
Corollary F.6 (Generalization of Corollary 4.4).
Relaxing the solution-separability assumption, Corollary 4.4 holds if we replace in the definition of with . Here, is the distribution of canonical solutions to states sampled from , and the minimum is taken over all possible choices of canonical solutions. Thus, can be understood as the entropy of the state distribution if states with the same canonical solution are merged into one “super-state.”
Proof.
The result follows directly from Theorem F.5 by setting . ∎
F.4 Stronger Version of Theorem 5.7
Note: For notational simplicity, we will write to mean .
Before stating the stronger version of Theorem 5.7, we need to first define solution-length separations of state spaces.
Definition F.7.
For a DSMDP , let denote the set of solvable states. The solution-length separation of is the result of separating every solvable state into a set of sub-states corresponding to the lengths of solutions to . Formally, we write
Furthermore, for a sub-state of corresponding to solution length , we naturally define its solutions to be the length- solutions to . Formally,
where the is used to make it explicit that we’re applying the operation to sub-states.
Functions on defined using solutions to states can therefore be naturally extended to . For example, for is just , and
For an arbitrary function , there is a family of natural extensions to . Specifically, we say that is a solution-length-separated additive extension if, for all ,
and for . For example, as defined above is a solution-length-separated additive extension to .
Theorem F.8 (Generalization of Theorem 5.7).
Let be the -macroaction augmentation of the solution-separable DSMDP with a finite action space, and a probability distribution over solvable states. Then there exists a solution-length-separated additive extension to in such that
(34) |
Here, , where
is the total probability (under ) of sub-states with solution length and
is the probability that a uniformly random action sequence of length is a solution to . To make a normalized probability distribution, we introduce a dummy sub-state for each solution length with and . Note that because of solution-separability, and it is equal to zero when every action sequence of length is the solution to some state.
Proof of Theorem F.8.
For each solvable state , denote by the set of sub-states resulting from separating by solution length in the base environment. Define the solution-length-separated additive extension to such that for , or more precisely,
Here,
denotes the of the -macroaction augmentation of (the solution-length separation of with respect to ), and not the solution-length separation of (the -macroaction augmentation of ).
Then
by Jensen’s inequality, so
It thus suffices to lower-bound the latter. Let’s consider base solution lengths separately.
Fix some . Define to be the conditional distribution of on sub-states with solution length . In other words, if denotes the set of sub-states with solution length , then is a distribution over defined as
We write
(35) |
where denotes the set of sub-states with solution length , along with a dummy state for every length- action sequence that isn’t a solution to any state. Note that for each dummy state so that , whereas where is the action sequence assigned as the solution to . As usual for dummy states, we define .
Let’s first lower-bound the first term on the RHS of Equation 35:
(36) |
Let’s now upper-bound the sum in the second term on the RHS of Equation 35. We write
which is a function where is the number of macroactions of length divided by and is the maximum length of any macroaction. To see that this is a function of only and , notice that changing the number of macroactions of every length as well as the number of base actions by the same factor (which keeps all unchanged) will result in times more sequences such that and expands to base actions, whereas the summand is multiplied by a factor of for these sequences. The two factors cancel out, thus leaving the entire sum unchanged.
Now, let’s derive a recursive formula for where the are treated as parameters. To do this, we separate the sum over into cases depending on whether the first action in is a macroaction, and its length if yes. If the first action in is a base action, then the rest of expands to length , so the contribution to the sum is , where is the number of base actions divided by . If the first action in is a macroaction of length , then the rest of expands to length , so the contribution to the sum is . To summarize,
where it is understood that for . The base case is . Since the sum of the coefficients in the recursive formula equals 1, is just a weighted average of . Thus, if for then for all .
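The weighted-average structure of the recursion can be illustrated with a small exact computation. The sketch below rests on assumptions that are not taken from the source: the quantity is read as the sum, over all sequences of augmented actions that expand to exactly n base actions, of (number of augmented actions)^(-sequence length), with the empty sequence contributing 1 at n = 0 and zero for negative n. Under that reading, the recursion has coefficients equal to the fraction of base actions and the fractions of macroactions of each length, which sum to 1. All function names are hypothetical.

```python
from fractions import Fraction
from itertools import product
from typing import Dict, List


def recursive_sum(n_max: int, num_base: int, macro_counts: Dict[int, int]) -> List[Fraction]:
    """f(0..n_max) via f(n) = b * f(n-1) + sum_l m_l * f(n-l), with f(0) = 1 (assumed)."""
    total_actions = num_base + sum(macro_counts.values())
    base_frac = Fraction(num_base, total_actions)
    macro_fracs = {l: Fraction(c, total_actions) for l, c in macro_counts.items()}
    f = [Fraction(1)]
    for n in range(1, n_max + 1):
        value = base_frac * f[n - 1]
        for length, frac in macro_fracs.items():
            if n - length >= 0:  # f is treated as 0 for negative arguments
                value += frac * f[n - length]
        f.append(value)
    return f


def brute_force_sum(n: int, num_base: int, macro_counts: Dict[int, int]) -> Fraction:
    """Directly sum total_actions**(-k) over length-k sequences expanding to n base actions."""
    lengths = [1] * num_base + [l for l, c in macro_counts.items() for _ in range(c)]
    total_actions = len(lengths)
    total = Fraction(0)
    for k in range(n + 1):
        for seq in product(lengths, repeat=k):
            if sum(seq) == n:
                total += Fraction(1, total_actions ** k)
    return total


if __name__ == "__main__":
    counts = {2: 1, 3: 1}  # one macroaction of length 2 and one of length 3
    values = recursive_sum(6, num_base=2, macro_counts=counts)
    for n, value in enumerate(values):
        print(n, value, brute_force_sum(n, 2, counts))
```

Because each value is a weighted average of earlier values (with weights summing to at most 1 once negative indices are dropped), no value exceeds the largest earlier one, which is the boundedness property the argument relies on.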
Let’s show by induction on that
(37) |
for all . It suffices to show that for .
For , . For , and .
Now, for , assume the and cases hold.
Let’s upper-bound for the following two cases separately: (i) ; (ii) .
(i) Define for where . Define the sequence with for and . Then the inductive hypothesis gives for . It is easy to show by induction on that for , so for ,
where is restricted to the range . Since is increasing for , its maximum is reached when , i.e.,
(ii) Note that recursively expanding the recursion formula for until we reach the base cases results in a polynomial in . It is easy to see by induction on that, for , no term contains where and there is a single term containing which is just . So for some polynomial , and
for some polynomial . Substituting results in
for some polynomial . This is linear in where , so
where
by the inductive hypothesis. Now let’s upper-bound .
Note that, regardless of the value of , we have and for by the inductive hypothesis, since these values of are independent of . Thus,
Thus, we have shown that , which completes the inductive step. This concludes the proof of Equation 37.
Appendix G Relating -Incompressibility to Skill Learning
The intuition that skills should optimally compress successful trajectories has been previously used by skill-discovery algorithms like LOVE and LEMMA. Using the incompressibility measures introduced in this paper, we can approach skill learning more rigorously. There are two approaches to converting -incompressibility into a skill-learning objective.
The first approach is to find that minimizes the lower bound on the -learning difficulty increase ratio as given in Theorem 4.2. This is equivalent to minimizing
(39) |
where the factor is proportional to the -merged -incompressibility. Usually, is large, as a result of which the maximizing satisfies and . Thus, minimizing becomes equivalent to minimizing
(40) |
When is known or given as a hyperparameter, then the objective is to minimize
(41) |
Note that in practice, it is not possible to compute , the distribution of shortest solutions using actions from to states generated by the environment. However, we do have a training set of offline experience, so we can use our skills to rewrite these solutions and define to be the resultant empirical distribution of abstracted solutions. The appearing in the objectives should thus be interpreted as as calculated from our training set.
However, the resultant approximation of will be a significant under-approximation if the training set is much smaller than the number of states that cover most of the state space under . In this case, we recommend modeling with the assumption that it is generated by sampling i.i.d. actions from a distribution over , with solution length sampled from a distribution . Then the maximum-likelihood (ML) estimates of and are just the empirical distribution of actions and the empirical distribution of solution lengths in the abstracted training set. If we define to be the distribution of action sequences defined by this choice of and , then we can approximate , where is the average solution length. Under this approximation, becomes
(42) |
(We can similarly apply this approximation to and .) It is often the case that is much smaller than , so neglecting that term results in the objective
(43) |
Note that is exactly the minimum description length (MDL) objective used by LOVE (Jiang et al., 2022). It represents the average number of bits required to encode an abstracted solution, where the encoding of actions is optimized for the empirical distribution of actions in the abstracted training set.
The second approach to deriving a skill learning objective from -incompressibility is based on the idea that the maximally abstracted environment is the least compressible. Using unmerged -incompressibility to measure incompressibility, this corresponds to the maximization objective
(44) |
Similar to in , in is often large, in which case the maximizing satisfies and . Under this approximation, the maximization objective becomes the minimization objective
(45) |
As with , cannot be computed exactly, so we approximate it with the average solution length in the abstracted training set, i.e., . As a result,
(46) |
which is just but with a uniform distribution for . It can thus also be interpreted as an MDL objective where the encoding of actions is a uniform code. Note that this is exactly the objective used by LEMMA (Li et al., 2022).
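The two MDL-style objectives described in this appendix can be sketched on a toy training set. This is an illustration under stated assumptions, not the LOVE or LEMMA implementation. Solutions are strings of base actions, macroactions are applied with a hypothetical greedy left-to-right rewrite that is not guaranteed to give the shortest abstraction, and the objectives are computed as average abstracted length times either the empirical action entropy or the log of the augmented action count, both in bits. All names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def rewrite_with_macros(solution: Sequence[str], macros: Dict[str, Sequence[str]]) -> List[str]:
    """Greedy rewrite: at each position, substitute the longest matching macroaction."""
    ordered = sorted(macros.items(), key=lambda item: -len(item[1]))
    out, i = [], 0
    while i < len(solution):
        for name, body in ordered:
            if tuple(solution[i:i + len(body)]) == tuple(body):
                out.append(name)
                i += len(body)
                break
        else:  # no macroaction matched here, so keep the base action
            out.append(solution[i])
            i += 1
    return out


def mdl_empirical_code(abstracted: List[List[str]]) -> float:
    """Average bits per solution when actions use a code fit to their empirical frequencies."""
    counts = Counter(action for solution in abstracted for action in solution)
    total = sum(counts.values())
    entropy_bits = -sum(c / total * math.log2(c / total) for c in counts.values())
    mean_length = total / len(abstracted)
    return mean_length * entropy_bits


def mdl_uniform_code(abstracted: List[List[str]], num_actions: int) -> float:
    """Average bits per solution when every augmented action uses a uniform code."""
    mean_length = sum(len(solution) for solution in abstracted) / len(abstracted)
    return mean_length * math.log2(num_actions)


if __name__ == "__main__":
    base_solutions = ["RRUU", "RRU", "URR"]
    macros = {"M": "RR"}
    abstracted = [rewrite_with_macros(s, macros) for s in base_solutions]
    print(abstracted)
    print(mdl_empirical_code(abstracted), mdl_uniform_code(abstracted, num_actions=3))
```

The empirical-code objective is never larger than the uniform-code objective for the same abstracted training set, since the empirical entropy of the actions is at most the log of the number of available actions.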