Oracle-Efficient Reinforcement Learning
for Max Value Ensembles
Abstract
Reinforcement learning (RL) in large or infinite state spaces is notoriously challenging, both theoretically (where worst-case sample and computational complexities must scale with state space cardinality) and experimentally (where function approximation and policy gradient techniques often scale poorly and suffer from instability and high variance). One line of research attempting to address these difficulties makes the natural assumption that we are given a collection of heuristic base or constituent policies upon which we would like to improve in a scalable manner. In this work we aim to compete with the max-following policy, which at each state follows the action of whichever constituent policy has the highest value. The max-following policy is always at least as good as the best constituent policy, and may be considerably better. Our main result is an efficient algorithm that learns to compete with the max-following policy, given only access to the constituent policies (but not their value functions). In contrast to prior work in similar settings, our theoretical results require only the minimal assumption of an ERM oracle for value function approximation for the constituent policies (and not the global optimal policy or the max-following policy itself) on samplable distributions. We illustrate our algorithm's experimental effectiveness and behavior on several robotic simulation testbeds.
1 Introduction
Computationally efficient RL algorithms are known for simple environments with small state spaces such as tabular Markov decision processes (MDPs) (Kearns and Singh, 2002; Brafman and Tennenholtz, 2002), but practical applications often require dealing with large or even infinite state spaces. Learning efficiently in these cases requires computational complexity independent of the state space, but this is statistically impossible without strong assumptions on the class of MDPs (Jaksch et al., 2010; Lattimore and Hutter, 2012; Du et al., 2019; Domingues et al., 2021). Even in structured MDPs that admit statistically efficient algorithms, learning an optimal policy can still be computationally intractable (Kane et al., 2022; Golowich et al., 2024).
These obstacles to practical RL motivate the study of ensembling methods (Lee et al., 2021; Peer et al., 2021; Chen et al., 2021; Hiraoka et al., 2022), which assume access to multiple sub-optimal policies for the same MDP and aim to leverage these constituent policies to improve upon them. There are now several provably efficient ensembling algorithms, but their guarantees require strong assumptions on the representation of the target policy learned by the algorithm. Brukhim et al. (2022) use the boosting framework for ensembling developed in the supervised learning setting (Freund and Schapire, 1997) to learn an optimal policy, assuming access to a weak learner for a parameterized policy class. To efficiently converge to an optimal policy, the target policy must be expressible as a depth-two circuit over policies from a base class which is efficiently weak-learnable. The convergence guarantees additionally require strong bounds on the worst-case distance between state-visitation distributions of the target policy and policies from the base class.
Another line of ensembling work considers a weaker objective than learning an optimal policy (Cheng et al., 2020; Liu et al., 2023, 2024). These works instead aim to learn a policy competitive with a max-aggregation policy, which takes whichever action maximizes the advantage function with respect to a max-following policy at the current state. When these works have provable guarantees, they require the assumption that the target max-aggregation policy can be approximated in an online-learnable parametric class, as well as the assumption that policy gradients within the class can be efficiently estimated with low variance and bias.
Our goal is to learn a policy competitive with a similar but incomparable benchmark to that of Cheng et al. (2020) under comparatively weak assumptions. We give an efficient algorithm for learning a policy competitive with a max-following policy (Definition 2.1), assuming the learner has access to a squared-error regression oracle for the value functions of the constituent policies. Our algorithm exclusively queries this oracle on distributions over states that are efficiently samplable, thereby reducing the problem of learning a max-following competitive policy to supervised learning of value functions. Notably, our learnability assumptions pertain only to the value functions of the constituent policies and not to the more complicated class of max-following benchmark policies or their value functions. Our algorithm is simple and effective, which we demonstrate empirically in Section 5.
It is natural to wonder if access to an oracle such as ours could be leveraged to instead efficiently learn an optimal policy, obviating the need for weaker benchmarks (and our results). However, it was recently shown by Golowich et al. (2024) that learning an optimal policy in a particular family of block MDPs is computationally intractable under reasonable cryptographic assumptions, even when the learner has access to a squared-error regression oracle. Their oracle captures a general class of regression tasks that includes value function estimation, and therefore also captures our oracle assumption. Our work shows that when we instead consider the simpler objective of efficiently learning a policy that competes with max-following, a regression oracle is in fact sufficient. We leave open the interesting question of whether such an oracle is necessary.
1.1 Results
Our main contribution is a novel algorithm for improving upon a set of given policies that is oracle efficient with respect to a squared-error regression oracle, and therefore scalable in large state spaces (Algorithm 1, Theorem 3.1). We consider the episodic RL setting in which the learner interacts with its environment for episodes of a fixed length. The algorithm constructs an improved policy incrementally, with one iteration per step of the episode: at each iteration it learns the improved policy for the corresponding step. This incremental approach allows the algorithm to explicitly construct efficiently samplable distributions over the states visited by the improved policy at a given step, simply by executing the current policy up to that step. It can then query its oracle to obtain approximate value functions for all constituent policies with respect to this distribution. This in turn allows the algorithm to learn an improved policy for that step by following the policy with the highest estimated value. By incrementally constructing an improved policy over the steps of the episode, we can avoid making assumptions like those of Brukhim et al. (2022) about the overlap between state-visitation distributions of the target policy and the intermediate policies constructed by the algorithm.
Because our oracle only gives us approximate value functions, we take as our benchmark class the set of approximate max-following policies (Definition 2.3). This is a superset of the class of max-following policies and contains all policies that at each state follow the action of some constituent policy with near-maximum value at that state. In Section 4, we prove that for any set of constituent policies, the worst approximate max-following policy is competitive with the best constituent policy (Lemma 4.1) and provide several example MDPs illustrating how our benchmark relates to other natural benchmarks.
Finally, we demonstrate the practical feasibility of our algorithm using a heuristic version on a set of robotic manipulation tasks from the CompoSuite benchmark (Mendez et al., 2022; Hussing et al., 2023). We demonstrate that in all cases, the max-following policy we find is at least as good as the constituent policies and in several cases outperforms them significantly.
1.2 Related work
As discussed above, our work is related to a recent line of research learning a max-aggregation policy (Cheng et al., 2020; Liu et al., 2023, 2024), which can be viewed as a one-step look-ahead max-following policy and is incomparable to the class of max-following policies (see the appendix of Cheng et al. (2020) for example MDPs demonstrating this fact). These works all assume online learnability of the target policy class, which is strictly stronger than our batch learnability assumption for constituent policy value functions.
The work of Cheng et al. (2020) proposes an algorithm (MAMBA) that uses policy gradient methods, and the convergence of the learned policy to their benchmark depends on the bias and variance of those policy gradients. Liu et al. (2023, 2024) build on the work of Cheng et al. (2020). Their algorithm MAPS-SE modifies MAMBA to promote exploration when there is uncertainty about which constituent policy has the greatest value at a state, via an upper confidence bound (UCB) approach to policy selection. Reducing uncertainty about the constituent policies' value functions reduces the bias and variance of the gradient estimates, improving convergence guarantees. However, policy gradient techniques are known to generally have high variance (Wu et al., 2018), and this appears to affect the practical performance of MAPS-SE in certain cases (see Section 5 for additional discussion).
The boosting approach to policy ensembling of Brukhim et al. (2022) also necessitates strong assumptions. This follows from the computational separation in Golowich et al. (2024), which shows that our oracle assumption is insufficient to learn an optimal policy, whereas the assumptions made in Brukhim et al. (2022) enable convergence to optimality. Our work also gives convergence guarantees that are independent of any relationship between the starting state distribution, the state-visitation distributions of the base policy class, and the state-visitation distribution of the target policy, whereas bounds on the closeness of these distributions are required for convergence in Brukhim et al. (2022).
There are other lines of work on policy improvement, which consider improving upon a single base policy and therefore do not address the challenge of ensembling (Sun et al., 2017; Schulman et al., 2015; Chang et al., 2015). Empirical work on ensemble imitation learning (IL) also studies the problem of leveraging multiple base policies for learning (Li et al., 2018; Kurenkov et al., 2019), but these works lack provable guarantees of efficient convergence to a meaningful benchmark.
Song et al. (2023) provide a survey of a variety of more complex techniques to ensemble policies, mainly from a practical perspective. Barreto et al. (2017, 2020) decompose complex tasks into a set of multiple smaller tasks where they use transfer learning, but they make strong assumptions about the joint parametrization of rewards for various tasks and about the representations of the tasks.
2 Preliminaries
We consider an episodic fixed-horizon Markov decision process (MDP) (Puterman, 1994) which we formalize as a tuple where is the set of states, the set of actions, is a reward function, the transition dynamics, a distribution over starting states and the horizon (Sutton and Barto, 2018). will denote the set . In the beginning, an initial state is sampled from . At any time , the agent is in some state and chooses an action based on a function mapping from states to distributions over actions . As a consequence, the agent traverses to a new next state sampled from and obtains a reward . Without loss of generality, we assume that rewards are bounded within . The sequence of functions used by the agent is referred to as its policy, and is denoted . A trajectory is the sequence of (state, action) pairs taken by the agent over an episode of length , and is denoted . We will use the notation to refer to sampling a trajectory by first sampling a starting state , and then executing policy from .
The goal of the learner is to maximize the expected cumulative reward over episodes of length . We further define the value function as the expected cumulative return of following some policy from some state as . Due to the finite horizon of the episodic setting, we will also need to refer to the expected cumulative reward from state under policy from time . We denote this time-specific value function by . Finally, the key object of interest is a max-following policy. Given access to a set of arbitrarily defined policies and their respective value functions which we denote by the shorthand , a max-following policy is defined as a policy that at every step follows the action of the policy with the highest value in that state.
Definition 2.1 (Max-following policy class).
Fix a set of policies for a common MDP and an episode length . The class of max-following policies is defined
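For concreteness, one way to write this class is sketched below. The notation is an assumption made for illustration and is not taken from the text above: $K$ constituent policies $\pi^1, \dots, \pi^K$, episode length $H$, state space $\mathcal{S}$, and time-specific value functions $V^k_h$ for policy $\pi^k$ at time $h$.

```latex
% Assumed notation: [K] indexes the constituent policies, [H] the time steps.
\Pi^{\max} \;=\; \Bigl\{ \pi \;:\; \forall h \in [H],\ \forall s \in \mathcal{S},\quad
  \pi_h(s) = \pi^{k^\star}_h(s) \ \text{ for some } \ k^\star \in \arg\max_{k \in [K]} V^{k}_h(s) \Bigr\}
```

Because the $\arg\max$ can contain several indices, the set generally contains more than one policy, which is the source of the ties discussed next.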
Note that for any collection of constituent policies there may be many max-following policies, due to ties between the value functions. Different max-following policies may have different expected return, and we refer the reader to Observation 4.5 for an example demonstrating this fact.
We assume access to a value function oracle that allows us to approximate a value function of a policy under a samplable distribution at any specified time . This oracle is intended to capture the common assumption that the value function of a policy can be efficiently well-approximated by a function from a fixed parameterized class. In practice, one might imagine implementing this oracle as a neural network minimizing the squared error to a target value function.
Definition 2.2 (Oracle for value function estimates).
We denote by an oracle satisfying the following guarantee for a policy . For any , and any , given as input a time and sampling access to any efficiently samplable distribution , the oracle outputs such that . We use the notation to denote with fixed accuracy parameter . We will also use the shorthand .
Looking ahead to Section 3, we note that for every distribution on which Algorithm 1 queries an oracle, is not only efficiently samplable, but samplable by executing an explicitly constructed policy for steps in MDP , starting from . Thus, for any distribution , policy , and time for which we query , we could efficiently obtain an unbiased estimate of by following a known for steps from , and then switching to for the remainder of the episode. We mention this to highlight that our oracle is not eliding any technical obstacles to sampling in the episodic setting. It is simply abstracting the supervised learning task of converting unbiased estimates of into an approximation with small squared error with respect to .
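The roll-in and roll-out procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the oracle itself: the three-state chain `step`, the horizon `H = 4`, the policies `uniform_policy` and `always_right`, and the tabular least-squares fit `fit_tabular` are all hypothetical stand-ins chosen so the example runs on its own.

```python
import random
from collections import defaultdict

H = 4      # episode length (assumed toy value)
LAST = 2   # rewarding state of a hypothetical 3-state chain

def step(state, action):
    """Hypothetical dynamics: action 1 moves right, action 0 stays; landing in LAST pays 1."""
    nxt = min(state + action, LAST)
    return nxt, (1.0 if nxt == LAST else 0.0)

def sample_regression_pair(roll_in, pi_k, h, rng):
    """Follow roll_in for h steps from the start state, then pi_k until the episode ends.
    Returns (state reached at time h, return collected from time h onward)."""
    state = 0
    for t in range(h):
        state, _ = step(state, roll_in(t, state, rng))
    s_h, ret = state, 0.0
    for t in range(h, H):
        state, reward = step(state, pi_k(t, state, rng))
        ret += reward
    return s_h, ret

def fit_tabular(pairs):
    """Squared-error minimiser over tabular functions: the per-state mean of the targets."""
    sums, counts = defaultdict(float), defaultdict(int)
    for s, y in pairs:
        sums[s] += y
        counts[s] += 1
    return {s: sums[s] / counts[s] for s in sums}

def uniform_policy(t, state, rng):
    return rng.randrange(2)

def always_right(t, state, rng):
    return 1

rng = random.Random(0)
pairs = [sample_regression_pair(uniform_policy, always_right, 1, rng) for _ in range(200)]
v_hat = fit_tabular(pairs)  # estimated value of always_right at time 1, per visited state
```

In a large state space the per-state mean would be replaced by a regression model over state features; the sampling scheme stays the same.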
Lastly, we define our benchmark class of policies. Given a set of constituent policies , our benchmark defines for each state and time a set of permissible actions: any action taken by a policy for which the value is sufficiently close to the maximum value . The class of approximate max-following policies then consists of all policies that exclusively take permissible actions. We refer the reader to Section 4 for further explanation of this benchmark.
Definition 2.3 (Approximate max-following policies).
We define a set of -good policies at state and time , selected from a set , as follows.
Then we define the set of approximate max-following policies for to be
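As an illustration of the permissible set described above, the sketch below selects every constituent policy whose value is within a tolerance of the maximum at a fixed state and time. The function name `good_policies`, the tolerance name `beta`, and the exact threshold (`max - beta`) are assumptions for illustration; the precise condition is the one given in Definition 2.3.

```python
def good_policies(values, beta):
    """Indices k whose value at a fixed (state, time) is within beta of the maximum.

    values: list of floats, values[k] standing in for the value of policy k.
    beta:   non-negative tolerance (assumed name).
    """
    best = max(values)
    return [k for k, v in enumerate(values) if v >= best - beta]

# With beta = 0 only exact maximisers remain, recovering plain max-following.
exact = good_policies([1.0, 0.95, 0.5], 0.0)
approx = good_policies([1.0, 0.95, 0.5], 0.1)
```

An approximate max-following policy, in this reading, is any policy that at every state and time takes the action of some policy in the returned set.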
3 The learning algorithm
In this section, we introduce our algorithm for learning an approximate max-following policy, MaxIteration (Algorithm 1). This algorithm learns a good approximation of a max-following policy at step , assuming access to a good approximation of a max-following policy for all previous steps.
For the first step (), the algorithm learns a good approximation for all constituent policies on the starting distribution . These approximate value functions can in turn be used to define the first action taken by the approximate max-following policy, namely . Following from generates a samplable distribution over states , and so our oracle assumption allows us to obtain good estimates with respect to for all . We can then define the second action of the approximate max-following policy, and so on, for all steps.
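The step-by-step construction described above can be sketched in a few lines. This is a hedged illustration rather than Algorithm 1 itself: the oracle is replaced by a tabular per-state average of Monte Carlo returns, and the function names (`max_iteration_sketch`, `episode_return`), the toy MDP `toy_step`, and the two constant policies are hypothetical.

```python
import random
from collections import defaultdict

def episode_return(policy, step, sample_start, H, rng):
    """Run one episode with `policy` and return the cumulative reward."""
    state, total = sample_start(rng), 0.0
    for t in range(H):
        state, reward = step(state, policy(t, state), rng)
        total += reward
    return total

def max_iteration_sketch(policies, step, sample_start, H, n_samples, rng):
    """Build the learned policy one time step at a time (tabular stand-in for the oracle)."""
    value_tables = []  # value_tables[h][k][s] estimates the value of policy k at (h, s)

    def learned_policy(t, state):
        # Follow the constituent policy with the highest estimated value at (t, state).
        # Unseen states default to 0; ties go to the lowest index.
        estimates = [value_tables[t][k].get(state, 0.0) for k in range(len(policies))]
        k_best = max(range(len(policies)), key=lambda k: estimates[k])
        return policies[k_best](t, state)

    for h in range(H):
        tables_h = []
        for pi_k in policies:
            sums, counts = defaultdict(float), defaultdict(int)
            for _ in range(n_samples):
                state = sample_start(rng)
                for t in range(h):  # roll in with the policy learned so far
                    state, _ = step(state, learned_policy(t, state), rng)
                s_h, ret = state, 0.0
                for t in range(h, H):  # roll out with constituent policy k
                    state, reward = step(state, pi_k(t, state), rng)
                    ret += reward
                sums[s_h] += ret
                counts[s_h] += 1
            tables_h.append({s: sums[s] / counts[s] for s in sums})
        value_tables.append(tables_h)
    return learned_policy

def toy_step(state, action, rng):
    """Hypothetical deterministic 3-state MDP in which switching policies pays off."""
    if state == 0:
        return (1, 0.1) if action == 0 else (0, 0.0)
    if state == 1:
        return (1, 0.0) if action == 0 else (2, 1.0)
    return (2, 1.0)

def always_0(t, state):
    return 0

def always_1(t, state):
    return 1

def start(rng):
    return 0

H = 4
rng = random.Random(0)
learned = max_iteration_sketch([always_0, always_1], toy_step, start, H, n_samples=5, rng=rng)
```

In this toy MDP each constant policy earns at most 0.1, while the learned policy follows `always_0` at the first step and `always_1` afterwards, collecting reward at every later step.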
Theorem 3.1.
For any , any MDP with starting state distribution , any episode length , and any policies defined on , let and . Then makes oracle queries and outputs such that
Proof.
For all , , let denote the approximate value function obtained from in Algorithm 1. We then define, for every , the set of states for which some approximate value function has large absolute error () and the set of bad trajectories that pass through a state in for any : and . We will show that there exists an approximate max-following policy such that for any trajectory , . We then bound the probability , and the contribution to from these trajectories, proving the claim.
Let denote the value of the policy that follows at time and state . From the definition of the bad set and the setting of , for any state ,
In other words, if a state is not bad at time , then for a policy that has value within of the true max value . It then follows from the definition of the class of approximate max-following policies (Definition 2.3) that there exists some such that for all , for all , .
For any trajectory , . Then for any trajectory , , and therefore
For , we have lower and upper-bounds and . We can then write:
It remains to upper-bound . We have already argued . Observing that , it is sufficient to show to prove the claim. For all , let , and note that this is the distribution supplied to the oracle at iteration of Algorithm 1. It follows from our oracle assumption (Definition 2.2) that for all , . We apply Markov's inequality to conclude that for all ,
Union bounding over the constituent policies gives , from the definition of . Union bounding over the trajectory length , we then have It follows that
completing the proof. ∎
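The Markov and union-bound steps used in the proof follow a generic pattern, sketched here with placeholder symbols. The names are assumptions for illustration and the constants are not those of Theorem 3.1: $\mu_h$ is the distribution queried at step $h$, $\alpha$ bounds the oracle's expected squared error, $\tau$ is the absolute-error threshold defining a bad state, and $K$ and $H$ are the number of constituent policies and the episode length.

```latex
% Markov's inequality applied to the squared error of one estimate:
\Pr_{s \sim \mu_h}\!\bigl[\, |\hat V^k_h(s) - V^k_h(s)| \ge \tau \,\bigr]
  \;=\; \Pr_{s \sim \mu_h}\!\bigl[\, (\hat V^k_h(s) - V^k_h(s))^2 \ge \tau^2 \,\bigr]
  \;\le\; \frac{\mathbb{E}_{s \sim \mu_h}\bigl[(\hat V^k_h(s) - V^k_h(s))^2\bigr]}{\tau^2}
  \;\le\; \frac{\alpha}{\tau^2}.

% Union bound over the K policies and then over the H steps:
\Pr\bigl[\text{some estimate is off by at least } \tau \text{ along the trajectory}\bigr]
  \;\le\; \frac{K H \alpha}{\tau^2}.
```

Choosing $\alpha$ small relative to $\tau^2 / (KH)$ therefore keeps the probability of a bad trajectory small, which is the role the accuracy parameters play in the argument above.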
4 The approximate max-following benchmark
In this section, we provide additional context for our benchmark class of approximate max-following policies. We show that the worst policy in our benchmark class competes with the best fixed policy from the set of constituent policies. We also provide examples of MDPs that showcase properties of the set of (approximate) max-following policies.
Lemma 4.1 (Worst approximate max-following policy competes with best fixed policy).
For any and any episode length , let . Then for any MDP with starting state distribution , and any policies defined on ,
It is an immediate corollary of Theorem 3.1 and Lemma 4.1 that the policy learned by Algorithm 1 competes with the best constituent policy.
Corollary 4.2.
For any , any MDP with starting state distribution , any episode length , and any policies defined on , let , and let denote the policy output by . Then
We provide diagrams of MDPs as examples for the observations that we make below. States in are denoted by the labels on the nodes. Actions in are indicated by arrows from given states with deterministic transition dynamics and the rewards are labeled over the corresponding arrows. Arrows may be omitted for transitions that are self-loops with reward .
Observation 4.3.
The worst approximate max-following policy can be arbitrarily better than the best constituent policy.
Consider in Figure 1(a) two policies on this MDP: and , for all . Note that for any episode length , for all , . For any , comprises policies such that , , and . Therefore for any episode length , and state , . In this example, any approximate max-following policy is also an optimal policy, whose gap in expected return with the best constituent policy can be made arbitrarily large by increasing .
Observation 4.4.
A max-following policy cannot always compete with an optimal policy.
In Figure 1(b), consider policies , , and , for all . At state , is the only policy with non-zero value. Thus, any max-following policy will take action from , receiving reward and then reward 0 for the remainder of the episode. Given a starting state distribution supported entirely on , for any episode length , the optimal policy will obtain cumulative reward , whereas any max-following policy will only obtain reward .
Observation 4.5.
Different max-following policies may have different expected cumulative reward.
We again consider Figure 1(b), but suppose now the starting state distribution is supported entirely on . For all , and so a max-following policy may take any action from . A max-following policy that always takes actions or from will only ever obtain cumulative reward 0, but a max-following policy that takes action will move to and (so long as more than one step remains in the episode) will then take action and move to state , where it will stay to obtain cumulative reward .
If the value functions of constituent policies are exactly known, it is easy to construct a max-following policy, but the learner may not have access to these functions. If the learner only has access to approximations and follows whichever policy has the larger approximate value at the current state, the resulting policy can have much lower expected cumulative reward than the max-following policy. This is true even for state-wise bounds on the value approximation error. This observation is what motivated our definition of the approximate max-following class (Definition 2.3).
Observation 4.6.
Small value function approximation errors can be an obstacle to learning a max-following policy.
In Figure 2(a), we again consider policies and for all states , color coding the actions taken by with red and with blue. For starting state distribution supported entirely on , a max-following policy will take action , , and for the remainder of the episode, obtaining reward . However, given only approximate value functions with state-wise absolute error bound for all states and times , the policy that takes action for can have much lower expected cumulative reward than a max-following policy. For example if and in our Figure 2(a) example, then will have expected return 0.
Observation 4.7.
A max-following policy's value function is not always of the same parametric class as the constituent policies' value functions.
As a simple first example, consider an MDP with states and actions . Every action leads to a self-loop (for all , ) and for a fixed action, rewards are affine functions of the state (e.g. and ). We consider two policies: and for all . Notice that for episode length , and . Since the dynamics keep the state at the same fixed place independent of the action, the max-following policy at state will simply be the max of the two individual value functions at and therefore its parametric class will be piecewise linear, unlike the constituent policies' which are affine (see Figure 2(b)). To provide a more complex MDP example, we consider a traditional control problem with continuous state and action spaces: the discrete linear quadratic regulator. In this example the constituent linear policies have quadratic value functions, but the max-following policy is not of the same parametric class. See Appendix A for further discussion.
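The self-loop example can be checked numerically. The sketch below assumes states in the interval [0, 1], two actions, and the hypothetical affine rewards `r(s, 0) = s` and `r(s, 1) = 1 - s`; these specific choices are illustrative assumptions, not the ones used in the text.

```python
H = 5  # episode length (assumed toy value)

def reward(s, a):
    # Hypothetical affine rewards in the state s, one per action.
    return s if a == 0 else 1.0 - s

def value(s, a, horizon=H):
    # Every action is a self-loop, so a fixed-action policy earns the same reward each step.
    return horizon * reward(s, a)

def max_following_value(s, horizon=H):
    # The state never changes, so max-following earns the larger of the two values.
    return max(value(s, 0, horizon), value(s, 1, horizon))

def midpoint_gap(f, lo, hi):
    """Zero for any affine f: compares f at the midpoint with the average of the endpoints."""
    return f(0.5 * (lo + hi)) - 0.5 * (f(lo) + f(hi))

gap_policy_0 = midpoint_gap(lambda s: value(s, 0), 0.0, 1.0)
gap_policy_1 = midpoint_gap(lambda s: value(s, 1), 0.0, 1.0)
gap_max = midpoint_gap(max_following_value, 0.0, 1.0)
```

Both constituent value functions pass the midpoint test, while the max-following value has a kink at `s = 0.5` and fails it, so it is piecewise linear but not affine.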
5 Experiments
We proceed to examine our MaxIteration algorithm in a set of experiments that uses neural network function approximation as oracles. These experiments aim to provide a scenario to demonstrate the usefulness of max-following. While previous works in this line of research have studied the ability to integrate knowledge from the constituent policies to increase performance of a learnable policy (Cheng et al., 2020; Liu et al., 2023, 2024), our algorithm offers an alternative approach. We consider a common scenario from the field of robotics where one has access to older policies from a robotic simulator that were used in previous projects. As long as the dynamics of the MDP of interest do not differ, such old policies can simply be re-used in new applications. In such cases, training completely from scratch can be incredibly expensive due to the vast search space (Schulman et al., 2017; Haarnoja et al., 2018). We note that this setup is related to the one used by Barreto et al. (2017, 2020) but we do not put any constraints on the reward functions.
Experimental setup. A recent robotic simulation benchmark called CompoSuite (Mendez et al., 2022) and its corresponding offline datasets (Hussing et al., 2023) offer an instantiation of such a scenario. CompoSuite consists of four axes: robot arms, objects, objectives and obstacles. Tasks are constructed by combining one element from each axis. We consider tasks with a fixed IIWA robotic manipulator and no obstacle. This leaves us with a total of 16 tasks. These 16 tasks are randomly grouped into pairs. Each pair is one experiment in which the policies trained on the two tasks serve as our constituents. To create a new target task, we change one element per task, creating novel combinations for each group. For example, we start with the constituent policies that can 1) put and place a box into a trashcan and 2) push a plate. The target task can be to push the box. We train our constituent policies on the expert datasets using the offline RL algorithm Implicit Q-learning (IQL) (Kostrikov et al., 2022). This ensures we obtain very strong constituent policies for their respective tasks. After training the constituents, we run MaxIteration and the baselines for a short amount of time in the simulator. We report mean performance and standard error over 5 seeds using an evaluation of episodes.
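The task grouping and the reporting statistic described above can be sketched as follows. The element names in `OBJECTS` and `OBJECTIVES`, the helper names, and the seed handling are assumptions for illustration; this does not call the CompoSuite library and is not the authors' experiment code.

```python
import itertools
import random
import statistics

# Assumed element names for the two axes that vary (robot arm and obstacle are fixed).
OBJECTS = ["box", "dumbbell", "plate", "hollow_box"]
OBJECTIVES = ["pick_place", "push", "shelf", "trashcan"]

def make_task_pairs(seed):
    """Enumerate the 16 (object, objective) tasks and group them randomly into 8 pairs."""
    tasks = list(itertools.product(OBJECTS, OBJECTIVES))
    random.Random(seed).shuffle(tasks)
    return [(tasks[i], tasks[i + 1]) for i in range(0, len(tasks), 2)]

def mean_and_stderr(per_seed_scores):
    """Mean and standard error (sample standard deviation divided by sqrt(n)) across seeds."""
    n = len(per_seed_scores)
    return statistics.mean(per_seed_scores), statistics.stdev(per_seed_scores) / n ** 0.5

pairs = make_task_pairs(seed=0)
mean, stderr = mean_and_stderr([1.0, 2.0, 3.0, 4.0, 5.0])  # placeholder scores for 5 seeds
```

Each pair supplies the two constituent policies for one experiment; the target task for that experiment would then be formed by recombining elements of the pair.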
Algorithms. For practical purposes, we use a heuristic version of MaxIteration which does not re-compute the max-following policy at every step but rather after multiple steps. For our baselines, we ran the code provided by Liu et al. (2023) to train the MAPS algorithm but were unable to obtain non-trivial return even after a reasonable amount of tuning. MAPS has been shown to have difficulties with leveraging very performant constituent policies such as the ones we are using (see the Walker experiment by Liu et al. (2023) in Figure 1 (d) in which the algorithm struggles to be competitive with the best, high-return constituent policy). They conjecture that in this case, their estimates of the constituent value functions will be less accurate in early training, resulting in gradient estimates with large bias and variance, weakening their convergence guarantees.
For now, we opt to use IQL's fine-tuning capabilities, which offer a policy-improvement-style method on top of the best-performing constituent policy, for comparison. Fine-tuning provides a strong baseline in the sense that it has access to the already trained value functions of the constituent policies, providing it with inherently more starting information. For comparability, we limit the number of episodes available for fine-tuning to the same number of episodes available for training MaxIteration. For more details we refer to Appendix C.
Experimental Results
Figure 3 contains a set of demonstrative results. The full results are deferred to Appendix C. The selected results in Figure 3 highlight three properties of MaxIteration:
1. There are cases where max-following not only increases the return but actually leads to solving a task successfully even when none of the constituent policies achieve success.
2. With successful constituent policies, max-following can significantly increase the success rate.
3. Max-following can sometimes increase return without leading to success, demonstrating the need for future work to better understand which attributes make up good constituent policies.
![Refer to caption](x1.png)
![Refer to caption](x2.png)
![Refer to caption](x3.png)
![Refer to caption](x4.png)
The results in Appendix C demonstrate that in all cases, MaxIteration is at least as good as the best constituent policy, which is not the case for algorithms from prior work (Liu et al., 2023) as discussed earlier. Moreover, MaxIteration consistently leads to greater return improvement than fine-tuning given the same amount of data. Fine-tuning with substantially more resources would eventually surpass the performance of MaxIteration, as MaxIteration is limited to competing with the max-following benchmark, which can be suboptimal.
6 Conclusion
We introduce MaxIteration, an algorithm to efficiently learn a policy that is competitive with the approximate max-following benchmark (and hence also with all constituent policies). We provide empirical evidence that max-following over previously learned skills enables us to complete tasks that would be inefficient to learn from scratch, and that the resulting policy is superior to the individually trained experts for the given fixed skills.
Limitations and Future Work
Our goal in this work has been to learn a policy that competes with an approximate max-following policy under minimal assumptions. However, we still assume efficient batch learnability of constituent value functions, which will not always be feasible in practice. While it seems likely that our oracle assumption is necessary for learning an approximate max-following policy, we leave proving this claim for future work. We also leave consideration of alternative ensembling approaches to future work. Max-value ensembling is sensitive to slight differences in the values between constituent policies whereas, e.g., softmax takes into account the relative "weighting" of values. In addition, it would be interesting to characterize the amount of improvement we can obtain over our constituent policies or prove conditions under which our approximate max-following policy is competitive with a true max-following policy or the optimal policy. One could also extend this analysis to ensembling methods like softmax and study the nature of guarantees in that setting. Extending beyond MDPs to the partially observable setting, and to the discounted infinite-horizon setting, would also add richness to the class of problems we could consider.
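The sensitivity contrast between max-value and softmax ensembling mentioned above can be illustrated with a small sketch. The function names and the temperature parameter are assumptions for illustration; softmax ensembling is not analyzed in this work.

```python
import math

def hard_max_weights(values):
    """All weight on the policy with the highest value (ties go to the lowest index)."""
    best = max(range(len(values)), key=lambda k: values[k])
    return [1.0 if k == best else 0.0 for k in range(len(values))]

def softmax_weights(values, temperature=1.0):
    """Weights proportional to exp(value / temperature), computed stably."""
    top = max(values)
    exps = [math.exp((v - top) / temperature) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

# Two nearly tied value estimates whose order flips under a small perturbation.
before, after = [1.00, 1.01], [1.01, 1.00]
hard_before, hard_after = hard_max_weights(before), hard_max_weights(after)
soft_before, soft_after = softmax_weights(before), softmax_weights(after)
```

The hard-max weights jump from one policy to the other when the values swap, while the softmax weights stay close to one half in both cases.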
Acknowledgements and Disclosure of Funding
The authors are partially supported by ARO grant W911NF2010080, DARPA grant HR001123S0011, the Simons Foundation Collaboration on Algorithmic Fairness, and NSF grants FAI-2147212 and CCF-2217062.
References
- Amit et al. [2020] Ron Amit, Ron Meir, and Kamil Ciosek. Discount factor as a regularizer in reinforcement learning. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 269–278. PMLR, 13–18 Jul 2020.
- Barreto et al. [2017] Andre Barreto, Will Dabney, Remi Munos, Jonathan J Hunt, Tom Schaul, Hado P van Hasselt, and David Silver. Successor features for transfer in reinforcement learning. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30, 2017.
- Barreto et al. [2020] André Barreto, Shaobo Hou, Diana Borsa, David Silver, and Doina Precup. Fast reinforcement learning with generalized policy updates. Proceedings of the National Academy of Sciences, 117(48):30079–30087, 2020. doi: 10.1073/pnas.1907370117.
- Bertsekas [2012] Dimitri Bertsekas. Dynamic programming and optimal control: Volume I, volume 4. Athena scientific, 2012.
- Brafman and Tennenholtz [2002] Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
- Brukhim et al. [2022] Nataly Brukhim, Elad Hazan, and Karan Singh. A boosting approach to reinforcement learning. Advances in Neural Information Processing Systems, 35:33806–33817, 2022.
- Chang et al. [2015] Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daumé III, and John Langford. Learning to search better than your teacher. In International Conference on Machine Learning, pages 2058–2066. PMLR, 2015.
- Chen et al. [2021] Xinyue Chen, Che Wang, Zijian Zhou, and Keith W. Ross. Randomized ensembled double q-learning: Learning fast without a model. In International Conference on Learning Representations, 2021.
- Cheng et al. [2020] Ching-An Cheng, Andrey Kolobov, and Alekh Agarwal. Policy improvement via imitation of multiple oracles. Advances in Neural Information Processing Systems, 33:5587–5598, 2020.
- Domingues et al. [2021] Omar Darwiche Domingues, Pierre Ménard, Emilie Kaufmann, and Michal Valko. Episodic reinforcement learning in finite mdps: Minimax lower bounds revisited. In Algorithmic Learning Theory, pages 578–598. PMLR, 2021.
- Du et al. [2019] Simon S Du, Sham M Kakade, Ruosong Wang, and Lin F Yang. Is a good representation sufficient for sample efficient reinforcement learning? arXiv preprint arXiv:1910.03016, 2019.
- Freund and Schapire [1997] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
- Glorot et al. [2011] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. In Geoffrey Gordon, David Dunson, and Miroslav Dudík, editors, Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pages 315–323, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR.
- Golowich et al. [2024] Noah Golowich, Ankur Moitra, and Dhruv Rohatgi. Exploration is harder than prediction: Cryptographically separating reinforcement learning from supervised learning. arXiv preprint arXiv:2404.03774, 2024.
- Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 1861–1870. PMLR, 10–15 Jul 2018.
- Hiraoka et al. [2022] Takuya Hiraoka, Takahisa Imagawa, Taisei Hashimoto, Takashi Onishi, and Yoshimasa Tsuruoka. Dropout q-functions for doubly efficient reinforcement learning. In International Conference on Learning Representations, 2022.
- Hussing et al. [2023] Marcel Hussing, Jorge A. Mendez, Anisha Singrodia, Cassandra Kent, and Eric Eaton. Robotic manipulation datasets for offline compositional reinforcement learning, 2023.
- Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(51):1563–1600, 2010.
- Kane et al. [2022] Daniel Kane, Sihan Liu, Shachar Lovett, and Gaurav Mahajan. Computational-statistical gap in reinforcement learning. In Conference on Learning Theory, pages 1282–1302. PMLR, 2022.
- Kearns and Singh [2002] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine learning, 49:209–232, 2002.
- Kostrikov et al. [2022] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=68n2s9ZJWF8.
- Kurenkov et al. [2019] Andrey Kurenkov, Ajay Mandlekar, Roberto Martin-Martin, Silvio Savarese, and Animesh Garg. Ac-teach: A bayesian actor-critic method for policy learning with an ensemble of suboptimal teachers. arXiv preprint arXiv:1909.04121, 2019.
- Lattimore and Hutter [2012] Tor Lattimore and Marcus Hutter. Pac bounds for discounted mdps. In Algorithmic Learning Theory, pages 320–334, Berlin, Heidelberg, 2012. Springer Berlin Heidelberg. ISBN 978-3-642-34106-9.
- Lee et al. [2021] Kimin Lee, Michael Laskin, Aravind Srinivas, and Pieter Abbeel. Sunrise: A simple unified framework for ensemble learning in deep reinforcement learning. In International Conference on Machine Learning. PMLR, 2021.
- Li et al. [2018] Guohao Li, Matthias Mueller, Vincent Casser, Neil Smith, Dominik L Michels, and Bernard Ghanem. Oil: Observational imitation learning. arXiv preprint arXiv:1803.01129, 2018.
- Liu et al. [2023] Xuefeng Liu, Takuma Yoneda, Chaoqi Wang, Matthew Walter, and Yuxin Chen. Active policy improvement from multiple black-box oracles. In International Conference on Machine Learning, pages 22320–22337. PMLR, 2023.
- Liu et al. [2024] Xuefeng Liu, Takuma Yoneda, Rick Stevens, Matthew Walter, and Yuxin Chen. Blending imitation and reinforcement learning for robust policy improvement. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=eJ0dzPJq1F.
- Mendez et al. [2022] Jorge A. Mendez, Marcel Hussing, Meghna Gummadi, and Eric Eaton. Composuite: A compositional reinforcement learning benchmark. In 1st Conference on Lifelong Learning Agents, 2022.
- Peer et al. [2021] Oren Peer, Chen Tessler, Nadav Merlis, and Ron Meir. Ensemble bootstrapping for q-learning. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8454–8463. PMLR, 18–24 Jul 2021.
- Puterman [1994] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., USA, 1st edition, 1994. ISBN 0471619779.
- Schulman et al. [2015] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
- Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. CoRR, abs/1707.06347, 2017. URL http://arxiv.org/abs/1707.06347.
- Seno and Imai [2022] Takuma Seno and Michita Imai. d3rlpy: An offline deep reinforcement learning library. Journal of Machine Learning Research, 23(315):1–20, 2022. URL http://jmlr.org/papers/v23/22-0017.html.
- Song et al. [2023] Yanjie Song, Ponnuthurai Nagaratnam Suganthan, Witold Pedrycz, Junwei Ou, Yongming He, Yingwu Chen, and Yutong Wu. Ensemble reinforcement learning: A survey. Applied Soft Computing, page 110975, 2023.
- Sun et al. [2017] Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. Deeply aggrevated: Differentiable imitation learning for sequential prediction. In International conference on machine learning, pages 3309–3318. PMLR, 2017.
- Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
- Tunyasuvunakool et al. [2020] Saran Tunyasuvunakool, Alistair Muldal, Yotam Doron, Siqi Liu, Steven Bohez, Josh Merel, Tom Erez, Timothy Lillicrap, Nicolas Heess, and Yuval Tassa. dm_control: Software and tasks for continuous control. Software Impacts, 6:100022, 2020. ISSN 2665-9638. doi: https://doi.org/10.1016/j.simpa.2020.100022.
- Wu et al. [2018] Cathy Wu, Aravind Rajeswaran, Yan Duan, Vikash Kumar, Alexandre M Bayen, Sham Kakade, Igor Mordatch, and Pieter Abbeel. Variance reduction for policy gradient with action-dependent factorized baselines. arXiv preprint arXiv:1803.07246, 2018.
Appendix A MDP Examples
A.1 LQR max-following parametric class vs. constituent policies
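$$\min_{u_0, u_1, \dots} \; \sum_{t=0}^{\infty} \gamma^t \left( x_t^\top Q x_t + u_t^\top R u_t \right) \quad \text{subject to} \quad x_{t+1} = A x_t + B u_t$$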
To motivate the use of max-following policies in a richer class of MDPs, we consider a traditional control problem with continuous state and action spaces: the discrete linear quadratic regulator. Note that here we analyze the infinite horizon discounted case so that we can analyze the time-invariant value function, but episodic analogues exist. Consider the following setting where is a discount factor, and . Here, we consider the simple case where and . We know that the optimal policy is of the form [Bertsekas, 2012] and we set two policies that are only stable along one component and unstable along the other of the form and . It is important to note that the value functions of the individual policies and the optimal policies have exact quadratic forms like , but the max-following policy is not necessarily within the same parametric class. For example, is the solution to the Lyapunov equation and . A similar formula exists for policy .
In LQR, for the controllers described above, a max-following policy attains higher value than the individual expert policies, each of which is unstable along one axis. Moreover, while the optimal policy is superior to all other policies, a max-following policy is more competitive with it than either individual expert policy. A max-following policy benefits from the stabilizing component that each individual policy provides along its axis, which lets it perform better than any individual one.
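The comparison above can be reproduced numerically with a minimal sketch. Every concrete number here is an assumption made for illustration, not the setting used in this appendix: the matrices `A`, `B`, `Q`, `R`, the gains `K1` and `K2`, the discount `gamma`, and the start state `x0` are all hypothetical. The sketch works with discounted quadratic costs, so "highest value" corresponds to "lowest cost". Each linear policy $u = -Kx$ has cost $x^\top P x$, where $P$ solves the discounted Lyapunov equation $P = Q + K^\top R K + \gamma (A - BK)^\top P (A - BK)$.

```python
import numpy as np

def policy_cost_matrix(A, B, Q, R, K, gamma):
    """Return P such that the discounted cost of u = -K x from x is x^T P x.

    Solves P = Q + K^T R K + gamma * M^T P M with M = A - B K, which has a
    finite solution when gamma * (spectral radius of M)^2 < 1.
    """
    M = A - B @ K
    C = Q + K.T @ R @ K
    n = A.shape[0]
    lhs = np.eye(n * n) - gamma * np.kron(M.T, M.T)
    return np.linalg.solve(lhs, C.reshape(-1)).reshape(n, n)

def optimal_cost_matrix(A, B, Q, R, gamma, iters=500):
    """Approximate the optimal cost matrix by iterating the discounted Riccati recursion."""
    P = Q.copy()
    for _ in range(iters):
        G = R + gamma * B.T @ P @ B
        P = Q + gamma * A.T @ P @ A - gamma**2 * A.T @ P @ B @ np.linalg.solve(G, B.T @ P @ A)
    return P

def max_following_cost(A, B, Q, R, Ks, Ps, gamma, x0, steps=300):
    """Roll out the policy that, at each state, follows the lowest-cost constituent."""
    x, total = x0.copy(), 0.0
    for t in range(steps):
        k = int(np.argmin([x @ P @ x for P in Ps]))  # highest value = lowest cost
        u = -Ks[k] @ x
        total += gamma**t * (x @ Q @ x + u @ R @ u)
        x = A @ x + B @ u
    return total

# Hypothetical instance: open loop is unstable in both axes,
# and each constituent gain stabilizes exactly one axis.
gamma = 0.8
A, B = 1.1 * np.eye(2), np.eye(2)
Q, R = np.eye(2), np.eye(2)
K1, K2 = np.diag([1.0, 0.0]), np.diag([0.0, 1.0])
x0 = np.array([1.0, 0.5])

Ps = [policy_cost_matrix(A, B, Q, R, K, gamma) for K in (K1, K2)]
J_constituents = [float(x0 @ P @ x0) for P in Ps]
J_max = max_following_cost(A, B, Q, R, [K1, K2], Ps, gamma, x0)
J_opt = float(x0 @ optimal_cost_matrix(A, B, Q, R, gamma) @ x0)
print(J_constituents, J_max, J_opt)
```

On this hypothetical instance the max-following cost falls between the optimal cost and the cost of the better constituent, which matches the qualitative ordering described above.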
Appendix B Additional Proofs
See 4.1
Proof.
We will prove the claim inductively, showing that for all , if we run any approximate max-following policy for steps, and then continue following the policy chosen at step for the rest of the episode, then our expected return is not much worse than if we had followed any fixed for the whole episode.
Somewhat more formally, recalling the definition of the set of approximate max-following policies (Definitionย 2.3), at every time and state , a policy takes action for a such that . Letting denote the that follows at state and time , we will show that if at some step we have
for all , then the same holds for for all .
In the base case, , the claim
for all and all , follows straightforwardly from the definition of and setting of , since
We now prove the inductive step. We wish to show that if at step , we have for some
then continuing to follow at step and following thereafter reduces expected return by . Now if for , it must be the case that
otherwise . It follows that
(by definition of and )
(from )
(by definition of )
(by inductive hypothesis)
and so the claim holds for time , for any for which it holds for time . We showed that the base case holds for all , and therefore we have
for all . In particular, for we conclude that
and it follows that
∎
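For readers who want the shape of the induction in one place, the following is a simplified sketch under assumptions that may differ from the exact statement proved above. It assumes exact value functions and a single slack parameter $\beta$; the symbols $V_h^k$, $k_h$, $W_h$, and $\beta$ are illustrative notation introduced here. Let $V_h^k(s)$ be the expected return of following $\pi^k$ from state $s$ at step $h \in \{0, \dots, H-1\}$ to the end of the episode, let $\pi$ pick at each $(s, h)$ an index $k_h(s)$ with $V_h^{k_h(s)}(s) \ge \max_j V_h^j(s) - \beta$, and let $W_h$ be the expected return of running $\pi$ for $h$ steps and then following the policy chosen at step $h$ for the rest of the episode.

```latex
\begin{align*}
W_h &= \mathbb{E}_{\pi}\Big[\textstyle\sum_{t<h} r_t + V_h^{k_h(s_h)}(s_h)\Big], \\
W_0 &= \mathbb{E}\big[V_0^{k_0(s_0)}(s_0)\big]
      \;\ge\; \max_j \mathbb{E}\big[V_0^{j}(s_0)\big] - \beta
      && \text{(choice of } k_0\text{)} \\
W_{h+1} &= \mathbb{E}_{\pi}\Big[\textstyle\sum_{t\le h} r_t
      + V_{h+1}^{k_{h+1}(s_{h+1})}(s_{h+1})\Big] \\
  &\ge \mathbb{E}_{\pi}\Big[\textstyle\sum_{t\le h} r_t
      + V_{h+1}^{k_h(s_h)}(s_{h+1})\Big] - \beta
      && \text{(choice of } k_{h+1}\text{)} \\
  &= \mathbb{E}_{\pi}\Big[\textstyle\sum_{t<h} r_t
      + V_h^{k_h(s_h)}(s_h)\Big] - \beta
      && \text{(}\pi \text{ plays } \pi^{k_h(s_h)} \text{ at step } h\text{)} \\
  &= W_h - \beta .
\end{align*}
```

Since $W_{H-1}$ is exactly the expected return of $\pi$, unrolling the recursion under these assumptions gives a return of at least $\max_j \mathbb{E}[V_0^j(s_0)] - H\beta$.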
Appendix C Additional information about experiments
For our experiments, we use a heuristic version of MaxIteration that operates in rounds. First, the algorithm collects a set of trajectories using every policy to initialize the respective value functions. Then, in every round and for every constituent policy, the algorithm executes the max-following policy for a number of steps and then switches to the respective constituent policy. At the end of each round, the value functions of the constituent policies are updated. The number of max-following steps is uniformly spaced along the full horizon and thus depends on the number of rounds and the horizon. The total number of episodes gives an upper bound on the number of samples collected, which is the quantity we use to compare run-times between MaxIteration and IQL. Finally, we use a discount factor, which has been shown to have regularizing effects on the value function updates [Amit et al., 2020].
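The round structure described above can be sketched on a toy problem. This is a simplified illustration under stated assumptions, not our experimental implementation: the deterministic line-world MDP, the two constituent policies, the tabular value tables (standing in for learned value networks), and the increasing switch-step schedule are all hypothetical choices.

```python
import numpy as np

H, N_STATES, START = 6, 7, 3  # toy horizon, 7 positions on a line, start in the middle

def step(s, a, h):
    """Toy deterministic MDP: move by a in {-1, +1}; the rewarded direction flips mid-episode."""
    rewarded = -1 if h < H // 2 else +1
    return int(np.clip(s + a, 0, N_STATES - 1)), float(a == rewarded)

# Two hypothetical constituent policies: always move left, always move right.
POLICIES = [lambda s, h: -1, lambda s, h: +1]

def rollout(V, switch_step, k):
    """Max-follow w.r.t. V for switch_step steps, then follow constituent k.

    Returns the state at the switch, the return collected from the switch
    onward, and the total return of the episode.
    """
    s, s_switch, tail, total = START, None, 0.0, 0.0
    for h in range(H):
        if h == switch_step:
            s_switch = s
        j = k if h >= switch_step else int(np.argmax([v[h, s] for v in V]))
        s, r = step(s, POLICIES[j](s, h), h)
        total += r
        tail += r if h >= switch_step else 0.0
    return s_switch, tail, total

def max_iteration_heuristic(n_rounds):
    """Round-based sketch: roll in with max-following, roll out with each constituent."""
    V = [np.zeros((H, N_STATES)) for _ in POLICIES]
    # Initialization: one trajectory per constituent policy from the start state.
    for k in range(len(POLICIES)):
        _, tail, _ = rollout(V, 0, k)
        V[k][0, START] = tail
    # Assumption: switch steps increase and are uniformly spaced over the horizon.
    switch_steps = np.linspace(0, H - 1, n_rounds).round().astype(int)
    for h in switch_steps:
        data = [(k, *rollout(V, h, k)) for k in range(len(POLICIES))]
        for k, s_switch, tail, _ in data:
            V[k][h, s_switch] = tail  # tabular stand-in for a regression update
    return V

V = max_iteration_heuristic(n_rounds=H)
constituent_returns = [rollout(V, 0, k)[2] for k in range(len(POLICIES))]
max_following_return = rollout(V, H, 0)[2]  # switch_step = H: max-follow throughout
print(constituent_returns, max_following_return)
```

In this toy instance the learned max-following policy collects more return than either constituent, although it is still not optimal for the toy MDP.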
For IQL, we use the d3rlpy implementation [Seno and Imai, 2022] and code provided by Hussing et al. [2023].
C.1 Hyperparameters
Both algorithms are first run for steps (to initialize value functions for MaxIteration and to pre-fill the buffer for IQL) before any updates are performed, and then for steps of online training.
All neural networks are multi-layer perceptrons with ReLU activations [Glorot et al., 2011], with layers and a hidden dimension of per layer.
Hyperparameter | Value |
---|---|
Optimizer | Adam |
Adam | |
Adam | |
Adam | |
Value Function Learning Rate | |
Number of rounds | 50 |
Number of gradient steps per round | 40,000 |
Batch Size | |

Hyperparameter | Value |
---|---|
Optimizer | Adam |
Adam | |
Adam | |
Adam | |
Actor Learning Rate | |
Critic Learning Rate | |
Batch Size | #Tasks |
n_steps | |
n_critics | |
expectile | |
weight_temp | |
max_weight | |
C.2 Full results on CompoSuite
![Refer to caption](x5.png)
![Refer to caption](x6.png)
![Refer to caption](x7.png)
![Refer to caption](x8.png)
![Refer to caption](x9.png)
![Refer to caption](x10.png)
![Refer to caption](x11.png)
![Refer to caption](x12.png)
![Refer to caption](x13.png)
![Refer to caption](x14.png)
![Refer to caption](x15.png)
![Refer to caption](x16.png)
![Refer to caption](x17.png)
![Refer to caption](x18.png)
![Refer to caption](x19.png)
![Refer to caption](x20.png)
![Refer to caption](x21.png)
C.3 Results on DM Control
We run our MaxIteration algorithm on the DM Control benchmarks [Tunyasuvunakool et al., 2020], following a setup similar to that of MAPS [Liu et al., 2023]. In their setup, the constituent policies correspond to different checkpointed models from one run of the online Soft Actor-Critic [Haarnoja et al., 2018] algorithm. As a result, the latest checkpointed model will generally outperform the previous two checkpoints, meaning one constituent policy is strictly better everywhere than the others. We report the final performance over 5 seeds using 16 evaluation trajectories in Figure 5. The results show that our algorithm behaves as expected and always uses the best oracle. Without a policy improvement operator, this setup does not allow us to exceed the performance of the constituent policies.
![Refer to caption](x22.png)
![Refer to caption](x23.png)
![Refer to caption](x24.png)
C.4 Computational Resources
Our experiments were conducted using a total of GPUs, including both server-grade (e.g., NVIDIA RTX A6000) and consumer-grade (e.g., NVIDIA RTX 3090) GPUs. Training the constituent policies from offline data takes less than hours. Our MaxIteration algorithm takes about hours to train, while the baseline fine-tuning takes around hour. A large portion of the runtime cost stems from executing the simulator.