Contextualized Hybrid Ensemble Q-learning:
Learning Fast with Control Priors

Emma Cramer
[email protected]
Institute for Data Science in Mechanical Engineering
RWTH Aachen University

Bernd Frauenknecht†
[email protected]
Institute for Data Science in Mechanical Engineering
RWTH Aachen University

Ramil Sabirov†
[email protected]
Institute for Data Science in Mechanical Engineering
RWTH Aachen University

Sebastian Trimpe
[email protected]
Institute for Data Science in Mechanical Engineering
RWTH Aachen University

† These authors contributed equally.

Abstract

Combining Reinforcement Learning (RL) with a prior controller can yield the best out of two worlds: RL can solve complex nonlinear problems, while the control prior ensures safer exploration and speeds up training. Prior work largely blends both components with a fixed weight, neglecting that the RL agent’s performance varies with the training progress and across regions in the state space. Therefore, we advocate for an adaptive strategy that dynamically adjusts the weighting based on the RL agent’s current capabilities. We propose a new adaptive hybrid RL algorithm, Contextualized Hybrid Ensemble Q-learning (CHEQ). CHEQ combines three key ingredients: (i) a time-invariant formulation of the adaptive hybrid RL problem treating the adaptive weight as a context variable, (ii) a weight adaption mechanism based on the parametric uncertainty of a critic ensemble, and (iii) ensemble-based acceleration for data-efficient RL. Evaluating CHEQ on a car racing task reveals substantially stronger data efficiency, exploration safety, and transferability to unknown scenarios than state-of-the-art adaptive hybrid RL methods.

1 Introduction

Deep reinforcement learning (RL) methods have shown great success in challenging control problems such as gameplay (Mnih et al., 2015; Silver et al., 2018a; OpenAI et al., 2019) and robotic manipulation (Gupta et al., 2021; Büchler et al., 2022). Despite the great potential of RL methods, their data inefficiency, unstructured exploration behavior, and inability to generalize to unknown scenarios represent a significant hurdle to their application to real-world problems.

A prime reason for limited real-world applications is the task-agnostic architecture of state-of-the-art RL approaches (Schulman et al., 2017; Haarnoja et al., 2018) that does not incorporate prior knowledge on how to solve the task at hand. In contrast, control theory provides a rich set of methods for deriving near-optimal controllers in many applications. This motivates the drive for hybrid RL methods (Silver et al., 2018b; Johannink et al., 2019) that blend control priors with deep RL policies. Hybrid algorithms thus combine the prior controller’s generalization capabilities and informed behavior with the power of deep RL for solving general nonlinear problems.

Notwithstanding the conceptual benefit of hybrid RL formulations, how to systematically combine the control prior with the RL agent largely remains an open problem. The majority of prior work (Silver et al., 2018b; Johannink et al., 2019; Schoettler et al., 2020; Ceola et al., 2024) proposes a fixed weighting between the control prior and the RL agent. A fixed blending, however, disregards the fact that the capability of the RL agent depends on training time and state. In general, as more data is observed, the RL agent improves its behavior, ultimately outperforming the control prior in large portions of the domain. The core idea of our approach is to adapt the weighting between RL agent and control prior based on the agent’s confidence. As the RL agent improves over time, this induces a time-variant weighting mechanism. This time dependency leads to structural problems of prior formulations in uncertainty-adapted hybrid RL (Cheng et al., 2019; Rana et al., 2023).

We provide a unified view on hybrid RL that allows us to classify prior work within a general framework. Analyzing this framework highlights the necessity for a novel adaptive hybrid RL formulation with descriptive, time-invariant dynamics. We define the contextualized hybrid Markov decision process (MDP), introducing the adaptive weight as a context variable. Building upon this formulation, we propose the Contextualized Hybrid Ensemble Q-learning (CHEQ) algorithm that systematically adapts the weighting between RL agent and control prior based on an uncertainty estimate of a critic ensemble. CHEQ combines the contextualized hybrid RL formulation with uncertainty-based weight adaption and existing ensemble-based acceleration techniques for data-efficient RL.

We evaluate our algorithm on a racing task (Schier et al., 2023), which requires operating a car close to its stability limits in order to achieve maximum return. We find that compared to prior work in adaptive hybrid RL, the CHEQ algorithm shows (i) reduced failures during training, (ii) increased sample efficiency, and (iii) improved transfer behavior on unseen race tracks.

In summary, our main contributions are:

• A unified framework that allows us to classify existing approaches and reveal key limitations.

• A hybrid MDP formulation, introducing the adaptive weight as a context variable and thus addressing structural problems of prior work in hybrid RL with adaptive weighting.

• A novel hybrid RL algorithm, CHEQ, that systematically adapts the weighting between RL agent and control prior based on Q-ensemble disagreement.

2 Related Work

This section discusses relevant prior work combining RL and a control prior. We distinguish two types: hybrid RL with fixed weighting and hybrid RL with adaptive weighting between the RL agent and the controller.

Hybrid Reinforcement Learning with Fixed Weighting. Two concurrent works (Silver et al., 2018b; Johannink et al., 2019) first combined RL and a control prior and introduced the term residual RL. In residual RL, the control prior is assumed to be fixed, and the RL agent learns a residual on top of it. In this work, we use the general term hybrid RL to include approaches that adapt the controller’s weight. Silver et al. (2018b) and Johannink et al. (2019) show advantages of hybrid RL, such as sample efficiency, improved sim-to-real transfer, and robustness towards uncertainties. Hybrid RL with fixed weights has since been applied successfully to real robot insertion tasks (Schoettler et al., 2020), peg insertion under uncertainty (Ranjbar et al., 2021), driving (Kerbel et al., 2022), and learning a residual RL policy on top of a pre-trained RL agent (Ceola et al., 2024). A fixed mixing, however, cannot account for the improving capabilities of the RL agent.

Hybrid Reinforcement Learning with Adaptive Weighting. Daoudi et al. (2023) assume a given controller confidence function, employing a controller in instances of high confidence and an RL agent in other scenarios. Our work focuses on the RL agent’s confidence and proposes to estimate the confidence based on a critic ensemble. Similar to our approach, Hoel et al. (2020a; b) train an ensemble of bootstrapped Q-networks for a driving task with discrete actions. They evaluate the uncertainty as the coefficient of variation of Q-estimates and resort to safe fallback actions in case of high uncertainty. However, they do not combine controller and RL agent but switch between both. In this work, we investigate a seamless blending approach for continuous control. Rana et al. (2020b) estimate the policy uncertainty using Monte-Carlo dropout and based on this uncertainty either sample from a residual policy or the controller alone. Rana et al. (2020a) directly fuse a prior control distribution with an RL policy in a multiplicative fashion and anneal the influence of the control prior over training time. Rana et al. (2023) use a policy ensemble to estimate how certain the RL agent is in the current action. The combined action is then computed as the Bayesian posterior of control prior and policy distribution. Cheng et al. (2019) use the TD-error as an uncertainty estimate and combine controller and RL agent based on this. Both Rana et al. (2023) and Cheng et al. (2019) base their adaption mechanism on a form of policy uncertainty. Both approaches train based on the combined action, which becomes brittle when facing large distributional shifts. We further discuss this limitation in Section 5.

3 Background

The following introduces the key components and the general concept of hybrid RL.

Reinforcement Learning. RL is a method for solving sequential decision problems based on the interaction between an agent and an environment (Sutton & Barto, 2018). The environment is modeled as a discounted Markov decision process defined by the tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},p,r,\rho_{0},\gamma)$ , with state space $\mathcal{S}$ , action space $\mathcal{A}$ , and start state distribution $\rho_{0}$ . The commonly unknown transition function $p(\mathbf{s}_{t+1},r_{t+1}\mid\mathbf{s}_{t},\mathbf{a}_{t}^{\mathrm{RL}})$ describes transitions between states $\mathbf{s}_{t}\in\mathcal{S}$ and actions $\mathbf{a}^{\mathrm{RL}}_{t}\in\mathcal{A}$ . During transitions, rewards $r_{t}\in\mathbb{R}$ are emitted according to a reward function $r_{t+1}\sim r(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t})$ . The objective of the RL agent is to learn a policy $\pi^{\mathrm{RL}}(\mathbf{a}^{\mathrm{RL}}_{t}\mid\mathbf{s}_{t})$ that maximizes the expected cumulative sum of rewards discounted by $\gamma\in(0,1)$ . This results in the RL objective

$$J(\pi^{\mathrm{RL}})=\max_{\pi^{\mathrm{RL}}}\mathbb{E}_{\pi^{\mathrm{RL}},\mathcal{M}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t+1}\right]. \tag{1}$$

The discounted sum of rewards is referred to as the return and is accumulated along trajectories under the policy $\pi^{\mathrm{RL}}$ and the environment MDP $\mathcal{M}$. State value functions condition the expected return on a particular state, $V^{\pi^{\mathrm{RL}}}(\mathbf{s}_{t})=\mathbb{E}_{\pi^{\mathrm{RL}},\mathcal{M}}\left[\sum_{k=t}^{\infty}\gamma^{k-t}r_{k+1}\mid\mathbf{s}_{t}\right]$, while action value functions, or Q-functions, condition the expected return on specific state-action pairs, $Q^{\pi^{\mathrm{RL}}}(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t})=\mathbb{E}_{\pi^{\mathrm{RL}},\mathcal{M}}\left[\sum_{k=t}^{\infty}\gamma^{k-t}r_{k+1}\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t}\right]$.

Control Prior. The prior policy $\pi^{\mathrm{prior}}(\mathbf{a}^{\mathrm{prior}}_{t}\mid\mathbf{s}_{t})$ represents prior knowledge for solving the RL objective (1), while typically not providing the optimal policy over the whole domain $\mathcal{S}\times\mathcal{A}$ . This work focuses on control priors based on classic control theory. These can be derived with limited effort in many applications and often provide a good baseline for interaction with $\mathcal{M}$ . We assume a control prior that is time-invariant and without an internal state.

Hybrid Reinforcement Learning. Hybrid RL combines the control prior and the RL agent by blending their actions via some mixing function $\mathbf{a}^{\mathrm{mix}}_{t}=f(\mathbf{a}^{\mathrm{prior}}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\boldsymbol{\lambda}_{t})$ depending on a weight $\boldsymbol{\lambda}_{t}$.

4 A Unified View on Hybrid Reinforcement Learning

Next, we develop a unified view on hybrid RL that allows classifying prior methods (cf. Section 2). In the standard RL setup depicted in Figure 1(a), the RL agent $\pi^{\mathrm{RL}}$ interacts with the time-invariant MDP $\mathcal{M}$ with dynamics $p(\mathbf{s}_{t+1},r_{t+1}\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t})$ that represents the controlled system. Hybrid RL (see Figure 1(b)) incorporates a control prior $\pi^{\mathrm{prior}}$ which requires reformulating the standard framework. Here, $\pi^{\mathrm{RL}}$ and $\pi^{\mathrm{prior}}$ apply a combined action $\mathbf{a}^{\mathrm{mix}}_{t}$ to $\mathcal{M}$ . The mixing function $f$ generates a combined action by blending the individual actions based on a weighting vector $\boldsymbol{\lambda}_{t}$ provided by a weight adaption function $\Lambda$ . Within this generalized framework, prior work in hybrid RL can be categorized based on the choice of $f$ and $\Lambda$ .

4.1 Mixing Function $f$

We consider mixing functions based on a weighted sum with a weighting vector $\boldsymbol{\lambda}_{t}=[\lambda^{\mathrm{prior}}_{t},\lambda^{\mathrm{RL}}_{t}]^{\top}$

$$\mathbf{a}^{\mathrm{mix}}_{t}=f(\mathbf{a}^{\mathrm{prior}}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\boldsymbol{\lambda}_{t})=\lambda^{\mathrm{prior}}_{t}\cdot\mathbf{a}^{\mathrm{prior}}_{t}+\lambda^{\mathrm{RL}}_{t}\cdot\mathbf{a}^{\mathrm{RL}}_{t}. \tag{2}$$

This formulation allows us to distinguish between a residual and a regularized setting.

In the residual setting, $\lambda^{\mathrm{prior}}_{t}$ is typically constant while $\lambda^{\mathrm{RL}}_{t}$ can be variable $\boldsymbol{\lambda}_{t}=[1,\lambda^{\mathrm{RL}}_{t}]^{\top}$ (Silver et al., 2018b; Johannink et al., 2019; Schoettler et al., 2020). Thus, the RL agent interacts with the closed control loop between $\pi^{\mathrm{prior}}$ and $\mathcal{M}$ and learns a residual action on top of $\mathbf{a}^{\mathrm{prior}}_{t}$ . Consequently, $\boldsymbol{\lambda}_{t}$ modulates the RL agent’s impact on the closed loop dynamics. As the control prior is not scaled down, it might interpret the RL agent as a disturbance and counteract it (Ranjbar et al., 2021), which can limit the overall performance of residual formulations.

In the regularized setting, both weights are adaptable such that $\lambda^{\mathrm{prior}}_{t}+\lambda^{\mathrm{RL}}_{t}=1$ (Cheng et al., 2019; Rana et al., 2023). This results in a mixing function of the form

$$\mathbf{a}^{\mathrm{mix}}_{t}=f(\mathbf{a}^{\mathrm{prior}}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})=(1-\lambda^{\mathrm{RL}}_{t})\cdot\mathbf{a}^{\mathrm{prior}}_{t}+\lambda^{\mathrm{RL}}_{t}\cdot\mathbf{a}^{\mathrm{RL}}_{t}, \tag{3}$$

with $\lambda^{\mathrm{RL}}_{t}\in[0,1]$ . In the limit $\lambda^{\mathrm{RL}}_{t}=0$ , the control prior interacts with $\mathcal{M}$ without the interference of the RL agent, while the regularized setting reduces to the standard RL problem for $\lambda^{\mathrm{RL}}_{t}=1$ . Thus, $\lambda^{\mathrm{RL}}_{t}$ indicates not only the impact of $\mathbf{a}^{\mathrm{RL}}_{t}$ but also whether the RL agent interacts with the open loop dynamics of $\mathcal{M}$ or the closed loop dynamics as in the residual setting. Consequently, the control prior can be interpreted as a regularization of the RL agent. Our proposed algorithm operates in the regularized setting, allowing the agent to take over complete control when $\lambda^{\mathrm{RL}}_{t}=1$ .
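The two mixing functions in (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `mix_residual` and `mix_regularized` and the example action values are assumptions made for this sketch, and actions are plain lists of floats.

```python
from typing import List, Sequence


def mix_residual(a_prior: Sequence[float], a_rl: Sequence[float], lam_rl: float) -> List[float]:
    """Residual setting: Eq. (2) with lambda_prior fixed to 1, so the prior is never scaled down."""
    return [p + lam_rl * r for p, r in zip(a_prior, a_rl)]


def mix_regularized(a_prior: Sequence[float], a_rl: Sequence[float], lam_rl: float) -> List[float]:
    """Regularized setting: Eq. (3), a convex combination with lam_rl in [0, 1]."""
    if not 0.0 <= lam_rl <= 1.0:
        raise ValueError("lam_rl must lie in [0, 1]")
    return [(1.0 - lam_rl) * p + lam_rl * r for p, r in zip(a_prior, a_rl)]


# Hypothetical two-dimensional actions, e.g. (steering, throttle).
a_prior = [0.2, -1.0]
a_rl = [0.6, 1.0]

print(mix_regularized(a_prior, a_rl, 0.0))  # only the control prior acts
print(mix_regularized(a_prior, a_rl, 1.0))  # only the RL agent acts
print(mix_residual(a_prior, a_rl, 0.5))     # prior plus a scaled residual
```

The two limits of the regularized form reproduce the cases discussed in the text: pure prior control for a weight of 0 and the standard RL problem for a weight of 1.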

4.2 Weight Adaption Function $\Lambda$

Hybrid RL approaches can further be classified based on the choice of the weight adaption function $\Lambda$, which modulates the weighting vector $\boldsymbol{\lambda}_{t}$ of the mixing function $f$.

A large body of work, which we refer to as fixed-weight hybrid RL (Silver et al., 2018b; Johannink et al., 2019; Schoettler et al., 2020; Ranjbar et al., 2021; Ceola et al., 2024), keeps $\boldsymbol{\lambda}_{t}$ fixed throughout training, neglecting the time- and state-dependent capabilities of the RL agent.

Approaches that adapt $\boldsymbol{\lambda}_{t}$ , which we refer to as adaptive hybrid RL methods, rely on different mechanisms. Scheduling approaches (Rana et al., 2020a) change the weight explicitly with time, i.e. $\boldsymbol{\lambda}_{t}=\Lambda(t)$ , typically increasing the weight of the RL agent as training progresses. Domain-based approaches (Kulkarni et al., 2022; Daoudi et al., 2023) adapt the weights based on the point of operation within the domain $\mathcal{S}\times\mathcal{A}$ , i.e. $\boldsymbol{\lambda}_{t}=\Lambda(\mathbf{s}_{t},\mathbf{a}_{t})$ . Uncertainty-based approaches (Cheng et al., 2019; Rana et al., 2023) adapt the weight based on the confidence of the RL agent, indicated by an uncertainty estimate $u(\mathbf{s}_{t},\mathbf{a}_{t},t)$ , giving more weight to the RL agent when it has high confidence. Thus, they aim to leverage the benefits of the RL agent whenever possible, while resorting to a safe controller in situations where the RL agent has not seen enough data. The time dependency of the uncertainty estimate, however, increases the complexity of the hybrid RL setting, requiring a reformulation of the learning problem. In Section 5, we discuss the shortcomings of prior formulations and propose our own.

5 Contextualized Hybrid Reinforcement Learning

In Section 5.1, we propose a novel contextualized formulation of the adaptive hybrid RL problem and illustrate its benefits over prior approaches in Section 5.2. Based on that framework, we propose the Contextualized Hybrid Ensemble Q-learning (CHEQ) algorithm in Section 5.3.

5.1 General Concept of Contextualized Hybrid Reinforcement Learning

Based on the unified view provided in Section 4, we propose a general formulation for the hybrid RL problem we call contextualized hybrid RL.

The environment in the hybrid setting not only consists of the controlled system $\mathcal{M}$ but also comprises the control prior, the mixing function, and the weight adaption function. We consider both the control prior and the mixing function to be time-invariant. In contrast, the weight adaption function $\Lambda$ can have time-varying behavior, i.e. $\boldsymbol{\lambda}_{t}=\Lambda(t,\dots)$, as discussed in Section 4.2. This leads to time-varying dynamics of the hybrid environment, violating the assumption of time-invariance in the MDP formulation (Bellemare et al., 2023). Instead, we exclude $\Lambda$ from the definition of the hybrid environment and introduce the adaptive weight vector $\boldsymbol{\lambda}_{t}$ as a context variable to the agent and the environment (see Figure 1(c)). We model the hybrid environment introducing the contextualized hybrid MDP $\hat{\mathcal{M}}=(\mathcal{S},\mathcal{A},\mathcal{W},\hat{p},r,\rho_{0},\gamma)$ with $\mathcal{W}$ the set of weighting vectors $\boldsymbol{\lambda}_{t}$ and the contextualized dynamics function $\hat{p}(\mathbf{s}_{t+1},r_{t+1}\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\boldsymbol{\lambda}_{t})$.

The MDP formulation $\hat{\mathcal{M}}$ induces the contextualized hybrid RL objective

$$\hat{J}(\pi^{\mathrm{RL}})=\max_{\pi^{\mathrm{RL}}}\mathbb{E}_{\pi^{\mathrm{RL}},\hat{\mathcal{M}}}\left[\sum_{t=0}^{T}\gamma^{t}r_{t+1}\right], \tag{4}$$

which requires learning a policy $\hat{\pi}^{\mathrm{RL}}(\mathbf{a}^{\mathrm{RL}}_{t}\mid\mathbf{s}_{t},\boldsymbol{\lambda}_{t})$ that maximizes expected return in $\hat{\mathcal{M}}$. Introducing $\boldsymbol{\lambda}_{t}$ as a context variable further yields the contextualized hybrid value functions $\hat{V}^{\pi^{\mathrm{RL}}}(\mathbf{s}_{t},\boldsymbol{\lambda}_{t})=\mathbb{E}_{\pi^{\mathrm{RL}},\hat{\mathcal{M}}}\left[\sum_{k=t}^{\infty}\gamma^{k-t}r_{k+1}\mid\mathbf{s}_{t},\boldsymbol{\lambda}_{t}\right]$ and $\hat{Q}^{\pi^{\mathrm{RL}}}(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\boldsymbol{\lambda}_{t})=\mathbb{E}_{\pi^{\mathrm{RL}},\hat{\mathcal{M}}}\left[\sum_{k=t}^{\infty}\gamma^{k-t}r_{k+1}\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\boldsymbol{\lambda}_{t}\right]$. Thus, we can optimize (4) using standard RL methods by additionally conditioning on $\boldsymbol{\lambda}_{t}$. The general mechanism of contextualized hybrid RL is illustrated in Algorithm 1.
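One simple way to condition a standard RL method on the context variable is to append the weight to the network inputs. The sketch below assumes this concatenation approach; the text does not prescribe how the conditioning is implemented, and the helper names `policy_input` and `critic_input` are hypothetical.

```python
from typing import List, Sequence


def policy_input(state: Sequence[float], lam_rl: float) -> List[float]:
    """Input of a weight-conditioned policy: the state followed by the weight."""
    return list(state) + [lam_rl]


def critic_input(state: Sequence[float], action_rl: Sequence[float], lam_rl: float) -> List[float]:
    """Input of a weight-conditioned Q-function: state, RL action (not the mixed action), weight."""
    return list(state) + list(action_rl) + [lam_rl]


state = [0.1, -0.3]   # hypothetical two-dimensional state
action_rl = [0.5]     # hypothetical one-dimensional RL action
print(policy_input(state, 0.7))
print(critic_input(state, action_rl, 0.7))
```

With this choice, an off-the-shelf actor-critic implementation only needs its input dimensions increased by the size of the weight.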

Algorithm 1 Contextualized Hybrid Reinforcement Learning

1: RL policy $\hat{\pi}^{\mathrm{RL}}_{\phi}(\mathbf{a}^{\mathrm{RL}}_{t}\mid\mathbf{s}_{t},\boldsymbol{\lambda}_{t})$, control prior $\pi^{\mathrm{prior}}(\mathbf{a}^{\mathrm{prior}}_{t}\mid\mathbf{s}_{t})$, mixing function $f(\mathbf{a}^{\mathrm{prior}}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\boldsymbol{\lambda}_{t})$, weight adaption function $\Lambda$, replay buffer $\mathcal{D}\leftarrow\emptyset$
2: for each episode do
3:     Sample initial state $\mathbf{s}_{0}\sim\rho_{0}$, initialize $\boldsymbol{\lambda}_{0}$
4:     for each step do
5:         Sample RL action $\mathbf{a}^{\mathrm{RL}}_{t}\sim\hat{\pi}^{\mathrm{RL}}_{\phi}\left(\mathbf{a}^{\mathrm{RL}}_{t}\mid\mathbf{s}_{t},\boldsymbol{\lambda}_{t}\right)$
6:         Sample control prior action $\mathbf{a}^{\mathrm{prior}}_{t}\sim\pi^{\mathrm{prior}}\left(\mathbf{a}^{\mathrm{prior}}_{t}\mid\mathbf{s}_{t}\right)$
7:         Get combined action $\mathbf{a}^{\mathrm{mix}}_{t}=f(\mathbf{a}^{\mathrm{prior}}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\boldsymbol{\lambda}_{t})$
8:         Observe state transition $\mathbf{s}_{t+1},r_{t+1}\sim p\left(\cdot,\cdot\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{mix}}_{t}\right)$
9:         Store $\left(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\boldsymbol{\lambda}_{t},\mathbf{s}_{t+1},r_{t+1}\right)$ into replay buffer $\mathcal{D}$
10:         Get next adaptive weight $\boldsymbol{\lambda}_{t+1}=\Lambda$
11:         Sample set of transitions $\left(\mathbf{s},\mathbf{a}^{\mathrm{RL}},\boldsymbol{\lambda},\mathbf{s}^{\prime},r\right)\sim\mathcal{D}$
12:         Optimize $\phi$ with respect to (4) using RL with sampled transitions.
13:     end for
14: end for
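The data-collection part of Algorithm 1 can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' code: the one-dimensional system in `env_step`, the proportional `prior_policy`, the random placeholder `rl_policy`, and the linear `weight_schedule` standing in for $\Lambda$ are all hypothetical. The point it shows is that the buffer stores the RL action and the weight, while the environment receives the mixed action.

```python
import random


def prior_policy(s):
    """Hypothetical proportional controller that pushes the state towards zero."""
    return -0.5 * s


def rl_policy(s, lam, rng):
    """Placeholder for the learned, weight-conditioned policy (here: random actions)."""
    return rng.uniform(-1.0, 1.0)


def mix(a_prior, a_rl, lam):
    """Regularized mixing function, Eq. (3)."""
    return (1.0 - lam) * a_prior + lam * a_rl


def env_step(s, a_mix):
    """Toy one-dimensional system; the reward penalizes distance from the origin."""
    s_next = s + 0.1 * a_mix
    return s_next, -abs(s_next)


def weight_schedule(step, total_steps):
    """One possible weight adaption function: a linear schedule over training."""
    return min(1.0, step / total_steps)


def collect(episodes=2, steps=5, seed=0):
    rng = random.Random(seed)
    buffer, global_step = [], 0
    total_steps = episodes * steps
    for _ in range(episodes):
        s = rng.uniform(-1.0, 1.0)                      # sample initial state
        lam = weight_schedule(global_step, total_steps)  # initialize the weight
        for _ in range(steps):
            a_rl = rl_policy(s, lam, rng)
            a_prior = prior_policy(s)
            a_mix = mix(a_prior, a_rl, lam)              # only the mixed action reaches the system
            s_next, r = env_step(s, a_mix)
            buffer.append((s, a_rl, lam, s_next, r))     # store the RL action and the weight
            global_step += 1
            lam = weight_schedule(global_step, total_steps)
            s = s_next
            # A mini-batch update of policy and critics from `buffer` would go here.
    return buffer


transitions = collect()
print(len(transitions), transitions[0])
```

Storing `a_rl` together with `lam` is what lets the learner model the effect of the mixing process instead of treating the mixed action as its own.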

Prior work takes different approaches to formulating the hybrid learning problem. Approaches with a time-invariant weight adaption function, such as fixed-weight hybrid methods, include $\Lambda$ in the definition of the environment MDP $\bar{\mathcal{M}}=(\mathcal{S},\mathcal{A},\bar{p},r,\rho_{0},\gamma)$ with dynamics $\bar{p}(\mathbf{s}_{t+1},r_{t+1}\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t})$. This formulation is directly applicable to the standard RL objective (1) but does not generalize to time-varying adaption mechanisms such as uncertainty-adapted methods. Approaches with a time-varying adaption mechanism (Cheng et al., 2019; Rana et al., 2023) typically formulate the hybrid RL problem with respect to the controlled system MDP and the combined action, with dynamics $p(\mathbf{s}_{t+1},r_{t+1}\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{mix}}_{t})$. This likewise yields a formulation that is directly applicable to (1). However, it can lead to problems because the agent is unaware of the downstream mixing process. Furthermore, it introduces a distributional shift between the trained policy and the data-collecting behavior policy, which can lead to training instability and divergence (Kumar et al., 2020; Fujimoto et al., 2018).

5.2 Illustrative Example

We exemplify the strength of the contextualized hybrid RL formulation based on $\hat{\mathcal{M}}$ introduced in Section 5.1 by comparing it to prior approaches on the cart pole system depicted in Figure 2(a). The goal is to balance the pole upright while keeping the cart close to its initial position. The system is controlled via continuous forces on the cart. We choose $\pi^{\mathrm{prior}}$ to apply a constant force to the left, which destabilizes formulations that are unaware of the mixing process. We investigate a time-invariant fixed-weight setting as well as a time-varying schedule-based weight adaption setting to highlight the capability of the respective formulations to deal with both scenarios. This simple example illustrates that the contextualized hybrid MDP formulation can deal with destabilizing control priors and time-varying weight adaption functions, while prior formulations fail.

Fixed Weighting. First, we consider a residual setting with fixed weights $\lambda^{\mathrm{RL}}_{t}=\lambda^{\mathrm{prior}}_{t}=0.5$. Figure 2(b) depicts the performance of RL agents trained under $\hat{\mathcal{M}}$, $\bar{\mathcal{M}}$, and $\mathcal{M}$ with respective dynamics $\hat{p}(\mathbf{s}_{t+1},r_{t+1}\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\boldsymbol{\lambda}_{t})$, $\bar{p}(\mathbf{s}_{t+1},r_{t+1}\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t})$, and $p(\mathbf{s}_{t+1},r_{t+1}\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{mix}}_{t})$. While agents trained under $\hat{\mathcal{M}}$ and $\bar{\mathcal{M}}$ learn to stabilize the cart pole, the hybrid formulation concerning $\mathcal{M}$ fails. When the RL problem is formulated concerning $\mathbf{a}^{\mathrm{mix}}_{t}$, the RL agent observes the combined action in its data and therefore learns the combined action in its policy. This, however, neglects the fact that the policy action is mixed with the controller action before being applied to $\mathcal{M}$. Assuming the cart pole is not moving and the pole is upright, an agent trained under $\mathcal{M}$ provides the optimal combined action, namely applying no force, while $\mathbf{a}^{\mathrm{prior}}_{t}$ pushes the pole to the left. This results in $\mathbf{a}^{\mathrm{mix}}_{t}$ pointing to the left, causing the pole to fall while giving the agent no mechanism to observe and counteract this phenomenon. Instead, formulating the hybrid RL problem concerning $\mathbf{a}^{\mathrm{RL}}_{t}$ allows the agent to observe the mixing mechanism and compensate for the destabilizing control prior.

Adaptive Weighting. Second, we consider an adaptive hybrid RL problem with time-varying $\Lambda$. We choose a schedule-based approach with a regularizing mixing function (3) and $\lambda^{\mathrm{RL}}_{t}\in[0,1]$ linearly increasing over time. Figure 2(c) shows the performance of agents trained under formulations based on $\hat{\mathcal{M}}$, $\bar{\mathcal{M}}$, and $\mathcal{M}$. While agents trained under $\bar{\mathcal{M}}$ and $\mathcal{M}$ fail, the formulation based on $\hat{\mathcal{M}}$ succeeds. In the beginning, when the RL agent is given only low weight, the formulation under $\mathcal{M}$ suffers from a high distributional shift between the action of the behavior policy $\mathbf{a}^{\mathrm{mix}}_{t}$ and the action of the target policy $\mathbf{a}^{\mathrm{RL}}_{t}$. The large distributional shift causes the agent to diverge (Kumar et al., 2020; Fujimoto et al., 2018). Although the distributional shift decreases with increasing weight $\lambda^{\mathrm{RL}}_{t}$, the agent does not manage to recover. The formulation under $\bar{\mathcal{M}}$ fails because it lacks information about the time-variant behavior of the mixing process. The proposed contextualized hybrid RL formulation solves these issues by formulating the task concerning $\mathbf{a}^{\mathrm{RL}}_{t}$ and introducing the context variable $\boldsymbol{\lambda}_{t}$.

5.3 Contextualized Hybrid Ensemble Q-learning (CHEQ)

Based on the contextualized hybrid RL formulation introduced in Section 5.1, we propose the Contextualized Hybrid Ensemble Q-learning (CHEQ) algorithm. At the heart of CHEQ is a critic ensemble that (i) provides an uncertainty estimate enabling an uncertainty-adapted hybrid RL mechanism, and (ii) allows incorporating ensemble-based acceleration techniques for data-efficient RL.

We base CHEQ on the Soft Actor-Critic (SAC) (Haarnoja et al., 2018) algorithm and a regularizing mixing function (3) with $\boldsymbol{\lambda}_{t}=[(1-\lambda^{\mathrm{RL}}_{t}),\lambda^{\mathrm{RL}}_{t}]^{\top}$. The weight adaption mechanism relies on a critic ensemble comprising $E$ contextualized Q-functions with parameters $\theta_{e}$, $e\in\{1,\dots,E\}$, and corresponding target Q-functions with parameters $\bar{\theta}_{e}$, $e\in\{1,\dots,E\}$. We update the critics with the mechanism of Randomized Ensemble Double Q-learning (REDQ) (Chen et al., 2021) and enforce sufficient independence between Q-estimates using Bernoulli masking of the training data (Osband et al., 2016; Lee et al., 2021; Mai et al., 2022). Model ensembles estimate parametric uncertainty, referred to as epistemic uncertainty, from disagreement between individual models within the ensemble. If different critics disagree about the outcome of taking action $\mathbf{a}^{\mathrm{RL}}_{t}$ in $\mathbf{s}_{t}$ while weighting with $\boldsymbol{\lambda}_{t}$, this indicates a weak understanding of the task in the particular area of $\mathcal{S}\times\mathcal{A}\times\mathcal{W}$. Thus, the control prior should be prioritized over the RL agent in such situations. Therefore, epistemic uncertainty represents a suitable quantity for adapting the weighting of control prior and RL agent. We define epistemic uncertainty as the standard deviation of critic predictions

$$u(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})=\sqrt{\frac{1}{E}\sum_{e=1}^{E}\left(\hat{Q}_{\theta_{e}}(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})-\mu(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})\right)^{2}} \tag{5}$$

with $\mu(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})=\frac{1}{E}\sum_{e=1}^{E}\hat{Q}_{\theta_{e}}(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})$. We aim to give low weight to the RL agent in areas of high uncertainty and vice versa. Thus, the weight adaption function $\Lambda(u(\mathbf{s}_{t},\mathbf{a}_{t}^{\mathrm{RL}},\lambda^{\mathrm{RL}}_{t}))$ maps the critics' epistemic uncertainty to the weighting factor $\lambda^{\mathrm{RL}}_{t}\in[\lambda_{\mathrm{min}},\lambda_{\mathrm{max}}]\subseteq[0,1]$ via the piecewise linear function

$$\lambda^{\mathrm{RL}}_{t+1}=\begin{cases}\lambda_{\mathrm{max}}&\text{if }u(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})<u_{\mathrm{min}}\\ \frac{u(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})-u_{\mathrm{max}}}{u_{\mathrm{min}}-u_{\mathrm{max}}}(\lambda_{\mathrm{max}}-\lambda_{\mathrm{min}})+\lambda_{\mathrm{min}}&\text{if }u(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})\in[u_{\mathrm{min}},u_{\mathrm{max}}]\\ \lambda_{\mathrm{min}}&\text{if }u(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})>u_{\mathrm{max}}.\end{cases} \tag{6}$$
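Equations (5) and (6) translate directly into code. The following sketch takes the ensemble's Q-estimates for one state, RL action, and weight as a plain list; the function names and all numerical values (Q-estimates, thresholds, weight bounds) are hypothetical and chosen only for illustration.

```python
import statistics
from typing import Sequence


def epistemic_uncertainty(q_values: Sequence[float]) -> float:
    """Eq. (5): population standard deviation (1/E normalization) of the critics' Q-estimates."""
    return statistics.pstdev(q_values)


def adapt_weight(u: float, u_min: float, u_max: float, lam_min: float, lam_max: float) -> float:
    """Eq. (6): piecewise linear map from uncertainty to the next RL weight."""
    if u < u_min:
        return lam_max   # confident agent: maximum RL weight
    if u > u_max:
        return lam_min   # uncertain agent: fall back towards the control prior
    return (u - u_max) / (u_min - u_max) * (lam_max - lam_min) + lam_min


# Hypothetical Q-estimates of E = 5 critics for one (state, RL action, weight) triple.
q_values = [10.0, 12.0, 11.0, 9.0, 13.0]
u = epistemic_uncertainty(q_values)
lam_next = adapt_weight(u, u_min=0.5, u_max=2.5, lam_min=0.2, lam_max=1.0)
print(u, lam_next)
```

The linear segment is continuous with both constant branches: it returns `lam_max` at `u_min` and `lam_min` at `u_max`, so the weight never jumps as the uncertainty crosses a threshold.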

Besides providing an uncertainty estimate of the RL agent, the critic ensemble used in CHEQ has proven effective in mitigating overestimation bias (Thrun & Schwartz, 1993) in Q-learning-based approaches (Lan et al., 2021; Wang et al., 2021; Chen et al., 2021). The Update-To-Data (UTD) ratio describes the number of gradient steps per environment interaction. Due to the reduction of the overestimation bias, the critic ensemble allows for increasing the UTD ratio while maintaining stable learning. This substantially improves the data efficiency of value-based actor-critic methods (Chen et al., 2021). A detailed pseudocode algorithm of CHEQ is provided in Algorithm 2 of Appendix A.
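The two REDQ ingredients mentioned above can be sketched briefly. The target below follows the in-target minimization over a random critic subset described by Chen et al. (2021) with a SAC-style entropy term; it is an illustration on scalar values, not the CHEQ implementation, and the function names, arguments, and example numbers are assumptions of this sketch.

```python
import random
from typing import Sequence


def redq_target(reward: float, gamma: float, next_q_values: Sequence[float],
                subset_size: int, rng: random.Random,
                alpha: float = 0.0, next_log_prob: float = 0.0,
                done: bool = False) -> float:
    """Bootstrapped target using the minimum over a random subset of target critics."""
    subset = rng.sample(list(next_q_values), subset_size)
    soft_value = min(subset) - alpha * next_log_prob
    return reward + (0.0 if done else gamma * soft_value)


def gradient_steps(env_steps: int, utd_ratio: int) -> int:
    """Total critic updates: the UTD ratio is the number of updates per environment step."""
    return env_steps * utd_ratio


rng = random.Random(0)
next_qs = [1.0, 3.0, 2.0, 5.0]   # hypothetical target-critic estimates at the next state
print(redq_target(1.0, 0.99, next_qs, subset_size=2, rng=rng))
print(gradient_steps(1000, 1), gradient_steps(1000, 20))
```

Taking the minimum over a small random subset keeps the target pessimistic enough to counter overestimation, which is what makes many updates per environment step tolerable.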

6 Experiments

We evaluate CHEQ on a racing task and compare it to standard RL, fixed-weighting hybrid RL, and state-of-the-art adaptive hybrid RL. In our experiments, CHEQ yields substantial improvements in (i) data efficiency compared to other hybrid methods, as well as (ii) exploration safety, and (iii) zero-shot transferability to unknown scenarios as compared to all competitor approaches.

6.1 Experimental Setup

We base our evaluation on a car racing setting adapted from Schier et al. (2023). Achieving high returns requires advanced trajectory planning and control while operating the vehicle close to stability limits, including tire slip. The control prior follows the center line of the track, using a Stanley controller (Thrun et al., 2006) for lateral control and a proportional controller for longitudinal control. Further details are provided in Appendix B.

We compare CHEQ against the standard RL approaches SAC (Haarnoja et al., 2018) and REDQ (Chen et al., 2021), fixed-weighting hybrid RL based on SAC, and the state-of-the-art adaptive hybrid RL methods Controller Regularized RL (CORE) (Cheng et al., 2019) and Bayesian Controller Fusion (BCF) (Rana et al., 2023). In all experiments, we provide results for CHEQ with a high UTD ratio (CHEQ-UTD20) to demonstrate the capabilities of the approach and a low UTD ratio (CHEQ-UTD1) for a fair comparison to SAC-based methods. All implementations (the code is available at github.com/Data-Science-in-Mechanical-Engineering/cheq) are based on either the CleanRL library (Huang et al., 2022) or the original paper implementation (Rana et al., 2023). We provide a detailed description of the hyperparameter settings in Appendix A.1.

We train all our approaches on ten random seeds and one fixed race track. We report return and cumulative training failures. A run is considered a failure when the car leaves the track. For zero-shot transfer, we evaluate ten trained models per algorithm on ten unseen racetracks. Return and failure plots show the respective mean (solid lines) and 95% confidence interval (shaded areas).

6.2 Evaluation on the Car Racing Environment

We compare CHEQ to standard RL, fixed-weight hybrid RL, and adaptive hybrid RL concerning learning performance (see Figure 3) and zero-shot transfer to unknown tracks (see Figure 4).

Comparison against Fixed-Weight Hybrid RL and Standard RL. Comparing the CHEQ algorithm based on SAC (CHEQ-UTD1) to a standalone SAC agent and the control prior in Figure 3(a) illustrates the general benefit of hybrid RL. While the control prior operates safely without failing, it shows limited performance due to its conservative driving policy. SAC shows strong asymptotic performance at the cost of frequent failures throughout training. CHEQ-UTD1 considerably outperforms SAC concerning data efficiency and exploration safety, learning faster with fewer failures while yielding comparable asymptotic performance. The comparison of CHEQ to fixed-weight hybrid RL methods further illustrates the advantage of an adaptive weighting scheme. The fixed-weight hybrid RL approaches (0.5-SAC, 0.7-SAC) combine the control prior with a SAC agent using the mixing function in (3) with $\lambda^{\mathrm{RL}}_{t}=0.5$ and $\lambda^{\mathrm{RL}}_{t}=0.7$, respectively. Here, the choice of $\lambda^{\mathrm{RL}}_{t}$ represents a trade-off between exploration safety and asymptotic performance, where a higher $\lambda^{\mathrm{RL}}_{t}$ enables better performance while reducing safety. A fixed weight of $\lambda^{\mathrm{RL}}_{t}=0.5$ arguably reduces failures compared to CHEQ-UTD1; however, this comes at the cost of substantially lower performance.

Update-To-Data Ratio. As discussed in Section 5.3, the critic ensemble of CHEQ allows the use of acceleration techniques originally proposed in the REDQ algorithm. Increasing the UTD ratio to $20$ notably improves data efficiency as compared to SAC, both as a standalone RL algorithm (REDQ) and as an adaptive hybrid RL algorithm (CHEQ-UTD20). The speed-up in training helps to reduce training failures, as REDQ shows a drastically reduced number of failures compared to SAC. The benefit is further amplified in the adaptive hybrid formulation of CHEQ-UTD20, as indicated by its strong initial performance and the ability to reduce the mean cumulative fails to less than $20$.

Comparison against state-of-the-art Adaptive Hybrid RL. Finally, Figure 3(c) compares CHEQ to the most relevant adaptive hybrid RL methods. CHEQ-UTD1 shows similar data efficiency and performance compared to CORE and BCF while considerably reducing accumulated fails. CHEQ-UTD20 substantially outperforms all competitor approaches in all performance metrics. A more detailed hyperparameter analysis of the prior approaches, as well as results for reformulations of CORE and BCF as contextualized hybrid RL methods, are provided in Appendix C.

Zero-shot Transfer. Next, we perform a zero-shot transfer of the trained agents. Returns are depicted in Figure 4(a), while Figure 4(b) shows the success rate of the respective methods. CHEQ-UTD1, CHEQ-UTD20, and the control prior achieve success rates of 97%, 95%, and 100%, respectively. The other standard and hybrid RL methods frequently fail in unseen scenarios. While the CHEQ variants fail slightly more often than the controller, they drive notably faster, i.e., they achieve higher returns. Figure 4(c) illustrates the adaption mechanism of CHEQ on one example track: in challenging and unseen curves, the agent gradually hands over to the control prior. We find that in the few failure cases (3 out of 100 for CHEQ-UTD1 and 5 out of 100 for CHEQ-UTD20), the agent correctly identifies its uncertainty and hands over to the controller, but the controller is unable to navigate the situation safely. We provide an illustration of all transfer tracks, as well as the weight adaption of CHEQ-UTD20 on these tracks, in Appendix C.2. In summary, CHEQ shows strong zero-shot transfer behavior, driving faster than the controller with only a few failure cases.

Summary. We summarize the trade-off between failures and asymptotic performance in Figure 5. Figure 5(a) illustrates the training results of the respective approaches while Figure 5(b) depicts the transfer results. Fixed weight hybrid RL effectively reduces failures as compared to standard SAC. This, however, comes at the cost of asymptotic performance. Our adaptive CHEQ algorithm avoids this trade-off, achieving high return with only a few failures. In zero-shot transfer, the CHEQ agent again performs best due to its ability to detect unforeseen situations reliably and then fall back to the safe control prior.

7 Conclusion

This work addresses how to systematically combine an RL agent with a control prior. We propose a novel formulation of the adaptive hybrid RL problem which introduces the adaptive weighting parameter as a context variable of the MDP, and based on this, propose the Contextualized Hybrid Ensemble Q-learning (CHEQ) algorithm. CHEQ combines a reliable critic uncertainty-based weight adaption mechanism with the data efficiency of critic ensemble methods, yielding substantially stronger results than state-of-the-art adaptive hybrid RL methods on a racing task concerning data efficiency, exploration safety, and transferability.

Acknowledgments

We thank Paul Brunzema, Johanna Menn, and David Stenger for their helpful comments. We also thank Devdutt Subhasish and Lukas Kesper for their help with the cartpole example. This work was funded in part by the German Federal Ministry of Education and Research (“Demonstrations- und Transfernetzwerk KI in der Produktion (ProKI-Netz)” initiative, grant number 02P22A010) and the German Federal Ministry for Economic Affairs and Climate Action (project EEMotion). Computations were performed with computing resources granted by RWTH Aachen University under the projects <thes1594>, <rwth1490>, and <rwth1501>.

References

Bellemare et al. (2023) Marc G. Bellemare, Will Dabney, and Mark Rowland. Distributional Reinforcement Learning. MIT Press, 2023. http://www.distributional-rl.org.
Büchler et al. (2022) Dieter Büchler, Simon Guist, Roberto Calandra, Vincent Berenz, Bernhard Schölkopf, and Jan Peters. Learning to Play Table Tennis From Scratch Using Muscular Robots. IEEE Transactions on Robotics, 2022.
Ceola et al. (2024) Federico Ceola, Lorenzo Rosasco, and Lorenzo Natale. RESPRECT: Speeding-up Multi-fingered Grasping with Residual Reinforcement Learning, 2024. arXiv:2401.14858 [cs].
Chen et al. (2021) Xinyue Chen, Che Wang, Zijian Zhou, and Keith Ross. Randomized Ensembled Double Q-Learning: Learning Fast Without a Model, 2021.
Cheng et al. (2019) Richard Cheng, Abhinav Verma, Gabor Orosz, Swarat Chaudhuri, Yisong Yue, and Joel Burdick. Control Regularization for Reduced Variance Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning. PMLR, 2019.
Daoudi et al. (2023) Paul Daoudi, Bogdan Robu, Christophe Prieur, Ludovic Dos Santos, and Merwan Barlier. Enhancing Reinforcement Learning Agents with Local Guides. In International Conference on Autonomous Agents and Multiagent Systems, 2023.
Fujimoto et al. (2018) Scott Fujimoto, Herke van Hoof, and David Meger. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning. PMLR, 2018.
Gupta et al. (2021) Abhishek Gupta, Justin Yu, Tony Z. Zhao, Vikash Kumar, Aaron Rovinsky, Kelvin Xu, Thomas Devlin, and Sergey Levine. Reset-Free Reinforcement Learning via Multi-Task Learning: Learning Dexterous Manipulation Behaviors without Human Intervention. In IEEE International Conference on Robotics and Automation (ICRA), 2021.
Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning. PMLR, 2018.
Hoel et al. (2020a) Carl-Johan Hoel, Tommy Tram, and Jonas Sjöberg. Reinforcement Learning with Uncertainty Estimation for Tactical Decision-Making in Intersections. In IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), 2020a.
Hoel et al. (2020b) Carl-Johan Hoel, Krister Wolff, and Leo Laine. Tactical Decision-Making in Autonomous Driving by Reinforcement Learning with Uncertainty Estimation. In IEEE Intelligent Vehicles Symposium (IV), 2020b.
Huang et al. (2022) Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, and João G. M. Araújo. CleanRL: High-quality Single-file Implementations of Deep Reinforcement Learning Algorithms. Journal of Machine Learning Research, 2022.
Johannink et al. (2019) Tobias Johannink, Shikhar Bahl, Ashvin Nair, Jianlan Luo, Avinash Kumar, Matthias Loskyll, Juan Aparicio Ojea, Eugen Solowjow, and Sergey Levine. Residual reinforcement learning for robot control. In International Conference on Robotics and Automation (ICRA), 2019.
Kerbel et al. (2022) Lindsey Kerbel, Beshah Ayalew, Andrej Ivanco, and Keith Loiselle. Residual Policy Learning for Powertrain Control. IFAC-PapersOnLine, 2022.
Kulkarni et al. (2022) Padmaja Kulkarni, Jens Kober, Robert Babuška, and Cosimo Della Santina. Learning assembly tasks in a few minutes by combining impedance control and residual recurrent reinforcement learning. Advanced Intelligent Systems, 2022.
Kumar et al. (2020) Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-Learning for Offline Reinforcement Learning. In Advances in Neural Information Processing Systems, volume 33. Curran Associates, Inc., 2020.
Lan et al. (2021) Qingfeng Lan, Yangchen Pan, Alona Fyshe, and Martha White. Maxmin Q-learning: Controlling the Estimation Bias of Q-learning, 2021.
Lee et al. (2021) Kimin Lee, Michael Laskin, Aravind Srinivas, and Pieter Abbeel. Sunrise: A simple unified framework for ensemble learning in deep reinforcement learning. In International Conference on Machine Learning. PMLR, 2021.
Mai et al. (2022) Vincent Mai, Kaustubh Mani, and Liam Paull. Sample efficient deep reinforcement learning via uncertainty estimation. arXiv preprint arXiv:2201.01666, 2022.
Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, and Georg Ostrovski. Human-level control through deep reinforcement learning. Nature, 518(7540), 2015.
OpenAI et al. (2019) OpenAI, Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique P. d O. Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, and Susan Zhang. Dota 2 with Large Scale Deep Reinforcement Learning, 2019.
Osband et al. (2016) Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep Exploration via Bootstrapped DQN. In Advances in Neural Information Processing Systems, volume 29, 2016.
Rana et al. (2020a) Krishan Rana, Vibhavari Dasagi, Ben Talbot, Michael Milford, and Niko Sünderhauf. Multiplicative Controller Fusion: Leveraging Algorithmic Priors for Sample-efficient Reinforcement Learning and Safe Sim-To-Real Transfer. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020a.
Rana et al. (2020b) Krishan Rana, Ben Talbot, Vibhavari Dasagi, Michael Milford, and Niko Sünderhauf. Residual Reactive Navigation: Combining Classical and Learned Navigation Strategies For Deployment in Unknown Environments. In 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020b.
Rana et al. (2023) Krishan Rana, Vibhavari Dasagi, Jesse Haviland, Ben Talbot, Michael Milford, and Niko Sünderhauf. Bayesian controller fusion: Leveraging control priors in deep reinforcement learning for robotics. The International Journal of Robotics Research, 2023.
Ranjbar et al. (2021) Alireza Ranjbar, Ngo Anh Vien, Hanna Ziesche, Joschka Boedecker, and Gerhard Neumann. Residual Feedback Learning for Contact-Rich Manipulation Tasks with Uncertainty. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021.
Schier et al. (2023) Maximilian Schier, Christoph Reinders, and Bodo Rosenhahn. Learned Fourier Bases for Deep Set Feature Extractors in Automotive Reinforcement Learning. In IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), Bilbao, Spain, 2023.
Schoettler et al. (2020) Gerrit Schoettler, Ashvin Nair, Jianlan Luo, Shikhar Bahl, Juan Aparicio Ojea, Eugen Solowjow, and Sergey Levine. Deep reinforcement learning for industrial insertion tasks with visual inputs and natural rewards. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.
Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347, 2017.
Silver et al. (2018a) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 2018a.
Silver et al. (2018b) Tom Silver, Kelsey Allen, Josh Tenenbaum, and Leslie Kaelbling. Residual policy learning. arXiv preprint arXiv:1812.06298, 2018b.
Sutton & Barto (2018) Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
Thrun & Schwartz (1993) Sebastian Thrun and Anton Schwartz. Issues in Using Function Approximation for Reinforcement Learning. In Proceedings of the Fourth Connectionist Models Summer School, 1993.
Thrun et al. (2006) Sebastian Thrun, Mike Montemerlo, Hendrik Dahlkamp, David Stavens, Andrei Aron, James Diebel, Philip Fong, John Gale, Morgan Halpenny, Gabriel Hoffmann, Kenny Lau, Celia Oakley, Mark Palatucci, Vaughan Pratt, Pascal Stang, Sven Strohband, Cedric Dupont, Lars-Erik Jendrossek, Christian Koelen, Charles Markey, Carlo Rummel, Joe van Niekerk, Eric Jensen, Philippe Alessandrini, Gary Bradski, Bob Davies, Scott Ettinger, Adrian Kaehler, Ara Nefian, and Pamela Mahoney. Stanley: The robot that won the DARPA Grand Challenge. Journal of Field Robotics, 2006.
Wang et al. (2021) Hang Wang, Sen Lin, and Junshan Zhang. Adaptive Ensemble Q-learning: Minimizing Estimation Bias via Error Feedback. In Advances in Neural Information Processing Systems, volume 34, 2021.

Appendix A Algorithmic Details

Algorithm 2 shows the pseudocode of the Contextualized Hybrid Ensemble Q-learning algorithm.

Algorithm 2 CHEQ

Initialize control prior $\pi^{\mathrm{prior}}(\mathbf{a}^{\mathrm{prior}}_{t}\mid\mathbf{s}_{t})$, contextualized RL policy $\hat{\pi}^{\mathrm{RL}}_{\phi}(\mathbf{a}^{\mathrm{RL}}_{t}\mid\mathbf{s}_{t},\lambda^{\mathrm{RL}}_{t})$, contextualized critic ensemble $\hat{Q}_{\theta_{e}}(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})$ for $e\in\{1,\dots,E\}$, contextualized target critic ensemble $\hat{Q}_{\bar{\theta}_{e}}(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})$ for $e\in\{1,\dots,E\}$, replay buffer $\mathcal{D}\leftarrow\emptyset$, weighting interval $[\lambda_{\mathrm{min}},\lambda_{\mathrm{max}}]$, uncertainty limits $[u_{\mathrm{min}},u_{\mathrm{max}}]$, UTD ratio $G$, Bernoulli masking rate $\kappa$, minimization targets $F$, Polyak averaging factor $\tau$

for each epoch do
    $\mathbf{s}_{0}\sim\rho_{0}$
    $\lambda^{\mathrm{RL}}_{0}=\lambda_{\mathrm{min}}$
    for each epoch step do
        $\mathbf{a}^{\mathrm{RL}}_{t}\sim\hat{\pi}^{\mathrm{RL}}_{\phi}(\cdot\mid\mathbf{s}_{t},\lambda^{\mathrm{RL}}_{t})$
        $\mathbf{a}^{\mathrm{prior}}_{t}\sim\pi^{\mathrm{prior}}(\cdot\mid\mathbf{s}_{t})$
        $\mathbf{a}^{\mathrm{mix}}_{t}=(1-\lambda^{\mathrm{RL}}_{t})\mathbf{a}^{\mathrm{prior}}_{t}+\lambda^{\mathrm{RL}}_{t}\mathbf{a}^{\mathrm{RL}}_{t}$
        Compute $u(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})$ according to (5)
        $\lambda^{\mathrm{RL}}_{t+1}=\Lambda(u(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t}))$ according to (6)
        $\mathbf{s}_{t+1},r_{t+1}\sim\hat{p}(\cdot,\cdot\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})$
        for $e=1,\dots,E$ do
            Sample Bernoulli mask $m_{t}^{e}\sim\mathrm{Ber}(\kappa)$
        end for
        $\mathcal{D}\leftarrow\mathcal{D}\cup\{(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t},\mathbf{s}_{t+1},r_{t+1}),m_{t}^{1},\dots,m_{t}^{E}\}$
        for $G$ updates do
            Sample mini-batch $\mathcal{B}=\{(\mathbf{s},\mathbf{a}^{\mathrm{RL}},\lambda^{\mathrm{RL}},\mathbf{s}^{\prime},r)\}$ from $\mathcal{D}$
            Sample a set $\mathcal{F}$ with $|\mathcal{F}|=F$ uniformly at random from $\{1,\dots,E\}$
            $\tilde{\mathbf{a}}^{\prime\mathrm{RL}}\sim\hat{\pi}^{\mathrm{RL}}_{\phi}(\cdot\mid\mathbf{s}^{\prime},\lambda^{\mathrm{RL}})$
            $y=r+\gamma\left(\min_{e\in\mathcal{F}}\hat{Q}_{\bar{\theta}_{e}}(\mathbf{s}^{\prime},\tilde{\mathbf{a}}^{\prime\mathrm{RL}},\lambda^{\mathrm{RL}})-\alpha\log\hat{\pi}^{\mathrm{RL}}_{\phi}(\tilde{\mathbf{a}}^{\prime\mathrm{RL}}\mid\mathbf{s}^{\prime},\lambda^{\mathrm{RL}})\right)$
            for $e=1,\dots,E$ do
                Update $\theta_{e}$ with gradient descent using $\mathbbm{1}_{m^{e}}\nabla_{\theta_{e}}\frac{1}{|\mathcal{B}|}\sum_{(\mathbf{s},\mathbf{a}^{\mathrm{RL}},\lambda^{\mathrm{RL}},\mathbf{s}^{\prime},r)\in\mathcal{B}}\left(\hat{Q}_{\theta_{e}}(\mathbf{s},\mathbf{a}^{\mathrm{RL}},\lambda^{\mathrm{RL}})-y\right)^{2}$
                $\bar{\theta}_{e}\leftarrow\tau\bar{\theta}_{e}+(1-\tau)\theta_{e}$
            end for
            Update $\phi$ with gradient ascent using $\tilde{\mathbf{a}}^{\mathrm{RL}}\sim\hat{\pi}^{\mathrm{RL}}_{\phi}(\cdot\mid\mathbf{s},\lambda^{\mathrm{RL}})$ and $\nabla_{\phi}\frac{1}{|\mathcal{B}|}\sum_{\mathbf{s}\in\mathcal{B}}\left(\frac{1}{E}\sum_{e=1}^{E}\hat{Q}_{\theta_{e}}(\mathbf{s},\tilde{\mathbf{a}}^{\mathrm{RL}},\lambda^{\mathrm{RL}})-\alpha\log\hat{\pi}^{\mathrm{RL}}_{\phi}(\tilde{\mathbf{a}}^{\mathrm{RL}}\mid\mathbf{s},\lambda^{\mathrm{RL}})\right)$
        end for
    end for
end for
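The target computation and the Bernoulli masking in Algorithm 2 can be illustrated with a minimal Python sketch. It operates on plain lists of scalar critic outputs for a single transition rather than on neural networks, and the function names (`sample_masks`, `soft_td_target`) are hypothetical and not taken from the released code. Terminal-state handling is omitted, as it does not appear in the pseudocode.

```python
import random


def sample_masks(num_critics, kappa, rng):
    """Draw one Bernoulli(kappa) mask per critic for a newly stored transition."""
    return [1 if rng.random() < kappa else 0 for _ in range(num_critics)]


def soft_td_target(reward, next_q_values, next_log_prob, gamma, alpha, num_min, rng):
    """Soft TD target y for one transition.

    next_q_values holds the target-critic estimates for the next state and the
    freshly sampled next action, one entry per ensemble member e = 1..E.
    The minimum is taken over a random subset of size num_min (F in Algorithm 2).
    """
    subset = rng.sample(range(len(next_q_values)), num_min)
    min_q = min(next_q_values[e] for e in subset)
    return reward + gamma * (min_q - alpha * next_log_prob)


if __name__ == "__main__":
    rng = random.Random(0)
    masks = sample_masks(num_critics=5, kappa=0.8, rng=rng)
    y = soft_td_target(
        reward=0.1,
        next_q_values=[1.2, 0.9, 1.1, 1.0, 1.3],
        next_log_prob=-1.5,
        gamma=0.99,
        alpha=0.2,
        num_min=2,
        rng=rng,
    )
    print(masks, y)
```

In a full implementation, critic $e$ would only receive a gradient from transitions whose mask $m^{e}$ equals one, which keeps the ensemble members diverse.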

A.1 Hyperparameter Settings

We build our SAC implementation based on CleanRL (Huang et al., 2022). All SAC-specific hyperparameters are kept consistent between all approaches and reported in Table 1.

Table 1: Shared Hyperparameters.

Hyperparameter	Value
number of steps	$1.5\times 10^{6}$
batch size	$256$
learning rate actor	$3\times 10^{-4}$
learning rate critic	$3\times 10^{-4}$
target entropy $H_{t}$	$-3$
replay buffer size	$1\times 10^{6}$
discount factor $\gamma$	$0.99$
gradient update start	$1\times 10^{3}$ steps
Polyak averaging factor $\tau$	$0.005$

CHEQ (UTD1 and UTD20) uses an ensemble of $E=5$ critics. We set the upper bound of the uncertainty to $u_{\mathrm{max}}=0.15$ and the lower bound to $u_{\mathrm{min}}=0.03$. Further, we set $\lambda_{\mathrm{max}}=1.0$ and $\lambda_{\mathrm{min}}=0.2$. We use a Bernoulli masking rate of $\kappa=0.8$ and $F=2$ minimization targets.
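To make the role of these bounds concrete, the following sketch maps an ensemble uncertainty to a weight and mixes the two actions. It rests on two assumptions that are not confirmed by this appendix: the uncertainty is taken as the standard deviation of the critic ensemble outputs, and $\Lambda$ is a clipped linear map that returns $\lambda_{\mathrm{max}}$ for $u\leq u_{\mathrm{min}}$ and $\lambda_{\mathrm{min}}$ for $u\geq u_{\mathrm{max}}$. The exact definitions are given by (5) and (6) in the main text. All function names are hypothetical.

```python
import statistics

# Values reported in this appendix.
U_MIN, U_MAX = 0.03, 0.15
LAMBDA_MIN, LAMBDA_MAX = 0.2, 1.0


def ensemble_uncertainty(q_values):
    """Assumed uncertainty measure: standard deviation across the critic ensemble."""
    return statistics.pstdev(q_values)


def adapt_weight(u, u_min=U_MIN, u_max=U_MAX, lam_min=LAMBDA_MIN, lam_max=LAMBDA_MAX):
    """Assumed clipped linear map: low uncertainty gives a high RL weight."""
    if u <= u_min:
        return lam_max
    if u >= u_max:
        return lam_min
    frac = (u - u_min) / (u_max - u_min)
    return lam_max - frac * (lam_max - lam_min)


def mix_actions(a_prior, a_rl, lam):
    """Mixing as in Algorithm 2: a_mix = (1 - lam) * a_prior + lam * a_rl."""
    return [(1.0 - lam) * p + lam * r for p, r in zip(a_prior, a_rl)]


if __name__ == "__main__":
    u = ensemble_uncertainty([1.00, 1.05, 0.95, 1.10, 0.90])
    lam = adapt_weight(u)
    print(u, lam, mix_actions([0.0, 0.5], [1.0, -0.5], lam))
```

Under these assumptions, an uncertainty halfway between the two bounds ($u=0.09$) yields a weight halfway between $\lambda_{\mathrm{max}}$ and $\lambda_{\mathrm{min}}$, i.e., $0.6$.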

BCF trains an ensemble of policy networks. We maintain the original ensemble size from Rana et al. (2023), which uses ten policy networks. We set the standard deviation of the control prior in BCF to $\sigma^{\mathrm{prior}}=6.0$.

For the uncertainty estimate in CORE, we set $A=7$ and $C=0.02$. Note that in the original paper, $A$ is denoted as $\lambda_{\mathrm{max}}$, which we change to avoid ambiguous notation.

SAC uses a UTD ratio of $1$. The REDQ implementation uses an ensemble size of $5$ and a UTD ratio of $20$.

For all algorithms, we include a random sampling phase for the first $1\times 10^{3}$ steps, during which we sample the RL action uniformly at random and do not update the agent. In this phase, we keep $\lambda^{\mathrm{RL}}_{t}$ small for the hybrid agents. For CHEQ, we vary $\lambda^{\mathrm{RL}}$ within $[0.2, 0.3]$. As CORE and BCF are unable to observe changes in $\lambda^{\mathrm{RL}}$, we use a fixed $\lambda^{\mathrm{RL}}=0.2$, which has proven favorable in our experiments. After the random sampling phase, agent training starts, but $\lambda^{\mathrm{RL}}$ is kept small for another $4\times 10^{3}$ steps; afterward, the adaption of $\lambda^{\mathrm{RL}}$ starts.
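The warm-up schedule described above can be summarized in a small sketch. The phase names, the zero-based step counter with exclusive upper boundaries, and the uniform sampling of $\lambda^{\mathrm{RL}}$ within $[0.2, 0.3]$ for CHEQ are assumptions made for illustration; the text only states that the weight is varied within this interval.

```python
import random

RANDOM_PHASE_STEPS = 1_000  # random RL actions, no agent updates
LOW_WEIGHT_STEPS = 4_000    # agent is trained, weight is still kept small


def training_phase(step):
    """Return the warm-up phase for a zero-based environment step."""
    if step < RANDOM_PHASE_STEPS:
        return "random"
    if step < RANDOM_PHASE_STEPS + LOW_WEIGHT_STEPS:
        return "low_weight"
    return "adaptive"


def warmup_weight(rng, contextualized):
    """Weight used before adaption starts.

    Contextualized agents (CHEQ) observe the weight, so it is varied;
    CORE and BCF use a fixed value.
    """
    if contextualized:
        return rng.uniform(0.2, 0.3)  # assumed sampling scheme
    return 0.2


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 999, 1_000, 4_999, 5_000):
        print(step, training_phase(step), warmup_weight(rng, contextualized=True))
```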

For performance evaluation, we conduct a greedy evaluation run every $20$ episodes. Evaluation happens in an adapted setting, together with the controller where the weight is calculated as in the training procedure.

Appendix B Environment Details

B.1 Racing Environment

We test our agent on the simulated racing task adapted from Schier et al. (2023). Figure 6 shows an example of the environment.

The vehicle uses a dynamic single-track model with a coupled Dugoff tire model. The throttle, brake, and steering are continuous actions. The vehicle is front-wheel drive. The RL agent may learn to control brake balance by applying throttle and brake individually. We define the state of the RL agent as $\mathbf{s}_{t}=(v_{x},v_{y},\omega,\beta,o_{\mathrm{track}})$, with the ego vehicle's velocity vector $\mathbf{v}_{\mathrm{ego}}=(v_{x},v_{y})$ in the vehicle reference frame, yaw rate $\omega$, and steering angle $\beta$. The observation of the track $o_{\mathrm{track}}=(\mathbf{x},\mathbf{y})^{T}$ is given as a vector of $20$ Cartesian distances $(x_{i},y_{i})$ to the centerline of the track. The points are sampled equidistantly from the 60 m track segment ahead.

We use the original reward formulation from Schier et al. (2023), where the RL agent receives a penalty $r_{\mathrm{collision}}$ whenever it collides with the track boundary and a penalty $r_{\mathrm{fail}}$ for leaving the track with the center of mass. The latter case also terminates the episode. The RL agent receives a positive reward for driving fast: the scalar projection of its velocity vector $\mathbf{v}_{\mathrm{ego}}$ onto the forward track direction $\mathbf{n}_{\mathrm{track}}$. The complete reward is then given by

r(s,a)=-r_{\mathrm{fail}}-0.2\cdot r_{\mathrm{collision}}+0.01\cdot\mathbf{n}_{\mathrm{track}}\cdot\mathbf{v}_{\mathrm{ego}}. \qquad (7)
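As a worked illustration of (7), the sketch below evaluates the reward for a single step. The penalty magnitudes are not specified in this appendix, so they are passed in as arguments. The function name and the boolean event flags are hypothetical, and both vectors are assumed to be expressed in the same reference frame.

```python
def racing_reward(v_ego, n_track, collided, left_track, r_collision, r_fail):
    """Reward from (7) for one step.

    v_ego and n_track are 2D vectors; n_track is the forward track direction.
    The penalties only apply in steps where the respective event occurs.
    """
    progress = n_track[0] * v_ego[0] + n_track[1] * v_ego[1]
    reward = 0.01 * progress
    if collided:
        reward -= 0.2 * r_collision
    if left_track:
        reward -= r_fail
    return reward


if __name__ == "__main__":
    # Driving at 10 m/s along the track direction without incidents.
    print(racing_reward((10.0, 0.0), (1.0, 0.0), False, False,
                        r_collision=1.0, r_fail=1.0))
```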

B.2 Control Prior

For the racing task, we design a simple path-following controller with adaptive speeds. For the lateral control, we use a Stanley Controller (Thrun et al., 2006) following the steering control law

\delta(t)=\psi(t)+\frac{k_{\mathrm{cross}}\cdot e(t)}{v(t)+k_{\mathrm{soft}}},

where $\psi(t)$ denotes the heading error, $e(t)$ denotes the crosstrack error of the front axle and $v(t)$ describes the velocity of the vehicle.
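The steering law can be written as a one-line function. The sketch follows the formula exactly as stated above and uses the gains reported at the end of this section as defaults. Sign conventions for the errors and steering-angle saturation depend on the simulator and are not modeled here.

```python
def stanley_steering(heading_error, crosstrack_error, speed, k_cross=0.5, k_soft=1.0):
    """Steering command: heading error plus a speed-normalized cross-track term."""
    return heading_error + k_cross * crosstrack_error / (speed + k_soft)


if __name__ == "__main__":
    # 0.1 rad heading error, 2 m cross-track error, 3 m/s speed.
    print(stanley_steering(0.1, 2.0, 3.0))
```

The softening constant $k_{\mathrm{soft}}$ keeps the cross-track term bounded when the vehicle is nearly at rest.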

For the longitudinal control, we design two symmetric P-controllers; one for the brake and one for the throttle. First, we compute the target velocity dependent on the curve radius $R(t)$ of the track directly in front of the vehicle as

v_{\mathrm{target}}(t)=\min\{k_{r}\cdot R(t),v_{\mathrm{max}}\},

where $v_{\mathrm{max}}$ is the maximum desired velocity. Then, we design the throttle control as

\mathrm{throt}(t)=\begin{cases}k_{v}(t)\cdot(v_{\mathrm{target}}(t)-v(t)),&\quad v_{\mathrm{target}}(t)-v(t)\geq 0\\ 0,&\quad\text{else}\end{cases}

and the brake control as

\mathrm{br}(t)=\begin{cases}k_{v}(t)\cdot(v(t)-v_{\mathrm{target}}(t)),&\quad v_{\mathrm{target}}(t)-v(t)\leq 0\\ 0,&\quad\text{else,}\end{cases}

with shared gain $k_{v}$. Following this control law, the control prior accelerates if it is going too slowly and brakes if it is going too fast. It never uses the brake and the throttle at the same time.

To avoid aggressive braking behavior when the RL agent hands over to the controller in risky situations (high velocity around curves), we additionally introduce a simple clipped linear gain schedule on $k_{v}$ attenuating the braking control for higher velocities as

k_{v}(t)=\mathrm{clip}\left(\frac{k_{v}^{\mathrm{max}}-k_{v}^{\mathrm{min}}}{v_{\mathrm{low}}-v_{\mathrm{high}}}(v(t)-v_{\mathrm{low}})+k_{v}^{\mathrm{max}};\,k_{v}^{\mathrm{max}},k_{v}^{\mathrm{min}}\right).

We tuned the controller gains and coefficients to $k_{\mathrm{cross}}=0.5\,\mathrm{s}^{-1}$, $k_{\mathrm{soft}}=1\,\mathrm{m\,s^{-1}}$, $k_{r}=0.4\,\mathrm{s}^{-1}$, $v_{\mathrm{max}}=8\,\mathrm{m\,s^{-1}}$, $k_{v}^{\mathrm{max}}=0.25\,\mathrm{s\,m^{-1}}$, $k_{v}^{\mathrm{min}}=0.05\,\mathrm{s\,m^{-1}}$, $v_{\mathrm{low}}=8\,\mathrm{m\,s^{-1}}$, and $v_{\mathrm{high}}=28\,\mathrm{m\,s^{-1}}$.
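Putting the longitudinal control law and the gain schedule together, a minimal sketch with the tuned values from above could look as follows. The function names are hypothetical, and actuator saturation of throttle and brake is not modeled.

```python
K_R = 0.4       # 1/s
V_MAX = 8.0     # m/s
KV_MAX = 0.25   # s/m
KV_MIN = 0.05   # s/m
V_LOW = 8.0     # m/s
V_HIGH = 28.0   # m/s


def gain_schedule(v):
    """Clipped linear schedule: KV_MAX up to V_LOW, falling to KV_MIN at V_HIGH."""
    slope = (KV_MAX - KV_MIN) / (V_LOW - V_HIGH)
    k_v = slope * (v - V_LOW) + KV_MAX
    return min(max(k_v, KV_MIN), KV_MAX)


def longitudinal_control(v, curve_radius):
    """Return (throttle, brake) from the two P-controllers with shared gain."""
    v_target = min(K_R * curve_radius, V_MAX)
    k_v = gain_schedule(v)
    throttle = k_v * (v_target - v) if v_target - v >= 0 else 0.0
    brake = k_v * (v - v_target) if v_target - v <= 0 else 0.0
    return throttle, brake


if __name__ == "__main__":
    print(longitudinal_control(v=4.0, curve_radius=100.0))  # too slow: throttle
    print(longitudinal_control(v=18.0, curve_radius=10.0))  # too fast: brake
```

At $v=18\,\mathrm{m\,s^{-1}}$, halfway between $v_{\mathrm{low}}$ and $v_{\mathrm{high}}$, the schedule yields a gain of $0.15\,\mathrm{s\,m^{-1}}$, so braking is noticeably softer than at low speed.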

Appendix C Additional Results and Ablation Study

In this section, we formulate contextualized hybrid variants of CORE and BCF, C-CORE and C-BCF. Further, we present additional results on hyperparameter sensitivity and the distribution of $\lambda^{\mathrm{RL}}$ for CHEQ-UTD1, CORE, and BCF. For a reasonable comparison, we focus mainly on CHEQ-UTD1 using a UTD ratio of $1$.

C.1 Contextualized Hybrid Variants of Prior Work

To further substantiate our claim that the contextualized hybrid RL formulation aids the training progress, we developed contextualized variants of the CORE and BCF algorithms, which we call C-CORE and C-BCF. To use the adaptive weight as a context variable, a weighting parameter $\lambda_{t}^{\mathrm{RL}}$ needs to be determined.

The CORE algorithm comes with a direct weight estimate (Cheng et al., 2019), which can be written as

\lambda_{t}^{\mathrm{RL}}=\frac{1}{1+\lambda_{t}^{\mathrm{CORE}}}

where $\lambda_{t}^{\mathrm{CORE}}=A(1-e^{-C\lvert\delta_{t-1}\rvert})$ with the TD-error $\delta_{t-1}$ and $C$, $A$ being tuning parameters. (CORE (Cheng et al., 2019) uses the term $\lambda_{\mathrm{max}}$ instead of $A$. As we use $\lambda_{\mathrm{max}}$ in a different context, we stick to $A$ here.)
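The resulting C-CORE weight can be computed directly from the TD-error, as the following sketch shows. The defaults are the values $A=7$ and $C=0.02$ reported in Appendix A.1; the function name is hypothetical.

```python
import math


def core_rl_weight(td_error, A=7.0, C=0.02):
    """RL weight derived from the CORE regularization weight."""
    lam_core = A * (1.0 - math.exp(-C * abs(td_error)))
    return 1.0 / (1.0 + lam_core)


if __name__ == "__main__":
    for delta in (0.0, 10.0, 100.0):
        print(delta, core_rl_weight(delta))
```

A vanishing TD-error gives $\lambda_{t}^{\mathrm{RL}}=1$, while the weight approaches $1/(1+A)=0.125$ for very large TD-errors, so $A$ sets the lower bound of the RL weight.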

For C-BCF, we derive a pseudo-weight, as the BCF algorithm (Rana et al., 2023) does not have a straightforward weighting factor $\lambda^{\mathrm{RL}}$. In BCF, at timestep $t-1$, the fusion of the prior $\mathcal{N}_{\psi,t-1}(\mu_{\psi,t-1},\sigma_{\psi,t-1})$ and the SAC-ensemble $\mathcal{N}_{\pi,t-1}(\mu_{\pi,t-1},\sigma_{\pi,t-1})$ results in a Gaussian distribution with mean

\mu_{\mathrm{fuse},t-1}=\frac{\sigma_{\psi,t-1}^{2}}{\sigma_{\psi,t-1}^{2}+\sigma_{\pi,t-1}^{2}}\cdot\mu_{\pi,t-1}+\frac{\sigma_{\pi,t-1}^{2}}{\sigma_{\psi,t-1}^{2}+\sigma_{\pi,t-1}^{2}}\cdot\mu_{\psi,t-1}.

Thus, in BCF the weight has the dimension of the action space, whereas the contextualized mechanism requires a scalar weight $\lambda^{\mathrm{RL}}_{t}$. For C-BCF, we compute a scalar weight

\lambda^{\mathrm{RL}}_{t}=\frac{1}{N}\sum_{i=1}^{N}\left(\frac{\sigma_{\psi,t-1}^{2}}{\sigma_{\psi,t-1}^{2}+\sigma_{\pi,t-1}^{2}}\right)_{i}.

The next action $\mathbf{a}^{\mathrm{mix}}_{t}$ is computed according to Equation 3, where $\mathbf{a}^{\mathrm{RL}}_{t}\sim\mathcal{N}_{\pi,t}$ and $\mathbf{a}^{\mathrm{prior}}_{t}=\mu_{\psi,t}$.
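The per-dimension fusion weights and their scalar average can be sketched as follows. The code assumes that $N$ is the number of action dimensions and that all standard deviations are given per dimension; the function names are hypothetical.

```python
def fuse_gaussians(mu_pi, sigma_pi, mu_psi, sigma_psi):
    """Per-dimension fused mean and RL weights for two Gaussians.

    The RL weight in each dimension is sigma_psi^2 / (sigma_psi^2 + sigma_pi^2),
    so a confident policy (small sigma_pi) receives a weight close to one.
    """
    fused, weights = [], []
    for m_pi, s_pi, m_psi, s_psi in zip(mu_pi, sigma_pi, mu_psi, sigma_psi):
        w = s_psi ** 2 / (s_psi ** 2 + s_pi ** 2)
        weights.append(w)
        fused.append(w * m_pi + (1.0 - w) * m_psi)
    return fused, weights


def scalar_rl_weight(weights):
    """C-BCF pseudo-weight: mean of the per-dimension RL weights."""
    return sum(weights) / len(weights)


if __name__ == "__main__":
    fused, weights = fuse_gaussians(
        mu_pi=[1.0, 0.0], sigma_pi=[1.0, 3.0],
        mu_psi=[0.0, 1.0], sigma_psi=[1.0, 1.0],
    )
    print(fused, weights, scalar_rl_weight(weights))
```

Averaging discards the per-dimension structure, which is one reason why the pseudo-weight is only a rough estimate of the actual mixing in BCF.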

For C-BCF and C-CORE we use the same warm-up phase as for our CHEQ agent.

C.2 Additional Results and Hyperparameter Tuning for CHEQ

Our algorithm has only two important hyperparameters, the upper and lower bounds of the uncertainty $u_{\mathrm{max}}$ and $u_{\mathrm{min}}$. We chose these hyperparameters by conducting one training run and investigating the uncertainty range within this run. CHEQ is generally robust against changes in these thresholds. We observe slightly lower final return and fewer fails for lower upper bounds $u_{\mathrm{max}}$. This is to be expected, as frequent handover to the control prior results in lower velocities and thus lower return. Figure 7 shows the return and the number of fails during training for our CHEQ-UTD1 variants. For CHEQ-UTD20, we were able to use the same upper and lower uncertainty bounds as for CHEQ-UTD1.

Figure 8 shows the distribution of the weight $\lambda^{\mathrm{RL}}$ over the training progress. We find that for CHEQ-UTD1 the agent starts with an almost uniform distribution of the weight and slowly moves towards a $\lambda^{\mathrm{RL}}=1$ regime. Even in later training stages, the agent hands over to the controller from time to time.

Lastly, we investigated the transfer behavior of the CHEQ-UTD20 agent further. Figure 9 shows the ten transfer tracks. We plot $\lambda^{\mathrm{RL}}$ over the track for one evaluated model. Here, we find that the agent frequently becomes uncertain and hands over to the controller, especially in unknown curves. In plot C we see one of the two failure cases (out of 100 runs) that we experience during transfer. We find that the agent becomes uncertain and hands over to the controller. In this specific scenario, however, the controller is not able to safely navigate the situation and leaves the track.

C.3 Additional Results and Hyperparameter Tuning for CORE and C-CORE

In Figure 10, we compare different parameters $C$. We find stable training progress and high return for $C=0.02$, but since this setting uses high $\lambda^{\mathrm{RL}}$ values, it results in many failures. Figure 8 illustrates the high $\lambda^{\mathrm{RL}}$ regime of the $C=0.02$ agent. C-CORE, using our contextualized hybrid framework, notably outperforms CORE in terms of asymptotic return, training stability, and the number of training failures. Using the contextualized formulation, C-CORE can use a much wider $\lambda^{\mathrm{RL}}$ distribution (see Figure 8).

C.4 Additional Results and Hyperparameter Tuning for BCF and C-BCF

The BCF algorithm is sensitive to the parameters of the uncertainty threshold, which in this case is the variance of the control prior $\sigma^{\mathrm{prior}}$. Higher variances lead to less weight on the control prior and thus to high $\lambda^{\mathrm{RL}}$ regimes. In Figure 11, we compare different parameters $\sigma^{\mathrm{prior}}$. We find stable training progress and high return for $\sigma^{\mathrm{prior}}=6$. However, this setting, as expected, uses a $\lambda^{\mathrm{RL}}$ regime close to one and thus results in a high number of failures. Figure 8 illustrates this regime.

Our C-BCF variant can resolve this problem only partially. Due to its construction, the BCF algorithm has a separate weighting factor for each action of which we take the mean. In addition, our pseudo $\lambda^{\mathrm{RL}}$ factor is only a rough estimate of the actual mixing as BCF samples from the posterior distribution. Both factors result in information loss and make the weight $\lambda^{\mathrm{RL}}$ only a rough estimate. We find that C-BCF-2.0 and BCF-6.0 achieve similar asymptotic performance, while C-BCF-2.0 leads to fewer failures.

C.5 Return vs. Failure Comparison for all trained Models

Figure 12 shows a scatter plot of the final return and the accumulated failures during training for all hybrid algorithms discussed in this paper. This final comparison shows that if prior methods are trained with the contextualized framework and tuned well (C-BCF-0.8, C-CORE-0.4, C-CORE-0.8), they achieve high returns while maintaining fewer failures than their non-contextualized counterparts. Our algorithm (CHEQ-UTD1, CHEQ-UTD20) achieves the highest return while maintaining the lowest number of cumulative failures.
