Contextualized Hybrid Ensemble Q-learning:
Learning Fast with Control Priors

Emma Cramer*
[email protected]
Institute for Data Science in Mechanical Engineering
RWTH Aachen University

Bernd Frauenknecht*
[email protected]
Institute for Data Science in Mechanical Engineering
RWTH Aachen University

Ramil Sabirov*
[email protected]
Institute for Data Science in Mechanical Engineering
RWTH Aachen University

Sebastian Trimpe
[email protected]
Institute for Data Science in Mechanical Engineering
RWTH Aachen University

* These authors contributed equally.

Abstract

Combining Reinforcement Learning (RL) with a prior controller can yield the best of both worlds: RL can solve complex nonlinear problems, while the control prior ensures safer exploration and speeds up training. Prior work largely blends both components with a fixed weight, neglecting that the RL agent’s performance varies with the training progress and across regions in the state space. Therefore, we advocate for an adaptive strategy that dynamically adjusts the weighting based on the RL agent’s current capabilities. We propose a new adaptive hybrid RL algorithm, Contextualized Hybrid Ensemble Q-learning (CHEQ). CHEQ combines three key ingredients: (i) a time-invariant formulation of the adaptive hybrid RL problem treating the adaptive weight as a context variable, (ii) a weight adaption mechanism based on the parametric uncertainty of a critic ensemble, and (iii) ensemble-based acceleration for data-efficient RL. Evaluating CHEQ on a car racing task reveals substantially stronger data efficiency, exploration safety, and transferability to unknown scenarios than state-of-the-art adaptive hybrid RL methods.

1 Introduction

Deep reinforcement learning (RL) methods have shown great success in challenging control problems such as gameplay (Mnih et al., 2015; Silver et al., 2018a; OpenAI et al., 2019) and robotic manipulation (Gupta et al., 2021; Büchler et al., 2022). Despite the great potential of RL methods, their data inefficiency, unstructured exploration behavior, and inability to generalize to unknown scenarios represent a significant hurdle to their application to real-world problems.

A prime reason for limited real-world applications is the task-agnostic architecture of state-of-the-art RL approaches (Schulman et al., 2017; Haarnoja et al., 2018) that does not incorporate prior knowledge on how to solve the task at hand. In contrast, control theory provides a rich set of methods for deriving near-optimal controllers in many applications. This motivates the drive for hybrid RL methods (Silver et al., 2018b; Johannink et al., 2019) that blend control priors with deep RL policies. Hybrid algorithms thus combine the prior controller’s generalization capabilities and informed behavior with the power of deep RL for solving general nonlinear problems.

Notwithstanding the conceptual benefit of hybrid RL formulations, how to systematically combine the control prior with the RL agent largely remains an open problem. The majority of prior work (Silver et al., 2018b; Johannink et al., 2019; Schoettler et al., 2020; Ceola et al., 2024) proposes a fixed weighting between the control prior and the RL agent. A fixed blending, however, disregards the fact that the capability of the RL agent depends on training time and state. In general, as more data is observed, the RL agent improves its behavior, ultimately outperforming the control prior in large portions of the domain. The core idea of our approach is to adapt the weighting between RL agent and control prior based on the agent’s confidence. As the RL agent improves over time, this induces a time-variant weighting mechanism. This time dependency leads to structural problems of prior formulations in uncertainty-adapted hybrid RL (Cheng et al., 2019; Rana et al., 2023).

We provide a unified view on hybrid RL that allows us to classify prior work within a general framework. Analyzing this framework highlights the necessity for a novel adaptive hybrid RL formulation with descriptive, time-invariant dynamics. We define the contextualized hybrid Markov decision process (MDP), introducing the adaptive weight as a context variable. Building upon this formulation, we propose the Contextualized Hybrid Ensemble Q-learning (CHEQ) algorithm that systematically adapts the weighting between RL agent and control prior based on an uncertainty estimate of a critic ensemble. CHEQ combines the contextualized hybrid RL formulation with uncertainty-based weight adaption and existing ensemble-based acceleration techniques for data-efficient RL.

We evaluate our algorithm on a racing task (Schier et al., 2023), which requires operating a car close to its stability limits in order to achieve maximum return. We find that compared to prior work in adaptive hybrid RL, the CHEQ algorithm shows (i) reduced failures during training, (ii) increased sample efficiency, and (iii) improved transfer behavior on unseen race tracks.

In summary, our main contributions are:

  • A unified framework that allows us to classify existing approaches and reveal key limitations.

  • A hybrid MDP formulation, introducing the adaptive weight as a context variable and thus addressing structural problems of prior work in hybrid RL with adaptive weighting.

  • A novel hybrid RL algorithm, CHEQ, that systematically adapts the weighting between RL agent and control prior based on Q-ensemble disagreement.

2 Related Work

This section discusses relevant prior work combining RL and a control prior. We distinguish two types: hybrid RL with fixed weighting and hybrid RL with adaptive weighting between the RL agent and the controller.

Hybrid Reinforcement Learning with Fixed Weighting. Two concurrent works (Silver et al., 2018b; Johannink et al., 2019) first combined RL and a control prior and introduced the term residual RL. In residual RL, the control prior is assumed to be fixed, and the RL agent learns a residual on top of it. In this work, we use the general term hybrid RL to include approaches that adapt the controller’s weight. Silver et al. (2018b) and Johannink et al. (2019) show advantages of hybrid RL, such as sample efficiency, improved sim-to-real transfer, and robustness towards uncertainties. Hybrid RL with fixed weights has since been applied successfully to real robot insertion tasks (Schoettler et al., 2020), peg insertion under uncertainty (Ranjbar et al., 2021), driving (Kerbel et al., 2022), and learning a residual RL policy on top of a pre-trained RL agent (Ceola et al., 2024). A fixed mixing, however, cannot account for the improving capabilities of the RL agent.

Hybrid Reinforcement Learning with Adaptive Weighting. Daoudi et al. (2023) assume a given controller confidence function, employing a controller in instances of high confidence and an RL agent in other scenarios. Our work focuses on the RL agent’s confidence and proposes to estimate the confidence based on a critic ensemble. Similar to our approach, Hoel et al. (2020a; b) train an ensemble of bootstrapped Q-networks for a driving task with discrete actions. They evaluate the uncertainty as the coefficient of variation of Q-estimates and resort to safe fallback actions in case of high uncertainty. However, they do not combine controller and RL agent but switch between both. In this work, we investigate a seamless blending approach for continuous control. Rana et al. (2020b) estimate the policy uncertainty using Monte-Carlo dropout and, based on this uncertainty, sample either from a residual policy or from the controller alone. Rana et al. (2020a) directly fuse a prior control distribution with an RL policy in a multiplicative fashion and anneal the influence of the control prior over training time. Rana et al. (2023) use a policy ensemble to estimate how certain the RL agent is about the current action. The combined action is then computed as the Bayesian posterior of control prior and policy distribution. Cheng et al. (2019) use the TD-error as an uncertainty estimate and combine controller and RL agent based on this. Both Rana et al. (2023) and Cheng et al. (2019) base their adaption mechanism on a form of policy uncertainty. Both approaches train based on the combined action, which becomes brittle when facing large distributional shifts. We further discuss this limitation in Section 5.

3 Background

The following introduces the key components and the general concept of hybrid RL.

Reinforcement Learning. RL is a method for solving sequential decision problems based on the interaction between an agent and an environment (Sutton & Barto, 2018). The environment is modeled as a discounted Markov decision process defined by the tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},p,r,\rho_{0},\gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, and start state distribution $\rho_{0}$. The commonly unknown transition function $p(\mathbf{s}_{t+1},r_{t+1}\mid\mathbf{s}_{t},\mathbf{a}_{t}^{\mathrm{RL}})$ describes transitions between states $\mathbf{s}_{t}\in\mathcal{S}$ and actions $\mathbf{a}^{\mathrm{RL}}_{t}\in\mathcal{A}$. During transitions, rewards $r_{t}\in\mathbb{R}$ are emitted according to a reward function $r_{t+1}\sim r(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t})$. The objective of the RL agent is to learn a policy $\pi^{\mathrm{RL}}(\mathbf{a}^{\mathrm{RL}}_{t}\mid\mathbf{s}_{t})$ that maximizes the expected cumulative sum of rewards discounted by $\gamma\in(0,1)$. This results in the RL objective

$$J(\pi^{\mathrm{RL}})=\max_{\pi^{\mathrm{RL}}}\mathbb{E}_{\pi^{\mathrm{RL}},\mathcal{M}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t+1}\right]. \tag{1}$$

The discounted sum of rewards is referred to as return and is accumulated along trajectories under the policy $\pi^{\mathrm{RL}}$ and the environment MDP $\mathcal{M}$. State value functions condition the expected return on a particular state, $V^{\pi^{\mathrm{RL}}}(\mathbf{s}_{t})=\mathbb{E}_{\pi^{\mathrm{RL}},\mathcal{M}}\left[\sum_{k=t}^{\infty}\gamma^{k-t}r_{k+1}\mid\mathbf{s}_{t}\right]$, while action value functions, or Q-functions, condition the expected return on specific state-action pairs, $Q^{\pi^{\mathrm{RL}}}(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t})=\mathbb{E}_{\pi^{\mathrm{RL}},\mathcal{M}}\left[\sum_{k=t}^{\infty}\gamma^{k-t}r_{k+1}\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t}\right]$.
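As a small worked illustration of the return inside these expectations, the following sketch evaluates the discounted sum for a finite reward sequence. The function name `discounted_return` and the example rewards are hypothetical, and the infinite sum is truncated to the rewards that are given.

```python
def discounted_return(rewards, gamma):
    """Compute sum_k gamma^k * rewards[k] for a finite reward list.

    rewards[0] plays the role of r_{t+1}, rewards[1] of r_{t+2}, and so on.
    The sum is accumulated backwards via G = r + gamma * G.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical three-step reward sequence with gamma = 0.5:
# 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
print(discounted_return([1.0, 0.0, 2.0], gamma=0.5))
```

Averaging this quantity over many sampled trajectories that start in the same state (or state-action pair) would give a Monte Carlo estimate of the corresponding value function.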

Control Prior. The prior policy $\pi^{\mathrm{prior}}(\mathbf{a}^{\mathrm{prior}}_{t}\mid\mathbf{s}_{t})$ represents prior knowledge for solving the RL objective (1), while typically not providing the optimal policy over the whole domain $\mathcal{S}\times\mathcal{A}$. This work focuses on control priors based on classic control theory. These can be derived with limited effort in many applications and often provide a good baseline for interaction with $\mathcal{M}$. We assume a control prior that is time-invariant and without an internal state.

Hybrid Reinforcement Learning. Hybrid RL combines the control prior and the RL agent by blending their actions via some mixing function $\mathbf{a}^{\mathrm{mix}}_{t}=f(\mathbf{a}^{\mathrm{prior}}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\boldsymbol{\lambda}_{t})$ depending on a weight $\boldsymbol{\lambda}_{t}$.

4 A Unified View on Hybrid Reinforcement Learning

Next, we develop a unified view on hybrid RL that allows classifying prior methods (cf. Section 2). In the standard RL setup depicted in Figure 1(a), the RL agent $\pi^{\mathrm{RL}}$ interacts with the time-invariant MDP $\mathcal{M}$ with dynamics $p(\mathbf{s}_{t+1},r_{t+1}\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t})$ that represents the controlled system. Hybrid RL (see Figure 1(b)) incorporates a control prior $\pi^{\mathrm{prior}}$, which requires reformulating the standard framework. Here, $\pi^{\mathrm{RL}}$ and $\pi^{\mathrm{prior}}$ apply a combined action $\mathbf{a}^{\mathrm{mix}}_{t}$ to $\mathcal{M}$. The mixing function $f$ generates a combined action by blending the individual actions based on a weighting vector $\boldsymbol{\lambda}_{t}$ provided by a weight adaption function $\Lambda$. Within this generalized framework, prior work in hybrid RL can be categorized based on the choice of $f$ and $\Lambda$.

4.1 Mixing Function $f$

We consider mixing functions based on a weighted sum with a weighting vector $\boldsymbol{\lambda}_{t}=[\lambda^{\mathrm{prior}}_{t},\lambda^{\mathrm{RL}}_{t}]^{\top}$:

$$\mathbf{a}^{\mathrm{mix}}_{t}=f(\mathbf{a}^{\mathrm{prior}}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\boldsymbol{\lambda}_{t})=\lambda^{\mathrm{prior}}_{t}\cdot\mathbf{a}^{\mathrm{prior}}_{t}+\lambda^{\mathrm{RL}}_{t}\cdot\mathbf{a}^{\mathrm{RL}}_{t}. \tag{2}$$

This formulation allows us to distinguish between a residual and a regularized setting.

In the residual setting, $\lambda^{\mathrm{prior}}_{t}$ is typically constant while $\lambda^{\mathrm{RL}}_{t}$ can vary, i.e., $\boldsymbol{\lambda}_{t}=[1,\lambda^{\mathrm{RL}}_{t}]^{\top}$ (Silver et al., 2018b; Johannink et al., 2019; Schoettler et al., 2020). Thus, the RL agent interacts with the closed control loop between $\pi^{\mathrm{prior}}$ and $\mathcal{M}$ and learns a residual action on top of $\mathbf{a}^{\mathrm{prior}}_{t}$. Consequently, $\boldsymbol{\lambda}_{t}$ modulates the RL agent’s impact on the closed-loop dynamics. As the control prior is not scaled down, it might interpret the RL agent as a disturbance and counteract it (Ranjbar et al., 2021), which can limit the overall performance of residual formulations.

In the regularized setting, both weights are adaptable such that $\lambda^{\mathrm{prior}}_{t}+\lambda^{\mathrm{RL}}_{t}=1$ (Cheng et al., 2019; Rana et al., 2023). This results in a mixing function of the form

$$\mathbf{a}^{\mathrm{mix}}_{t}=f(\mathbf{a}^{\mathrm{prior}}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})=(1-\lambda^{\mathrm{RL}}_{t})\cdot\mathbf{a}^{\mathrm{prior}}_{t}+\lambda^{\mathrm{RL}}_{t}\cdot\mathbf{a}^{\mathrm{RL}}_{t}, \tag{3}$$

with $\lambda^{\mathrm{RL}}_{t}\in[0,1]$. In the limit $\lambda^{\mathrm{RL}}_{t}=0$, the control prior interacts with $\mathcal{M}$ without the interference of the RL agent, while the regularized setting reduces to the standard RL problem for $\lambda^{\mathrm{RL}}_{t}=1$. Thus, $\lambda^{\mathrm{RL}}_{t}$ indicates not only the impact of $\mathbf{a}^{\mathrm{RL}}_{t}$ but also whether the RL agent interacts with the open-loop dynamics of $\mathcal{M}$ or the closed-loop dynamics as in the residual setting. Consequently, the control prior can be interpreted as a regularization of the RL agent. Our proposed algorithm operates in the regularized setting, allowing the agent to take over complete control when $\lambda^{\mathrm{RL}}_{t}=1$.
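The two settings differ only in how the prior weight is chosen, which the following minimal sketch makes concrete. It implements the weighted sum of Eq. (2) and its two special cases for actions stored as plain Python lists; the function names are hypothetical and chosen for illustration only.

```python
def mix_weighted(a_prior, a_rl, lam_prior, lam_rl):
    """General weighted sum, Eq. (2): lam_prior * a_prior + lam_rl * a_rl."""
    return [lam_prior * p + lam_rl * r for p, r in zip(a_prior, a_rl)]


def mix_residual(a_prior, a_rl, lam_rl):
    """Residual setting: the prior weight is held at 1."""
    return mix_weighted(a_prior, a_rl, 1.0, lam_rl)


def mix_regularized(a_prior, a_rl, lam_rl):
    """Regularized setting, Eq. (3): the weights sum to 1, lam_rl in [0, 1]."""
    if not 0.0 <= lam_rl <= 1.0:
        raise ValueError("lam_rl must lie in [0, 1]")
    return mix_weighted(a_prior, a_rl, 1.0 - lam_rl, lam_rl)


a_prior, a_rl = [1.0, -1.0], [0.0, 1.0]
print(mix_regularized(a_prior, a_rl, 0.0))   # prior action only
print(mix_regularized(a_prior, a_rl, 1.0))   # RL action only
print(mix_regularized(a_prior, a_rl, 0.25))  # convex combination
print(mix_residual(a_prior, a_rl, 0.5))      # prior plus scaled residual
```

In the regularized case the mixed action always stays between the two individual actions, whereas in the residual case the RL action is added on top of the unscaled prior action.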

4.2 Weight Adaption Function $\Lambda$

Hybrid RL approaches can further be classified based on the choice of the weight adaption function $\Lambda$ modulating the weighting vector $\boldsymbol{\lambda}_{t}$ of the mixing function $f$.

A large body of work, which we refer to as fixed-weight hybrid RL (Silver et al., 2018b; Johannink et al., 2019; Schoettler et al., 2020; Ranjbar et al., 2021; Ceola et al., 2024), keeps $\boldsymbol{\lambda}_{t}$ fixed throughout training, neglecting the time- and state-dependent capabilities of the RL agent.

Approaches that adapt $\boldsymbol{\lambda}_{t}$, which we refer to as adaptive hybrid RL methods, rely on different mechanisms. Scheduling approaches (Rana et al., 2020a) change the weight explicitly with time, i.e., $\boldsymbol{\lambda}_{t}=\Lambda(t)$, typically increasing the weight of the RL agent as training progresses. Domain-based approaches (Kulkarni et al., 2022; Daoudi et al., 2023) adapt the weights based on the point of operation within the domain $\mathcal{S}\times\mathcal{A}$, i.e., $\boldsymbol{\lambda}_{t}=\Lambda(\mathbf{s}_{t},\mathbf{a}_{t})$. Uncertainty-based approaches (Cheng et al., 2019; Rana et al., 2023) adapt the weight based on the confidence of the RL agent, indicated by an uncertainty estimate $u(\mathbf{s}_{t},\mathbf{a}_{t},t)$, giving more weight to the RL agent when it has high confidence. Thus, they aim to leverage the benefits of the RL agent whenever possible, while resorting to a safe controller in situations where the RL agent has not seen enough data. The time dependency of the uncertainty estimate, however, increases the complexity of the hybrid RL setting, requiring a reformulation of the learning problem. In Section 5, we discuss the shortcomings of prior formulations and propose our own.
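To make the three categories tangible, the sketch below gives one hypothetical instance of $\Lambda$ per category for a scalar RL weight $\lambda^{\mathrm{RL}}_{t}\in[0,1]$. The linear ramp, the state threshold, and the linear uncertainty mapping, along with all function and parameter names, are assumptions made for illustration; they are not the mechanisms of the cited methods.

```python
def clip(x, lo, hi):
    """Clamp x to the interval [lo, hi]."""
    return max(lo, min(hi, x))


def lam_schedule(t, t_end):
    """Scheduling: lambda = Lambda(t), here a linear ramp from 0 to 1."""
    return clip(t / t_end, 0.0, 1.0)


def lam_domain(s, s_trusted):
    """Domain-based: lambda = Lambda(s), here full RL weight inside a
    hypothetical trusted region |s| <= s_trusted and none outside."""
    return 1.0 if abs(s) <= s_trusted else 0.0


def lam_from_uncertainty(u, u_low, u_high):
    """Uncertainty-based: low uncertainty gives a high RL weight.
    Maps u <= u_low to 1, u >= u_high to 0, linear in between."""
    return clip((u_high - u) / (u_high - u_low), 0.0, 1.0)


print(lam_schedule(50, t_end=100))
print(lam_domain(0.3, s_trusted=0.5))
print(lam_from_uncertainty(0.5, u_low=0.2, u_high=0.8))
```

Only the scheduling variant depends on time explicitly; the uncertainty-based variant depends on time implicitly, because the uncertainty estimate itself changes as the agent is trained.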

Figure 1: (a) A standard RL setting, (b) hybrid RL settings from prior work, and (c) our contextualized hybrid RL setting based on RL action $\mathbf{a}^{\mathrm{RL}}_{t}$ and weighting factor $\boldsymbol{\lambda}^{\mathrm{RL}}_{t}$.

5 Contextualized Hybrid Reinforcement Learning

In Section 5.1, we propose a novel contextualized formulation of the adaptive hybrid RL problem and illustrate its benefits over prior approaches in Section 5.2. Based on that framework, we propose the Contextualized Hybrid Ensemble Q-learning (CHEQ) algorithm in Section 5.3.

5.1 General Concept of Contextualized Hybrid Reinforcement Learning

Based on the unified view provided in Section 4, we propose a general formulation for the hybrid RL problem we call contextualized hybrid RL.

The environment in the hybrid setting not only consists of the controlled system $\mathcal{M}$ but also comprises the control prior, the mixing function, and the weight adaption function. We consider both the control prior and the mixing function to be time-invariant. In contrast, the weight adaption function $\Lambda$ can have time-varying behavior, i.e., $\boldsymbol{\lambda}_{t}=\Lambda(t,\dots)$, as discussed in Section 4.2. This leads to time-varying dynamics of the hybrid environment, violating the assumption of time-invariance in the MDP formulation (Bellemare et al., 2023). Instead, we exclude $\Lambda$ from the definition of the hybrid environment and introduce the adaptive weight vector $\boldsymbol{\lambda}_{t}$ as a context variable to the agent and the environment (see Figure 1(c)). We model the hybrid environment by introducing the contextualized hybrid MDP $\hat{\mathcal{M}}=(\mathcal{S},\mathcal{A},\mathcal{W},\hat{p},r,\rho_{0},\gamma)$, with $\mathcal{W}$ the set of weighting vectors $\boldsymbol{\lambda}_{t}$ and the contextualized dynamics function $\hat{p}(\mathbf{s}_{t+1},r_{t+1}\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\boldsymbol{\lambda}_{t})$.
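As a sketch of why the contextualized dynamics are time-invariant, $\hat{p}$ can be written in terms of the original dynamics $p$, the control prior, and the mixing function $f$. The decomposition below is an illustrative assumption that follows from treating the prior and $f$ as part of the environment; it is not a formula stated in the text.

```latex
\hat{p}(\mathbf{s}_{t+1}, r_{t+1} \mid \mathbf{s}_{t}, \mathbf{a}^{\mathrm{RL}}_{t}, \boldsymbol{\lambda}_{t})
  = \mathbb{E}_{\mathbf{a}^{\mathrm{prior}}_{t} \sim \pi^{\mathrm{prior}}(\cdot \mid \mathbf{s}_{t})}
    \left[ p\left(\mathbf{s}_{t+1}, r_{t+1} \mid \mathbf{s}_{t},
      f(\mathbf{a}^{\mathrm{prior}}_{t}, \mathbf{a}^{\mathrm{RL}}_{t}, \boldsymbol{\lambda}_{t})\right) \right]
```

Every term on the right-hand side is time-invariant once $\boldsymbol{\lambda}_{t}$ is given as an argument, so any time dependence enters only through the value of the context variable and not through the dynamics function itself.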

The MDP formulation $\hat{\mathcal{M}}$ induces the contextualized hybrid RL objective

$$\hat{J}(\pi^{\mathrm{RL}})=\max_{\pi^{\mathrm{RL}}}\mathbb{E}_{\pi^{\mathrm{RL}},\hat{\mathcal{M}}}\left[\sum_{t=0}^{T}\gamma^{t}r_{t+1}\right], \tag{4}$$

which requires learning a policy $\hat{\pi}^{\mathrm{RL}}(\mathbf{a}^{\mathrm{RL}}_{t}\mid\mathbf{s}^{\mathrm{RL}}_{t},\boldsymbol{\lambda}_{t})$ that maximizes the expected return in $\hat{\mathcal{M}}$. Introducing $\boldsymbol{\lambda}_{t}$ as a context variable further yields the contextualized hybrid value functions $\hat{V}^{\pi^{\mathrm{RL}}}(\mathbf{s}_{t},\boldsymbol{\lambda}_{t})=\mathbb{E}_{\pi^{\mathrm{RL}},\hat{\mathcal{M}}}\left[\sum_{k=t}^{\infty}\gamma^{k-t}r_{k+1}\mid\mathbf{s}_{t},\boldsymbol{\lambda}_{t}\right]$ and $\hat{Q}^{\pi^{\mathrm{RL}}}(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\boldsymbol{\lambda}_{t})=\mathbb{E}_{\pi^{\mathrm{RL}},\hat{\mathcal{M}}}\left[\sum_{k=t}^{\infty}\gamma^{k-t}r_{k+1}\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\boldsymbol{\lambda}_{t}\right]$. Thus, we can optimize (4) using standard RL methods by additionally conditioning on $\boldsymbol{\lambda}_{t}$. The general mechanism of contextualized hybrid RL is illustrated in Algorithm 1.

Algorithm 1 Contextualized Hybrid Reinforcement Learning
1:RL policy $\hat{\pi}^{\mathrm{RL}}_{\phi}(\mathbf{a}^{\mathrm{RL}}_{t}\mid\mathbf{s}^{\mathrm{RL}}_{t},\boldsymbol{\lambda}_{t})$, control prior $\pi^{\mathrm{prior}}(\mathbf{a}^{\mathrm{prior}}_{t}\mid\mathbf{s}_{t})$, mixing function $f(\mathbf{a}^{\mathrm{RL}}_{t},\mathbf{a}^{\mathrm{prior}}_{t},\boldsymbol{\lambda}_{t})$, weight adaption function $\Lambda$, replay buffer $\mathcal{D}\leftarrow\emptyset$.
2:for each episode do
3:     Sample initial state $\mathbf{s}_{0}\sim\rho_{0}$, initialize $\boldsymbol{\lambda}_{0}$.
4:     for each step do
5:         Sample RL action $\mathbf{a}^{\mathrm{RL}}_{t}\sim\hat{\pi}^{\mathrm{RL}}_{\phi}(\mathbf{a}^{\mathrm{RL}}_{t}\mid\mathbf{s}_{t},\boldsymbol{\lambda}_{t})$.
6:         Sample control prior action $\mathbf{a}^{\mathrm{prior}}_{t}\sim\pi^{\mathrm{prior}}(\mathbf{a}^{\mathrm{prior}}_{t}\mid\mathbf{s}_{t})$.
7:         Get combined action $\mathbf{a}^{\mathrm{mix}}_{t}=f(\mathbf{a}^{\mathrm{RL}}_{t},\mathbf{a}^{\mathrm{prior}}_{t},\boldsymbol{\lambda}_{t})$.
8:         Observe state transition $\mathbf{s}_{t+1},r_{t+1}\sim p(\cdot,\cdot\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{mix}}_{t})$.
9:         Store $(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\boldsymbol{\lambda}_{t},\mathbf{s}_{t+1},r_{t+1})$ into replay buffer $\mathcal{D}$.
10:         Get next adaptive weight $\boldsymbol{\lambda}_{t+1}$ from the weight adaption function $\Lambda$.
11:         Sample set of transitions $(\mathbf{s},\mathbf{a}^{\mathrm{RL}},\boldsymbol{\lambda},\mathbf{s}^{\prime},r)\sim\mathcal{D}$.
12:         Optimize $\phi$ with respect to (4) using RL with sampled transitions.
13:     end for
14:end for
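The data flow of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the convex-combination `mix`, the toy 1-D system, the random stand-in policy, and the additive weight rule are all assumptions made for the example, and the learning update (steps 11 and 12) is left out. The point it shows is that the replay buffer stores the RL action and the weight, not the mixed action.

```python
import random


def mix(a_rl, a_prior, lam_rl):
    """Assumed mixing function f: convex combination of prior and RL action."""
    return (1.0 - lam_rl) * a_prior + lam_rl * a_rl


def run_episode(env_step, policy, prior, weight_fn, buffer, s0, lam0, num_steps):
    """Roll out one episode following steps 3 to 10 of Algorithm 1."""
    s, lam = s0, lam0
    for t in range(num_steps):
        a_rl = policy(s, lam)             # policy is conditioned on the weight
        a_prior = prior(s)                # control prior sees only the state
        a_mix = mix(a_rl, a_prior, lam)   # only the mixed action is applied
        s_next, r = env_step(s, a_mix)
        buffer.append((s, a_rl, lam, s_next, r))  # store a_rl and lam, not a_mix
        lam = weight_fn(t, s, a_rl, lam)  # next adaptive weight
        s = s_next
        # Steps 11 and 12 (sample a batch, update the policy) are omitted.
    return buffer


# Toy stand-ins (all hypothetical): 1-D drift system, left-pushing prior,
# random policy, and a weight that grows by 0.1 per step up to 1.
rng = random.Random(0)
replay = run_episode(
    env_step=lambda s, a: (s + 0.1 * a, -abs(s + 0.1 * a)),
    policy=lambda s, lam: rng.uniform(-1.0, 1.0),
    prior=lambda s: -1.0,
    weight_fn=lambda t, s, a, lam: min(1.0, lam + 0.1),
    buffer=[],
    s0=0.0,
    lam0=0.0,
    num_steps=20,
)
```

Because each stored transition carries its own weight, an off-policy learner can replay data collected under many different weights without treating the environment as non-stationary.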

Prior work takes different approaches to formulating the hybrid learning problem. Approaches with a time-invariant weight adaption function, such as fixed-weight hybrid methods, include $\Lambda$ in the definition of the environment MDP $\bar{\mathcal{M}}=(\mathcal{S},\mathcal{A},\bar{p},r,\rho_{0},\gamma)$ with dynamics $\bar{p}(\mathbf{s}_{t+1},r_{t+1}\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t})$. This formulation is directly applicable to the standard RL objective (1) but does not generalize to time-varying adaption mechanisms such as uncertainty-adapted methods. Approaches with a time-varying adaption mechanism (Cheng et al., 2019; Rana et al., 2023) typically formulate the hybrid RL problem with respect to the controlled system MDP $\mathcal{M}$ and the combined action, with dynamics $p(\mathbf{s}_{t+1},r_{t+1}\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{mix}}_{t})$. This likewise yields a formulation that is directly applicable to (1). However, the agent is then unaware of the downstream mixing process, which can cause problems. Furthermore, it introduces a distributional shift between the trained policy and the data-collecting behavior policy, which can lead to training instability and divergence (Kumar et al., 2020; Fujimoto et al., 2018).

5.2 Illustrative Example

We exemplify the strength of the contextualized hybrid RL formulation based on $\hat{\mathcal{M}}$ introduced in Section 5.1 by comparing it to prior approaches on the cart pole system depicted in Figure 2(a). The goal is to balance the pole upright while keeping the cart close to its initial position. The system is controlled via continuous forces on the cart. We choose $\pi^{\mathrm{prior}}$ to apply a constant force to the left, which destabilizes formulations that are unaware of the mixing process. We investigate a time-invariant fixed-weight setting as well as a time-varying, schedule-based weight adaption setting to highlight how the respective formulations deal with both scenarios. This simple example illustrates that the contextualized hybrid MDP formulation can deal with destabilizing control priors and time-varying weight adaption functions, while prior formulations fail.

Figure 2: We illustrate different hybrid RL formulations on a cart pole system (a) with a biased control prior that pushes to the left. Return of hybrid agents with fixed $\lambda^{\mathrm{RL}}_{t}$ (b) and variable $\lambda^{\mathrm{RL}}_{t}$ (c). Only the contextualized hybrid RL agent observing dynamics $\hat{p}$ can cope with both scenarios.

Fixed Weighting. First, we consider a residual setting with fixed weights $\lambda^{\mathrm{RL}}_{t}=\lambda^{\mathrm{prior}}_{t}=0.5$. Figure 2(b) depicts the performance of RL agents trained under $\hat{\mathcal{M}}$, $\bar{\mathcal{M}}$, and $\mathcal{M}$ with respective dynamics $\hat{p}(\mathbf{s}_{t+1},r_{t+1}\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\boldsymbol{\lambda}_{t})$, $\bar{p}(\mathbf{s}_{t+1},r_{t+1}\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t})$, and $p(\mathbf{s}_{t+1},r_{t+1}\mid\mathbf{s}_{t},\mathbf{a}^{\mathrm{mix}}_{t})$. While agents trained under $\hat{\mathcal{M}}$ and $\bar{\mathcal{M}}$ learn to stabilize the cart pole, the hybrid formulation based on $\mathcal{M}$ fails. When the RL problem is formulated with respect to $\mathbf{a}^{\mathrm{mix}}_{t}$, the RL agent observes the combined action in its data and therefore learns the combined action in its policy. This, however, neglects the fact that the policy action is mixed with the controller action before being applied to $\mathcal{M}$. Assuming the cart is not moving and the pole is upright, an agent trained under $\mathcal{M}$ outputs the optimal combined action, namely applying no force, while $\mathbf{a}^{\mathrm{prior}}_{t}$ pushes the cart to the left. This results in $\mathbf{a}^{\mathrm{mix}}_{t}$ pointing to the left, causing the pole to fall while giving the agent no mechanism to observe and counteract this effect. In contrast, formulating the hybrid RL problem with respect to $\mathbf{a}^{\mathrm{RL}}_{t}$ allows the agent to observe the mixing mechanism and compensate for the destabilizing control prior.
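The upright, motionless case can be worked through numerically. The sketch below assumes a weighted-sum mixing of the two actions with the stated weights of 0.5 each and a normalized prior force of -1; both the function names and the force value are hypothetical choices for illustration.

```python
def mix(a_rl, a_prior, lam_rl=0.5, lam_prior=0.5):
    """Assumed weighted-sum mixing of RL and prior actions."""
    return lam_rl * a_rl + lam_prior * a_prior


def compensating_action(a_target, a_prior, lam_rl=0.5, lam_prior=0.5):
    """RL action whose mixture with the prior equals the desired applied action."""
    return (a_target - lam_prior * a_prior) / lam_rl


A_PRIOR = -1.0  # constant push to the left (normalized force, hypothetical value)

# Agent that treats its own output as the applied action: it outputs "no force".
applied_unaware = mix(0.0, A_PRIOR)

# Agent that accounts for the mixing: it solves for the action that cancels the prior.
a_rl_aware = compensating_action(0.0, A_PRIOR)
applied_aware = mix(a_rl_aware, A_PRIOR)
```

Under these assumptions the unaware agent ends up applying a net force of -0.5, while the mixing-aware agent outputs +1.0 so that the applied force is zero.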

Adaptive Weighting. Second, we consider an adaptive hybrid RL problem with time-varying $\Lambda$. We choose a schedule-based approach with a regularizing mixing function (3) and $\lambda^{\mathrm{RL}}_{t}\in[0,1]$ linearly increasing over time. Figure 2(c) shows the performance of agents trained under formulations based on $\hat{\mathcal{M}}$, $\bar{\mathcal{M}}$, and $\mathcal{M}$. While agents trained under $\bar{\mathcal{M}}$ and $\mathcal{M}$ fail, the formulation based on $\hat{\mathcal{M}}$ succeeds. In the beginning, when the RL agent is given only low weight, the formulation under $\mathcal{M}$ suffers from a large distributional shift between the action of the behavior policy $\mathbf{a}^{\mathrm{mix}}_{t}$ and the action of the target policy $\mathbf{a}^{\mathrm{RL}}_{t}$. This large distributional shift causes the agent to diverge (Kumar et al., 2020; Fujimoto et al., 2018). Although the distributional shift decreases as the weight $\lambda^{\mathrm{RL}}_{t}$ increases, the agent does not manage to recover. The formulation under $\bar{\mathcal{M}}$ fails because it lacks information about the time-varying behavior of the mixing process. The proposed contextualized hybrid RL formulation solves these issues by formulating the task with respect to $\mathbf{a}^{\mathrm{RL}}_{t}$ and introducing the context variable $\boldsymbol{\lambda}_{t}$.
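A short sketch makes the non-stationarity concrete. It assumes that the regularizing mixing function is a convex combination of the prior and RL actions and uses a hypothetical schedule length of 10,000 steps; neither value is taken from the experiments.

```python
def linear_schedule(step, total_steps, lam_start=0.0, lam_end=1.0):
    """Linearly increase the RL weight from lam_start to lam_end over total_steps."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lam_start + frac * (lam_end - lam_start)


def applied_action(a_rl, a_prior, lam_rl):
    """Assumed regularizing mixing: convex combination of prior and RL action."""
    return (1.0 - lam_rl) * a_prior + lam_rl * a_rl


# The same RL action in the same state yields different applied actions
# early and late in the schedule.
a_rl, a_prior = 1.0, -1.0
early = applied_action(a_rl, a_prior, linear_schedule(1_000, 10_000))
late = applied_action(a_rl, a_prior, linear_schedule(9_000, 10_000))
```

Here `early` is close to -0.8 and `late` is close to +0.8. An agent that sees only the state and its own action cannot tell these two cases apart, whereas conditioning on the weight restores a stationary relationship between its inputs and the outcome.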

5.3 Contextualized Hybrid Ensemble Q-learning (CHEQ)

Based on the contextualized hybrid RL formulation introduced in Section 5.1, we propose the Contextualized Hybrid Ensemble Q-learning (CHEQ) algorithm. At the heart of CHEQ is a critic ensemble that (i) provides an uncertainty estimate enabling an uncertainty-adapted hybrid RL mechanism, and (ii) allows incorporating ensemble-based acceleration techniques for data-efficient RL.

We base CHEQ on the Soft Actor-Critic (SAC) (Haarnoja et al., 2018) algorithm and a regularizing mixing function (3) with $\boldsymbol{\lambda}_{t}=[(1-\lambda^{\mathrm{RL}}_{t}),\lambda^{\mathrm{RL}}_{t}]^{\top}$. The weight adaption mechanism relies on a critic ensemble comprising $E$ contextualized Q-functions with parameters $\theta_{e}$, $e\in\{1,\dots,E\}$, and corresponding target Q-functions with parameters $\bar{\theta}_{e}$, $e\in\{1,\dots,E\}$. We update the critics with the mechanism of Randomized Ensemble Double Q-learning (REDQ) (Chen et al., 2021) and enforce sufficient independence between Q-estimates using Bernoulli masking of the training data (Osband et al., 2016; Lee et al., 2021; Mai et al., 2022). Model ensembles estimate parametric uncertainty, also referred to as epistemic uncertainty, from the disagreement between individual models within the ensemble. If different critics disagree about the outcome of taking action $\mathbf{a}^{\mathrm{RL}}_{t}$ in $\mathbf{s}_{t}$ while weighting with $\boldsymbol{\lambda}_{t}$, this indicates a weak understanding of the task in that particular region of $\mathcal{S}\times\mathcal{A}\times\mathcal{W}$. Thus, the control prior should be prioritized over the RL agent in such situations. Therefore, epistemic uncertainty represents a suitable quantity for adapting the weighting of control prior and RL agent. We define epistemic uncertainty as the standard deviation of critic predictions

$$
u(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})=\sqrt{\frac{1}{E}\sum_{e=1}^{E}\left(\hat{Q}_{\theta_{e}}(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})-\mu(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})\right)^{2}}
\tag{5}
$$

with $\mu(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})=\frac{1}{E}\sum_{e=1}^{E}\hat{Q}_{\theta_{e}}(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})$. We aim to give low weight to the RL agent in areas of high uncertainty and vice versa. Thus, the weight adaption function $\Lambda(u(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t}))$ maps the critics' epistemic uncertainty to the weighting factor $\lambda^{\mathrm{RL}}_{t}\in[\lambda_{\mathrm{min}},\lambda_{\mathrm{max}}]\subseteq[0,1]$ via the piecewise linear function

$$
\lambda^{\mathrm{RL}}_{t+1}=\begin{cases}\lambda_{\mathrm{max}}&\text{if }u(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})<u_{\mathrm{min}}\\ \dfrac{u(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})-u_{\mathrm{max}}}{u_{\mathrm{min}}-u_{\mathrm{max}}}(\lambda_{\mathrm{max}}-\lambda_{\mathrm{min}})+\lambda_{\mathrm{min}}&\text{if }u(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})\in[u_{\mathrm{min}},u_{\mathrm{max}}]\\ \lambda_{\mathrm{min}}&\text{if }u(\mathbf{s}_{t},\mathbf{a}^{\mathrm{RL}}_{t},\lambda^{\mathrm{RL}}_{t})>u_{\mathrm{max}}.\end{cases}
\tag{6}
$$
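Equations (5) and (6) translate directly into code. The following is a minimal sketch that takes the ensemble's Q-predictions for one input as a plain list; the threshold and weight bounds used in the example call are placeholder values, not the paper's hyperparameters, and the function assumes $u_{\mathrm{min}}<u_{\mathrm{max}}$.

```python
import math


def epistemic_uncertainty(q_values):
    """Population standard deviation of the ensemble's Q-predictions, as in (5)."""
    mean = sum(q_values) / len(q_values)
    return math.sqrt(sum((q - mean) ** 2 for q in q_values) / len(q_values))


def adapt_weight(u, u_min, u_max, lam_min, lam_max):
    """Piecewise-linear map from uncertainty to the RL weight, as in (6)."""
    if u < u_min:
        return lam_max          # critics agree: trust the RL agent
    if u > u_max:
        return lam_min          # critics disagree: fall back on the prior
    return (u - u_max) / (u_min - u_max) * (lam_max - lam_min) + lam_min


# Example with placeholder bounds: five critics scoring the same (s, a, lambda).
u = epistemic_uncertainty([10.2, 9.8, 10.1, 9.9, 10.0])
next_lam = adapt_weight(u, u_min=0.1, u_max=0.5, lam_min=0.2, lam_max=1.0)
```

The interpolation is continuous at both thresholds: it returns `lam_max` at `u_min` and `lam_min` at `u_max`, so the weight does not jump when the uncertainty crosses a boundary.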

Besides providing an uncertainty estimate of the RL agent, the critic ensemble used in CHEQ has proven effective in mitigating overestimation bias (Thrun & Schwartz, 1993) in Q-learning-based approaches (Lan et al., 2021; Wang et al., 2021; Chen et al., 2021). The Update-To-Data (UTD) ratio describes the number of gradient steps per environment interaction. Because the critic ensemble reduces the overestimation bias, the UTD ratio can be increased while maintaining stable learning. This substantially improves the data efficiency of value-based actor-critic methods (Chen et al., 2021). Detailed pseudocode for CHEQ is provided in Algorithm 2 of Appendix A.
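The two ensemble ingredients named above, REDQ-style targets and Bernoulli masking, can be sketched as follows. This is an illustrative outline under assumptions rather than the CHEQ update from Algorithm 2: it operates on precomputed scalar Q-values (conditioning on the weight is implicit in how those values were obtained), and the discount, entropy coefficient, subset size, and keep probability are placeholder values.

```python
import random


def redq_target(reward, done, next_q_targets, next_log_prob,
                gamma=0.99, alpha=0.2, subset_size=2, rng=random):
    """Soft TD target using the minimum over a random subset of target critics."""
    subset = rng.sample(range(len(next_q_targets)), subset_size)
    min_q = min(next_q_targets[i] for i in subset)
    soft_value = min_q - alpha * next_log_prob   # SAC entropy term
    return reward + gamma * (1.0 - done) * soft_value


def bernoulli_masks(num_critics, keep_prob, rng=random):
    """One 0/1 flag per critic; critic e trains on the transition only if flag e is 1."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(num_critics)]


def masked_critic_losses(q_preds, target, masks):
    """Per-critic squared TD error, zeroed where the transition is masked out."""
    return [m * (q - target) ** 2 for q, m in zip(q_preds, masks)]


# Example: four critics, one transition, masks drawn when the transition is stored.
rng = random.Random(0)
masks = bernoulli_masks(4, keep_prob=0.8, rng=rng)
y = redq_target(1.0, 0.0, [5.0, 5.5, 4.8, 5.1], next_log_prob=-1.0, rng=rng)
losses = masked_critic_losses([4.9, 5.2, 5.0, 5.3], y, masks)
```

With a UTD ratio of G, an update built from these pieces would simply be repeated G times per environment step, each time on a freshly sampled batch.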

6 Experiments

We evaluate CHEQ on a racing task and compare it to standard RL, fixed-weighting hybrid RL, and state-of-the-art adaptive hybrid RL. In our experiments, CHEQ yields substantial improvements in (i) data efficiency compared to other hybrid methods, as well as (ii) exploration safety, and (iii) zero-shot transferability to unknown scenarios as compared to all competitor approaches.

6.1 Experimental Setup

We base our evaluation on a car racing setting adapted from Schier et al. (2023). Achieving high returns requires advanced trajectory planning and control while operating the vehicle close to its stability limits, including tire slip. The control prior follows the center line of the track, using a Stanley controller (Thrun et al., 2006) for lateral control and a proportional controller for longitudinal control. Further details are provided in Appendix B.

We compare CHEQ against the standard RL approaches SAC (Haarnoja et al., 2018) and REDQ (Chen et al., 2021), fixed-weighting hybrid RL based on SAC, and the state-of-the-art adaptive hybrid RL methods Controller Regularized RL (CORE) (Cheng et al., 2019) and Bayesian Controller Fusion (BCF) (Rana et al., 2023). In all experiments, we provide results for CHEQ with a high UTD ratio (CHEQ-UTD20) to demonstrate the capabilities of the approach and a low UTD ratio (CHEQ-UTD1) for a fair comparison to SAC-based methods. All implementations are based on either the CleanRL library (Huang et al., 2022) or the original paper implementation (Rana et al., 2023); the code is available at github.com/Data-Science-in-Mechanical-Engineering/cheq. We provide a detailed description of the hyperparameter settings in Appendix A.1.

We train all our approaches on ten random seeds and one fixed race track. We report return and cumulative training failures. A run is considered a failure when the car leaves the track. For zero-shot transfer, we evaluate ten trained models per algorithm on ten unseen racetracks. Return and failure plots show the respective mean (solid lines) and 95% confidence interval (shaded areas).
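For illustration, a mean and a 95% confidence interval over seeds can be computed as sketched below. The text does not state which interval estimator is used, so the normal approximation here is an assumption, and `mean_and_ci` is a hypothetical helper name.

```python
import math
import statistics


def mean_and_ci(values, confidence=0.95):
    """Return (mean, half_width) of a normal-approximation confidence interval.

    Assumption: the per-seed values are independent and the interval is
    mean +/- z * standard error. With few seeds, a Student-t quantile
    would give a somewhat wider interval.
    """
    n = len(values)
    mean = statistics.fmean(values)
    if n < 2:
        return mean, 0.0
    standard_error = statistics.stdev(values) / math.sqrt(n)
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return mean, z * standard_error
```

Applied to the returns of all seeds at one evaluation step, the first value gives the solid line and the second the half-width of the shaded area.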

6.2 Evaluation on the Car Racing Environment

We compare CHEQ to standard RL, fixed-weight hybrid RL, and adaptive hybrid RL concerning learning performance (see Figure 3) and zero-shot transfer to unknown tracks (see Figure 4).

Figure 3: Performance of all trained RL approaches in terms of evaluation return and training failures on the training track. Comparison of (a) CHEQ with fixed-weight hybrid RL and standard RL, (b) increased UTD ratios, and (c) prior work in adaptive hybrid RL.

Comparison against Fixed-Weight Hybrid RL and Standard RL. Comparing the CHEQ algorithm based on SAC (CHEQ-UTD1) to a standalone SAC agent and the control prior in Figure 3(a) illustrates the general benefit of hybrid RL. While the control prior operates safely without failing, it shows limited performance due to the conservative driving policy. SAC shows strong asymptotic performance at the cost of frequent failures throughout training. CHEQ-UTD1 considerably outperforms SAC concerning data efficiency and exploration safety, learning faster with fewer failures while yielding comparable asymptotic performance. The comparison of CHEQ to fixed-weight hybrid RL methods further illustrates the advantage of an adaptive weighting scheme. The fixed-weight hybrid RL approaches (0.5-SAC, 0.7-SAC) combine the control prior with a SAC agent using the mixing function in (3) with $\lambda^{\mathrm{RL}}_t = 0.5$ and $\lambda^{\mathrm{RL}}_t = 0.7$, respectively. Here, the choice of $\lambda^{\mathrm{RL}}_t$ represents a trade-off between exploration safety and asymptotic performance, where a higher $\lambda^{\mathrm{RL}}_t$ enables better performance while reducing safety. A fixed weight of $\lambda^{\mathrm{RL}}_t = 0.5$ arguably reduces failures compared to CHEQ-UTD1; however, this comes at the cost of substantially lower performance.
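The blending of prior and RL actions can be sketched as a convex combination, $\mathbf{a}^{\mathrm{mix}}_t = (1 - \lambda^{\mathrm{RL}}_t)\,\mathbf{a}^{\mathrm{prior}}_t + \lambda^{\mathrm{RL}}_t\,\mathbf{a}^{\mathrm{RL}}_t$, as also used in Algorithm 2. The helper name `mix_actions` and the representation of actions as plain lists are assumptions made for illustration.

```python
def mix_actions(a_prior, a_rl, lam_rl):
    """Blend a prior action and an RL action with weight lam_rl in [0, 1].

    lam_rl = 0 returns the prior action, lam_rl = 1 the RL action.
    """
    if not 0.0 <= lam_rl <= 1.0:
        raise ValueError("lam_rl must lie in [0, 1]")
    if len(a_prior) != len(a_rl):
        raise ValueError("actions must have the same dimension")
    return [(1.0 - lam_rl) * p + lam_rl * r for p, r in zip(a_prior, a_rl)]
```

For a fixed-weight baseline such as 0.5-SAC, `lam_rl` stays constant over the whole of training, whereas an adaptive method recomputes it at every step.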

Update-To-Data Ratio. As discussed in Section 5.3, the critic ensemble of CHEQ allows the use of acceleration techniques originally proposed in the REDQ algorithm. Increasing the UTD ratio to 20 notably improves data efficiency compared to SAC, both as a standalone RL algorithm (REDQ) and as an adaptive hybrid RL algorithm (CHEQ-UTD20), as shown in Figure 3(b). The speed-up in training helps to reduce training failures: REDQ shows a drastically reduced number of failures compared to SAC. The benefit is further amplified in the adaptive hybrid formulation of CHEQ-UTD20, as indicated by its strong initial performance and its ability to reduce the mean cumulative fails to fewer than 20.

Comparison against state-of-the-art Adaptive Hybrid RL. Finally, Figure 3(c) compares CHEQ to the most relevant adaptive hybrid RL methods. CHEQ-UTD1 shows similar data efficiency and performance compared to CORE and BCF while considerably reducing accumulated fails. CHEQ-UTD20 substantially outperforms all competitor approaches in all performance metrics. A more detailed hyperparameter analysis of the prior approaches, as well as results for reformulations of CORE and BCF as contextualized hybrid RL methods, is provided in Appendix C.

Figure 4: Comparison of return (a) and failures (b) of ten trained models per algorithm on ten transfer tracks. Development of $\lambda^{\mathrm{RL}}$ of the CHEQ-UTD20 agent for an exemplary transfer track (c).

Zero-shot Transfer. Next, we perform a zero-shot transfer of the trained agents. Returns are depicted in Figure 4(a), while Figure 4(b) shows the success rate of the respective methods. CHEQ-UTD1, CHEQ-UTD20, and the control prior achieve success rates of 97%, 95%, and 100%, respectively. The other standard and hybrid RL methods frequently fail in unseen scenarios. While the CHEQ variants fail slightly more often than the controller, they drive notably faster, i.e., they achieve higher returns. Figure 4(c) illustrates the adaption mechanism of CHEQ on one example track: in challenging and unseen curves, the agent gradually hands over to the control prior. We find that in the few failure cases (3 out of 100 for CHEQ-UTD1 and 5 out of 100 for CHEQ-UTD20), the agent correctly identifies its uncertainty and hands over to the controller, but the controller is unable to navigate the situation safely. We provide an illustration of all transfer tracks, as well as the weight adaption of CHEQ-UTD20 on these tracks, in Appendix C.2. In summary, CHEQ shows strong zero-shot transfer behavior, driving faster than the controller with only a few failure cases.

Figure 5: Final return over number of fails during training (a) and zero-shot transfer (b).

Summary. We summarize the trade-off between failures and asymptotic performance in Figure 5. Figure 5(a) illustrates the training results of the respective approaches, while Figure 5(b) depicts the transfer results. Fixed-weight hybrid RL effectively reduces failures compared to standard SAC. This, however, comes at the cost of asymptotic performance. Our adaptive CHEQ algorithm avoids this trade-off, achieving high return with only a few failures. In zero-shot transfer, the CHEQ agent again performs best due to its ability to detect unforeseen situations reliably and then fall back to the safe control prior.

7 Conclusion

This work addresses how to systematically combine an RL agent with a control prior. We propose a novel formulation of the adaptive hybrid RL problem which introduces the adaptive weighting parameter as a context variable of the MDP, and based on this, propose the Contextualized Hybrid Ensemble Q-learning (CHEQ) algorithm. CHEQ combines a reliable critic uncertainty-based weight adaption mechanism with the data efficiency of critic ensemble methods, yielding substantially stronger results than state-of-the-art adaptive hybrid RL methods on a racing task concerning data efficiency, exploration safety, and transferability.

Acknowledgments

We thank Paul Brunzema, Johanna Menn, and David Stenger for their helpful comments. We also thank Devdutt Subhasish and Lukas Kesper for their help with the cartpole example. This work was funded in part by the German Federal Ministry of Education and Research (“Demonstrations- und Transfernetzwerk KI in der Produktion (ProKI-Netz)” initiative, grant number 02P22A010) and the German Federal Ministry for Economic Affairs and Climate Action (project EEMotion). Computations were performed with computing resources granted by RWTH Aachen University under the projects <thes1594>, <rwth1490>, and <rwth1501>.

References

  • Bellemare et al. (2023) Marc G. Bellemare, Will Dabney, and Mark Rowland. Distributional Reinforcement Learning. MIT Press, 2023. http://www.distributional-rl.org.
  • Büchler et al. (2022) Dieter Büchler, Simon Guist, Roberto Calandra, Vincent Berenz, Bernhard Schölkopf, and Jan Peters. Learning to Play Table Tennis From Scratch Using Muscular Robots. IEEE Transactions on Robotics, 2022.
  • Ceola et al. (2024) Federico Ceola, Lorenzo Rosasco, and Lorenzo Natale. RESPRECT: Speeding-up Multi-fingered Grasping with Residual Reinforcement Learning, 2024. arXiv:2401.14858 [cs].
  • Chen et al. (2021) Xinyue Chen, Che Wang, Zijian Zhou, and Keith Ross. Randomized Ensembled Double Q-Learning: Learning Fast Without a Model, 2021.
  • Cheng et al. (2019) Richard Cheng, Abhinav Verma, Gabor Orosz, Swarat Chaudhuri, Yisong Yue, and Joel Burdick. Control Regularization for Reduced Variance Reinforcement Learning. In Proceedings of the 36th International Conference on Machine Learning. PMLR, 2019.
  • Daoudi et al. (2023) Paul Daoudi, Bogdan Robu, Christophe Prieur, Ludovic Dos Santos, and Merwan Barlier. Enhancing Reinforcement Learning Agents with Local Guides. In International Conference on Autonomous Agents and Multiagent Systems, 2023.
  • Fujimoto et al. (2018) Scott Fujimoto, Herke Hoof, and David Meger. Addressing Function Approximation Error in Actor-Critic Methods. In Proceedings of the 35th International Conference on Machine Learning. PMLR, 2018.
  • Gupta et al. (2021) Abhishek Gupta, Justin Yu, Tony Z. Zhao, Vikash Kumar, Aaron Rovinsky, Kelvin Xu, Thomas Devlin, and Sergey Levine. Reset-Free Reinforcement Learning via Multi-Task Learning: Learning Dexterous Manipulation Behaviors without Human Intervention. In IEEE International Conference on Robotics and Automation (ICRA), 2021.
  • Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning. PMLR, 2018.
  • Hoel et al. (2020a) Carl-Johan Hoel, Tommy Tram, and Jonas Sjöberg. Reinforcement Learning with Uncertainty Estimation for Tactical Decision-Making in Intersections. In IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), 2020a.
  • Hoel et al. (2020b) Carl-Johan Hoel, Krister Wolff, and Leo Laine. Tactical Decision-Making in Autonomous Driving by Reinforcement Learning with Uncertainty Estimation. In IEEE Intelligent Vehicles Symposium (IV), 2020b.
  • Huang et al. (2022) Shengyi Huang, Rousslan Fernand Julien Dossa, Chang Ye, Jeff Braga, Dipam Chakraborty, Kinal Mehta, and João G. M. Araújo. CleanRL: High-quality Single-file Implementations of Deep Reinforcement Learning Algorithms. Journal of Machine Learning Research, 2022.
  • Johannink et al. (2019) Tobias Johannink, Shikhar Bahl, Ashvin Nair, Jianlan Luo, Avinash Kumar, Matthias Loskyll, Juan Aparicio Ojea, Eugen Solowjow, and Sergey Levine. Residual reinforcement learning for robot control. In International Conference on Robotics and Automation (ICRA), 2019.
  • Kerbel et al. (2022) Lindsey Kerbel, Beshah Ayalew, Andrej Ivanco, and Keith Loiselle. Residual Policy Learning for Powertrain Control. IFAC-PapersOnLine, 2022.
  • Kulkarni et al. (2022) Padmaja Kulkarni, Jens Kober, Robert Babuška, and Cosimo Della Santina. Learning assembly tasks in a few minutes by combining impedance control and residual recurrent reinforcement learning. Advanced Intelligent Systems, 2022.
  • Kumar et al. (2020) Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-Learning for Offline Reinforcement Learning. In Advances in Neural Information Processing Systems, volume 33. Curran Associates, Inc., 2020.
  • Lan et al. (2021) Qingfeng Lan, Yangchen Pan, Alona Fyshe, and Martha White. Maxmin Q-learning: Controlling the Estimation Bias of Q-learning, 2021.
  • Lee et al. (2021) Kimin Lee, Michael Laskin, Aravind Srinivas, and Pieter Abbeel. Sunrise: A simple unified framework for ensemble learning in deep reinforcement learning. In International Conference on Machine Learning. PMLR, 2021.
  • Mai et al. (2022) Vincent Mai, Kaustubh Mani, and Liam Paull. Sample efficient deep reinforcement learning via uncertainty estimation. arXiv preprint arXiv:2201.01666, 2022.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, and Georg Ostrovski. Human-level control through deep reinforcement learning. Nature, 518(7540), 2015.
  • OpenAI et al. (2019) OpenAI, Christopher Berner, Greg Brockman, Brooke Chan, Vicki Cheung, Przemysław Dębiak, Christy Dennison, David Farhi, Quirin Fischer, Shariq Hashme, Chris Hesse, Rafal Józefowicz, Scott Gray, Catherine Olsson, Jakub Pachocki, Michael Petrov, Henrique P. d O. Pinto, Jonathan Raiman, Tim Salimans, Jeremy Schlatter, Jonas Schneider, Szymon Sidor, Ilya Sutskever, Jie Tang, Filip Wolski, and Susan Zhang. Dota 2 with Large Scale Deep Reinforcement Learning, 2019.
  • Osband et al. (2016) Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep Exploration via Bootstrapped DQN. In Advances in Neural Information Processing Systems, volume 29, 2016.
  • Rana et al. (2020a) Krishan Rana, Vibhavari Dasagi, Ben Talbot, Michael Milford, and Niko Sünderhauf. Multiplicative Controller Fusion: Leveraging Algorithmic Priors for Sample-efficient Reinforcement Learning and Safe Sim-To-Real Transfer. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020a.
  • Rana et al. (2020b) Krishan Rana, Ben Talbot, Vibhavari Dasagi, Michael Milford, and Niko Sünderhauf. Residual Reactive Navigation: Combining Classical and Learned Navigation Strategies For Deployment in Unknown Environments. In 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020b.
  • Rana et al. (2023) Krishan Rana, Vibhavari Dasagi, Jesse Haviland, Ben Talbot, Michael Milford, and Niko Sünderhauf. Bayesian controller fusion: Leveraging control priors in deep reinforcement learning for robotics. The International Journal of Robotics Research, 2023.
  • Ranjbar et al. (2021) Alireza Ranjbar, Ngo Anh Vien, Hanna Ziesche, Joschka Boedecker, and Gerhard Neumann. Residual Feedback Learning for Contact-Rich Manipulation Tasks with Uncertainty. In 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2021.
  • Schier et al. (2023) Maximilian Schier, Christoph Reinders, and Bodo Rosenhahn. Learned Fourier Bases for Deep Set Feature Extractors in Automotive Reinforcement Learning. In IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), Bilbao, Spain, 2023.
  • Schoettler et al. (2020) Gerrit Schoettler, Ashvin Nair, Jianlan Luo, Shikhar Bahl, Juan Aparicio Ojea, Eugen Solowjow, and Sergey Levine. Deep reinforcement learning for industrial insertion tasks with visual inputs and natural rewards. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.
  • Schulman et al. (2017) J. Schulman, F. Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. ArXiv, 2017.
  • Silver et al. (2018a) David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy Lillicrap, Karen Simonyan, and Demis Hassabis. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play. Science, 2018a.
  • Silver et al. (2018b) Tom Silver, Kelsey Allen, Josh Tenenbaum, and Leslie Kaelbling. Residual policy learning. arXiv preprint arXiv:1812.06298, 2018b.
  • Sutton & Barto (2018) Richard S. Sutton and Andrew G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Thrun & Schwartz (1993) Sebastian Thrun and Anton Schwartz. Issues in Using Function Approximation for Reinforcement Learning. In Proceedings of the Fourth Connectionist Models Summer School, 1993.
  • Thrun et al. (2006) Sebastian Thrun, Mike Montemerlo, Hendrik Dahlkamp, David Stavens, Andrei Aron, James Diebel, Philip Fong, John Gale, Morgan Halpenny, Gabriel Hoffmann, Kenny Lau, Celia Oakley, Mark Palatucci, Vaughan Pratt, Pascal Stang, Sven Strohband, Cedric Dupont, Lars-Erik Jendrossek, Christian Koelen, Charles Markey, Carlo Rummel, Joe van Niekerk, Eric Jensen, Philippe Alessandrini, Gary Bradski, Bob Davies, Scott Ettinger, Adrian Kaehler, Ara Nefian, and Pamela Mahoney. Stanley: The robot that won the DARPA Grand Challenge. Journal of Field Robotics, 2006.
  • Wang et al. (2021) Hang Wang, Sen Lin, and Junshan Zhang. Adaptive Ensemble Q-learning: Minimizing Estimation Bias via Error Feedback. In Advances in Neural Information Processing Systems, volume 34, 2021.

Appendix A Algorithmic Details

Algorithm 2 shows the pseudocode of the Contextualized Hybrid Ensemble Q-learning algorithm.

Algorithm 2 CHEQ
Initialize control prior $\pi^{\mathrm{prior}}(\mathbf{a}^{\mathrm{prior}}_t \mid \mathbf{s}_t)$, contextualized RL policy $\hat{\pi}^{\mathrm{RL}}_{\phi}(\mathbf{a}^{\mathrm{RL}}_t \mid \mathbf{s}_t, \lambda^{\mathrm{RL}}_t)$, contextualized critic ensemble $\hat{Q}_{\theta_e}(\mathbf{s}_t, \mathbf{a}^{\mathrm{RL}}_t, \lambda^{\mathrm{RL}}_t)$, $e \in \{1, \dots, E\}$, contextualized target critic ensemble $\hat{Q}_{\bar{\theta}_e}(\mathbf{s}_t, \mathbf{a}^{\mathrm{RL}}_t, \lambda^{\mathrm{RL}}_t)$, $e \in \{1, \dots, E\}$, replay buffer $\mathcal{D} \leftarrow \emptyset$, weighting interval $[\lambda_{\mathrm{min}}, \lambda_{\mathrm{max}}]$, uncertainty limits $[u_{\mathrm{min}}, u_{\mathrm{max}}]$, UTD ratio $G$, Bernoulli masking rate $\kappa$, minimization targets $F$, Polyak averaging factor $\tau$
for each epoch do
     $\mathbf{s}_0 \sim \rho_0$, $\lambda^{\mathrm{RL}}_0 = \lambda_{\mathrm{min}}$
     for each epoch step do
          $\mathbf{a}^{\mathrm{RL}}_t \sim \hat{\pi}^{\mathrm{RL}}_{\phi}(\cdot \mid \mathbf{s}_t, \lambda^{\mathrm{RL}}_t)$
          $\mathbf{a}^{\mathrm{prior}}_t \sim \pi^{\mathrm{prior}}(\cdot \mid \mathbf{s}_t)$
          $\mathbf{a}^{\mathrm{mix}}_t = (1 - \lambda^{\mathrm{RL}}_t)\,\mathbf{a}^{\mathrm{prior}}_t + \lambda^{\mathrm{RL}}_t\,\mathbf{a}^{\mathrm{RL}}_t$
          Compute $u(\mathbf{s}_t, \mathbf{a}^{\mathrm{RL}}_t, \lambda^{\mathrm{RL}}_t)$ according to (5)
          $\lambda^{\mathrm{RL}}_{t+1} = \Lambda(u(\mathbf{s}_t, \mathbf{a}^{\mathrm{RL}}_t, \lambda^{\mathrm{RL}}_t))$ according to (6)
          $\mathbf{s}_{t+1}, r_{t+1} \sim \hat{p}(\cdot, \cdot \mid \mathbf{s}_t, \mathbf{a}^{\mathrm{RL}}_t, \lambda^{\mathrm{RL}}_t)$
          for $e = 1, \dots, E$ do
               Sample Bernoulli mask $m_t^e \sim \mathrm{Ber}(\kappa)$
          end for
          $\mathcal{D} \leftarrow \mathcal{D} \cup \{(\mathbf{s}_t, \mathbf{a}^{\mathrm{RL}}_t, \lambda^{\mathrm{RL}}_t, \mathbf{s}_{t+1}, r_{t+1}), m_t^1, \dots, m_t^E\}$
          for $G$ updates do
               Sample mini-batch $\mathcal{B} = \{(\mathbf{s}, \mathbf{a}^{\mathrm{RL}}, \lambda^{\mathrm{RL}}, \mathbf{s}', r)\}$ from $\mathcal{D}$
               Sample a set $\mathcal{F}$ with $|\mathcal{F}| = F$ uniformly at random from $\{1, \dots, E\}$
               $\tilde{\mathbf{a}}^{\prime\mathrm{RL}} \sim \hat{\pi}^{\mathrm{RL}}_{\phi}(\cdot \mid \mathbf{s}', \lambda^{\mathrm{RL}})$
               $y = r + \gamma \left( \min_{e \in \mathcal{F}} \hat{Q}_{\bar{\theta}_e}(\mathbf{s}', \tilde{\mathbf{a}}^{\prime\mathrm{RL}}, \lambda^{\mathrm{RL}}) - \alpha \log \hat{\pi}^{\mathrm{RL}}_{\phi}(\tilde{\mathbf{a}}^{\prime\mathrm{RL}} \mid \mathbf{s}', \lambda^{\mathrm{RL}}) \right)$
               for $e = 1, \dots, E$ do
                    Update $\theta_e$ with gradient descent using
                    $\mathbb{1}_{m^e} \nabla_{\theta_e} \frac{1}{|\mathcal{B}|} \sum_{(\mathbf{s}, \mathbf{a}^{\mathrm{RL}}, \lambda^{\mathrm{RL}}, r, \mathbf{s}') \in \mathcal{B}} \left( \hat{Q}_{\theta_e}(\mathbf{s}, \mathbf{a}^{\mathrm{RL}}, \lambda^{\mathrm{RL}}) - y \right)^2$
                    $\bar{\theta}_e \leftarrow \tau \bar{\theta}_e + (1 - \tau)\,\theta_e$
               end for
          end for
          Update $\phi$ with gradient ascent using
          $\tilde{\mathbf{a}}^{\mathrm{RL}} \sim \hat{\pi}^{\mathrm{RL}}_{\phi}(\cdot \mid \mathbf{s}, \lambda^{\mathrm{RL}})$
          $\nabla_{\phi} \frac{1}{|\mathcal{B}|} \sum_{\mathbf{s} \in \mathcal{B}} \left( \frac{1}{E} \sum_{e=1}^{E} \hat{Q}_{\theta_e}(\mathbf{s}, \tilde{\mathbf{a}}^{\mathrm{RL}}, \lambda^{\mathrm{RL}}) - \alpha \log \hat{\pi}^{\mathrm{RL}}_{\phi}(\tilde{\mathbf{a}}^{\mathrm{RL}} \mid \mathbf{s}, \lambda^{\mathrm{RL}}) \right)$
     end for
end for
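The two ensemble-specific steps of Algorithm 2, the Bernoulli masks and the target computed from a random subset of target critics, can be sketched in plain Python as follows. The helper names `sample_masks` and `td_target` are hypothetical, the target critic outputs are passed in as precomputed numbers instead of network evaluations, and episode termination is not handled because Algorithm 2 does not specify it.

```python
import random


def sample_masks(num_critics, kappa, rng):
    """Draw one Bernoulli(kappa) mask per critic for a new transition.

    A mask of 1 means the critic trains on the transition, 0 means it skips it.
    """
    return [1 if rng.random() < kappa else 0 for _ in range(num_critics)]


def td_target(reward, gamma, alpha, target_q_values, log_prob_next,
              num_min_targets, rng):
    """Soft target y = r + gamma * (min over a random critic subset - alpha * log pi).

    target_q_values: outputs of all E target critics for the next state, the
        resampled next action, and the stored weight (assumed precomputed).
    log_prob_next: log-probability of the resampled next action.
    num_min_targets: subset size F, with F <= E.
    """
    subset = rng.sample(range(len(target_q_values)), num_min_targets)
    min_q = min(target_q_values[e] for e in subset)
    return reward + gamma * (min_q - alpha * log_prob_next)
```

Taking the minimum over only $F$ of the $E$ target critics is what allows the pessimism of the target to be tuned independently of the ensemble size.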

A.1 Hyperparameter Settings

We build our SAC implementation based on CleanRL (Huang et al., 2022). All SAC-specific hyperparameters are kept consistent between all approaches and reported in Table 1.

Table 1: Shared Hyperparameters.
Hyperparameter | Value
number of steps | $1.5 \times 10^{6}$
batch size | 256
learning rate actor | $3 \times 10^{-4}$
learning rate critic | $3 \times 10^{-4}$
target entropy $H_t$ | $-3$
replay buffer size | $1 \times 10^{6}$
discount factor $\gamma$ | $0.99$
gradient update start | $1 \times 10^{3}$ steps
Polyak averaging factor $\tau$ | $0.005$

CHEQ (UTD1 and UTD20) uses an ensemble of $E=5$ critics. We set the upper bound of the uncertainty to $u_{\mathrm{max}}=0.15$ and the lower bound to $u_{\mathrm{min}}=0.03$. Further, we set $\lambda_{\mathrm{max}}=1.0$ and $\lambda_{\mathrm{min}}=0.2$. We use a Bernoulli masking rate of $\kappa=0.8$ and $F=2$ minimization targets.
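To illustrate how these four bounds could interact, the following minimal Python sketch maps an uncertainty estimate to a weight. It assumes a clipped linear interpolation that returns $\lambda_{\mathrm{max}}$ for uncertainties at or below $u_{\mathrm{min}}$ and $\lambda_{\mathrm{min}}$ for uncertainties at or above $u_{\mathrm{max}}$. The function name `rl_weight` and the exact shape of the mapping are assumptions of this sketch and may differ from the implementation used in the experiments.

```python
def rl_weight(u, u_min=0.03, u_max=0.15, lam_min=0.2, lam_max=1.0):
    """Map an uncertainty estimate u to a weight in [lam_min, lam_max].

    Assumption: clipped linear interpolation between the two bounds.
    Low uncertainty gives the RL agent full weight, high uncertainty
    hands most of the control back to the prior.
    """
    if u <= u_min:
        return lam_max
    if u >= u_max:
        return lam_min
    # Fraction of the way from u_min to u_max, in (0, 1).
    frac = (u - u_min) / (u_max - u_min)
    return lam_max + frac * (lam_min - lam_max)
```

With the default values above, an uncertainty halfway between the bounds ($u=0.09$) yields a weight of roughly 0.6.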

BCF trains an ensemble of policy networks. We maintain the original ensemble size from Rana et al. (2023), which uses ten policy networks. We set the standard deviation of the control prior in BCF to $\sigma^{\mathrm{prior}}=6.0$.

For the uncertainty estimate in CORE, we set $A=7$ and $C=0.02$. Note that in the original paper, $A$ is denoted as $\lambda_{\mathrm{max}}$, which we change to avoid ambiguous notation.

SAC uses a UTD ratio of 1. The REDQ implementation uses an ensemble size of 5 and a UTD ratio of 20.

For all algorithms, we include a random sampling phase for the first $1 \times 10^{3}$ steps, during which we sample the RL action uniformly at random and do not update the agent. In this phase, we keep $\lambda^{\mathrm{RL}}_{t}$ small for the hybrid agents. For CHEQ, we vary $\lambda^{\mathrm{RL}}$ within $[0.2, 0.3]$. As CORE and BCF are unable to observe changes in $\lambda^{\mathrm{RL}}$, we use a fixed $\lambda^{\mathrm{RL}}=0.2$, which has proven favorable in our experiments. After the random sampling phase, agent training starts, but $\lambda^{\mathrm{RL}}$ is kept small for another $4 \times 10^{3}$ steps; afterward, the $\lambda$ adaption starts.
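The three phases described above can be sketched in Python as follows. The step boundaries and the weight values are taken from the text. Everything else is an assumption of this sketch: the function names `training_phase` and `warmup_weight` are hypothetical, the CHEQ weight is assumed to be drawn uniformly from $[0.2, 0.3]$, and the same small weights are assumed to be used in the second phase.

```python
import random

RANDOM_STEPS = 1_000   # random actions, no agent updates
WARMUP_STEPS = 4_000   # agent updates, weight still kept small


def training_phase(step):
    """Return the phase name for a zero-based environment step."""
    if step < RANDOM_STEPS:
        return "random"
    if step < RANDOM_STEPS + WARMUP_STEPS:
        return "warmup"
    return "adaptive"


def warmup_weight(agent, rng):
    """Small weight used before the adaption starts.

    Assumption: CHEQ draws the weight uniformly from [0.2, 0.3] so that
    the context variable varies; CORE and BCF use the fixed value 0.2.
    """
    if agent == "CHEQ":
        return rng.uniform(0.2, 0.3)
    return 0.2
```

A caller would pass its own generator, for example `warmup_weight("CHEQ", random.Random(0))`, to keep the warm-up reproducible.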

For performance evaluation, we conduct a greedy evaluation run every 20 episodes. Evaluation takes place in the adaptive hybrid setting, i.e., together with the controller, with the weight calculated as in the training procedure.

Appendix B Environment Details

B.1 Racing Environment

We test our agent on the simulated racing task adapted from Schier et al. (2023). Figure 6 shows an example of the environment.

Figure 6: Racing task and the training track (two panels, (a) and (b)).

The vehicle uses a dynamic single-track model with a coupled Dugoff tire model. The throttle, brake, and steering are continuous actions. The vehicle is front-wheel drive. The RL agent may learn to control brake balance by applying throttle and brake individually. We define the state of the RL agent as $\mathbf{s}_{t}=(v_{x},v_{y},\omega,\beta,o_{\mathrm{track}})$, with the ego vehicle's velocity vector $\mathbf{v}_{\mathrm{ego}}=(v_{x},v_{y})$ in the vehicle reference frame, steering angle $\beta$, and yaw rate $\omega$. The observation of the track $o_{\mathrm{track}}=(\mathbf{x},\mathbf{y})^{T}$ is given as a vector of 20 Cartesian distances $(x_{i},y_{i})$ to the centerline of the track. The points are sampled equidistantly from the 60 m track segment ahead.

We use the original reward formulation from Schier et al. (2023), where the RL agent receives a penalty $r_{\mathrm{collision}}$ whenever it collides with the track boundary and a penalty $r_{\mathrm{fail}}$ for leaving the track with the center of mass. The latter case also terminates the episode. The RL agent receives a positive reward for driving fast: the scalar projection of its velocity vector $\mathbf{v}_{\mathrm{ego}}$ onto the forward track direction $\mathbf{n}_{\mathrm{track}}$. The complete reward is then given by

$$r(s,a)=-r_{\mathrm{fail}}-0.2\cdot r_{\mathrm{collision}}+0.01\cdot\mathbf{n}_{\mathrm{track}}\cdot\mathbf{v}_{\mathrm{ego}}. \tag{7}$$
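As a small worked example of Equation 7, the following Python sketch evaluates the reward for given penalty terms and velocity. The function name `racing_reward` is hypothetical, and $r_{\mathrm{fail}}$ and $r_{\mathrm{collision}}$ are treated as non-negative inputs because their numerical values are not specified in this section.

```python
def racing_reward(v_ego, n_track, r_fail=0.0, r_collision=0.0):
    """Evaluate Equation 7 for one transition.

    v_ego:       ego velocity vector (v_x, v_y)
    n_track:     unit vector of the forward track direction
    r_fail:      penalty term for leaving the track (assumed >= 0)
    r_collision: penalty term for touching the boundary (assumed >= 0)
    """
    # Scalar projection of the velocity onto the track direction.
    progress = sum(n * v for n, v in zip(n_track, v_ego))
    return -r_fail - 0.2 * r_collision + 0.01 * progress
```

For instance, driving at 10 m/s exactly along the track direction without any penalty gives a reward of about 0.1 per step.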

B.2 Control Prior

For the racing task, we design a simple path-following controller with adaptive speeds. For the lateral control, we use a Stanley Controller (Thrun et al., 2006) following the steering control law

$$\delta(t)=\psi(t)+\frac{k_{\mathrm{cross}}\cdot e(t)}{v(t)+k_{\mathrm{soft}}},$$

where $\psi(t)$ denotes the heading error, $e(t)$ denotes the crosstrack error of the front axle, and $v(t)$ describes the velocity of the vehicle.
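The steering law can be written as a short Python function. This is a sketch of the equation exactly as stated above, with the tuned gains $k_{\mathrm{cross}}=0.5$ and $k_{\mathrm{soft}}=1$ reported for this controller as defaults. The function name `stanley_steering` is hypothetical, and angles are assumed to be in radians.

```python
def stanley_steering(heading_error, crosstrack_error, velocity,
                     k_cross=0.5, k_soft=1.0):
    """Steering command delta(t) of the lateral control law.

    heading_error:    psi(t), assumed to be in radians
    crosstrack_error: e(t), measured at the front axle in meters
    velocity:         v(t) in m/s
    The softening constant k_soft keeps the denominator away from
    zero at standstill.
    """
    return heading_error + k_cross * crosstrack_error / (velocity + k_soft)
```

For a fixed crosstrack error, the correction term shrinks as the velocity grows, so the controller steers less aggressively at high speed.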

For the longitudinal control, we design two symmetric P-controllers: one for the brake and one for the throttle. First, we compute the target velocity dependent on the curve radius $R(t)$ of the track directly in front of the vehicle as

$$v_{\mathrm{target}}(t)=\min\{k_{r}\cdot R(t),\,v_{\mathrm{max}}\},$$

where $v_{\mathrm{max}}$ is the maximum desired velocity. Then, we design the throttle control as

$$\mathrm{throt}(t)=\begin{cases}k_{v}(t)\cdot(v_{\mathrm{target}}(t)-v(t)),&\quad v_{\mathrm{target}}(t)-v(t)\geq 0\\ 0,&\quad\text{else}\end{cases}$$

and the brake control as

$$\mathrm{br}(t)=\begin{cases}k_{v}(t)\cdot(v(t)-v_{\mathrm{target}}(t)),&\quad v_{\mathrm{target}}(t)-v(t)\leq 0\\ 0,&\quad\text{else,}\end{cases}$$

with shared gain $k_{v}$. Following this control law, the control prior accelerates if it is going too slow and brakes if it is going too fast. It never uses the brake and the throttle at the same time.

To avoid aggressive braking behavior when the RL agent hands over to the controller in risky situations (high velocity around curves), we additionally introduce a simple clipped linear gain schedule on $k_{v}$, attenuating the braking control for higher velocities as

$$k_{v}(t)=\mathrm{clip}\left(\frac{k_{v}^{\mathrm{max}}-k_{v}^{\mathrm{min}}}{v_{\mathrm{low}}-v_{\mathrm{high}}}\,(v(t)-v_{\mathrm{low}})+k_{v}^{\mathrm{max}};\;k_{v}^{\mathrm{max}},\,k_{v}^{\mathrm{min}}\right).$$

We tuned the controller gains and coefficients to $k_{\mathrm{cross}}=0.5\,[\mathrm{1/s}]$, $k_{\mathrm{soft}}=1\,[\mathrm{m\,s^{-1}}]$, $k_{r}=0.4\,[\mathrm{1/s}]$, $v_{\mathrm{max}}=8\,[\mathrm{m\,s^{-1}}]$, $k_{v}^{\mathrm{max}}=0.25\,[\mathrm{s\,m^{-1}}]$, $k_{v}^{\mathrm{min}}=0.05\,[\mathrm{s\,m^{-1}}]$, $v_{\mathrm{low}}=8\,[\mathrm{m\,s^{-1}}]$, and $v_{\mathrm{high}}=28\,[\mathrm{m\,s^{-1}}]$.
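The longitudinal control law and its gain schedule can be sketched in Python with the tuned values listed above. The function names `gain_schedule` and `longitudinal_control` are hypothetical. Following the equations as written, the scheduled gain is applied to both the throttle and the brake branch, and no output saturation is applied because none is specified here.

```python
K_R = 0.4        # [1/s]  target velocity per meter of curve radius
V_MAX = 8.0      # [m/s]  maximum desired velocity
K_V_MAX = 0.25   # [s/m]  gain at and below V_LOW
K_V_MIN = 0.05   # [s/m]  gain at and above V_HIGH
V_LOW = 8.0      # [m/s]
V_HIGH = 28.0    # [m/s]


def gain_schedule(v):
    """Clipped linear schedule: K_V_MAX at V_LOW down to K_V_MIN at V_HIGH."""
    slope = (K_V_MAX - K_V_MIN) / (V_LOW - V_HIGH)
    k_v = slope * (v - V_LOW) + K_V_MAX
    return min(max(k_v, K_V_MIN), K_V_MAX)


def longitudinal_control(v, curve_radius):
    """Return (throttle, brake) for velocity v and upcoming curve radius."""
    v_target = min(K_R * curve_radius, V_MAX)
    k_v = gain_schedule(v)
    # Only one of the two P-controllers is active at any time.
    throttle = k_v * (v_target - v) if v_target - v >= 0 else 0.0
    brake = k_v * (v - v_target) if v_target - v <= 0 else 0.0
    return throttle, brake
```

At 18 m/s the scheduled gain is about 0.15, so braking toward the 8 m/s target is noticeably softer than it would be with the full gain of 0.25.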

Appendix C Additional Results and Ablation Study

In this section, we formulate contextualized hybrid variants of CORE and BCF, named C-CORE and C-BCF. Further, we present additional results on hyperparameter sensitivity and the distribution of $\lambda^{\mathrm{RL}}$ for CHEQ-UTD1, CORE, and BCF. For a reasonable comparison, we focus mainly on CHEQ-UTD1, which uses a UTD ratio of 1.

C.1 Contextualized Hybrid Variants of Prior Work

To further substantiate our claim that the contextualized hybrid RL formulation aids the training progress, we developed contextualized variants of the CORE and BCF algorithms, which we call C-CORE and C-BCF. To use the adaptive weight as a context variable, a weighting parameter $\lambda_{t}^{\mathrm{RL}}$ needs to be determined.

The CORE algorithm comes with a direct weight estimate (Cheng et al., 2019), which can be written as

$$\lambda_{t}^{\mathrm{RL}}=\frac{1}{1+\lambda_{t}^{\mathrm{CORE}}},$$

where $\lambda_{t}^{\mathrm{CORE}}=A(1-e^{-C\lvert\delta_{t-1}\rvert})$ with the TD-error $\delta_{t-1}$ and $C$, $A$ being tuning parameters. (Footnote 2: CORE (Cheng et al., 2019) uses the term $\lambda_{\mathrm{max}}$ instead of $A$. As we use $\lambda_{\mathrm{max}}$ in a different context, we stick to $A$ here.)
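The two expressions above translate directly into a few lines of Python. The sketch below uses the values $A=7$ and $C=0.02$ reported for CORE in the hyperparameter settings as defaults; the function name `core_rl_weight` is hypothetical.

```python
import math


def core_rl_weight(td_error, A=7.0, C=0.02):
    """Weight of the RL action derived from the CORE weight estimate.

    td_error: TD-error of the previous step (delta_{t-1}).
    A, C:     tuning parameters of CORE.
    """
    lam_core = A * (1.0 - math.exp(-C * abs(td_error)))
    return 1.0 / (1.0 + lam_core)
```

A TD-error of zero gives the RL agent full weight, and the weight approaches $1/(1+A)=0.125$ for very large TD-errors, so with these parameters the RL agent always retains some influence.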

For C-BCF, we derive a pseudo-weight, as the BCF algorithm (Rana et al., 2023) does not have a straightforward weighting factor $\lambda^{\mathrm{RL}}$. In BCF, at timestep $t-1$, the fusion of the prior $\mathcal{N}_{\psi,t-1}(\mu_{\psi,t-1},\sigma_{\psi,t-1})$ and the SAC ensemble $\mathcal{N}_{\pi,t-1}(\mu_{\pi,t-1},\sigma_{\pi,t-1})$ results in a Gaussian distribution with mean

$$\mu_{\mathrm{fuse},t-1}=\frac{\sigma_{\psi,t-1}^{2}}{\sigma_{\psi,t-1}^{2}+\sigma_{\pi,t-1}^{2}}\cdot\mu_{\pi,t-1}+\frac{\sigma_{\pi,t-1}^{2}}{\sigma_{\psi,t-1}^{2}+\sigma_{\pi,t-1}^{2}}\cdot\mu_{\psi,t-1}.$$

Thus, in BCF the weight has the dimension of the action space, whereas the contextualized mechanism requires a scalar weight $\lambda^{\mathrm{RL}}_{t}$. For C-BCF, we compute a scalar weight

$$\lambda^{\mathrm{RL}}_{t}=\frac{1}{N}\sum_{i=1}^{N}\left(\frac{\sigma_{\psi,t-1}^{2}}{\sigma_{\psi,t-1}^{2}+\sigma_{\pi,t-1}^{2}}\right)_{i}.$$

The next action $\mathbf{a}^{\mathrm{mix}}_{t}$ is computed according to Equation 3, where $\mathbf{a}^{\mathrm{RL}}_{t}\sim\mathcal{N}_{\pi,t}$ and $\mathbf{a}^{\mathrm{prior}}_{t}=\mu_{\psi,t}$.
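The scalar pseudo-weight can be sketched in Python as follows. The function name `cbcf_scalar_weight` is hypothetical, and $N$ is assumed to be the number of action dimensions, so that the per-dimension fusion weights of the RL mean are averaged into a single value.

```python
def cbcf_scalar_weight(sigma_prior, sigma_policy):
    """Average the per-dimension BCF fusion weights into one scalar.

    sigma_prior:  standard deviations of the control prior, one per
                  action dimension
    sigma_policy: standard deviations of the SAC ensemble, one per
                  action dimension
    Assumption: at least one of the two deviations is non-zero in
    every dimension, otherwise the ratio is undefined.
    """
    weights = [
        s_prior ** 2 / (s_prior ** 2 + s_policy ** 2)
        for s_prior, s_policy in zip(sigma_prior, sigma_policy)
    ]
    return sum(weights) / len(weights)
```

Equal uncertainty in every dimension gives a weight of 0.5, while a policy ensemble that is far more certain than the prior pushes the weight toward one.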

For C-BCF and C-CORE we use the same warm-up phase as for our CHEQ agent.

C.2 Additional Results and Hyperparameter Tuning for CHEQ

Figure 7: Comparison of CHEQ-UTD1 with different $u_{\mathrm{max}}$ thresholds, showing return (a) and number of fails (b) for the racing environment.
Figure 8: Distributions of $\lambda^{\mathrm{RL}}$ over training steps, shown for all hybrid agents.
Figure 9: The 10 transfer tracks and the corresponding zero-shot transfer of one seed of CHEQ-UTD20.

Our algorithm has only two important hyperparameters, the upper and lower bounds of the uncertainty, $u_{\mathrm{max}}$ and $u_{\mathrm{min}}$. We chose these hyperparameters by conducting one training run and investigating the uncertainty range within this run. CHEQ is generally robust against changes in these thresholds. We observe a slightly lower final return and fewer fails for smaller values of the upper bound $u_{\mathrm{max}}$. This is to be expected, as a more frequent handover to the control prior results in lower velocities and thus lower return. Figure 7 shows the return and the number of fails during training for our CHEQ-UTD1 variants. For CHEQ-UTD20, we were able to use the same upper and lower uncertainty bounds as for CHEQ-UTD1.

Figure 8 shows the distribution of the weight $\lambda^{\mathrm{RL}}$ over the training progress. We find that for CHEQ-UTD1 the agent starts with an almost uniform distribution of the weight and slowly moves towards a $\lambda^{\mathrm{RL}}=1$ regime. Even in later training stages, the agent hands over to the controller from time to time.

Lastly, we investigated the transfer behavior of the CHEQ-UTD20 agent further. Figure 9 shows the ten transfer tracks. We plot $\lambda^{\mathrm{RL}}$ over the track for one evaluated model. Here, we find that the agent frequently becomes uncertain and hands over to the controller, especially in unknown curves. In plot C, we see one of the two failure cases (out of 100 runs) that we experienced during transfer. We find that the agent becomes uncertain and hands over to the controller. In this specific scenario, however, the controller is not able to safely navigate the situation, and the vehicle leaves the track.

C.3 Additional Results and Hyperparameter Tuning for CORE and C-CORE

Figure 10: Comparison of return (a) and number of fails (b) of CORE and C-CORE runs with different $C$ parameters.

In Figure 10, we compare different parameters $C$. We find stable training progress and high return for $C=0.02$, but since this setting uses high $\lambda^{\mathrm{RL}}$ values, it results in many failures. Figure 8 illustrates the high-$\lambda$ regime of the $C=0.02$ agent. C-CORE, using our contextualized hybrid framework, notably outperforms CORE in terms of asymptotic return, training stability, and the number of training failures. Using the contextualized formulation, C-CORE can use a much wider $\lambda^{\mathrm{RL}}$ distribution (see Figure 8).

C.4 Additional Results and Hyperparameter Tuning for BCF and C-BCF

The BCF algorithm is sensitive to the parameters of the uncertainty threshold, which in this case is the standard deviation of the control prior $\sigma^{\mathrm{prior}}$. Higher values lead to less weight on the control prior and thus to high-$\lambda^{\mathrm{RL}}$ regimes. In Figure 11, we compare different parameters $\sigma^{\mathrm{prior}}$. We find stable training progress and high return for $\sigma^{\mathrm{prior}}=6$. However, this setting, as expected, uses a $\lambda^{\mathrm{RL}}$ regime close to one and thus results in a high number of failures. Figure 8 illustrates this regime.

Our C-BCF variant can resolve this problem only partially. Due to its construction, the BCF algorithm has a separate weighting factor for each action dimension, of which we take the mean. In addition, our pseudo-weight $\lambda^{\mathrm{RL}}$ only approximates the actual mixing, as BCF samples from the posterior distribution. Both factors result in information loss and limit the accuracy of the weight $\lambda^{\mathrm{RL}}$. We find that C-BCF-2.0 and BCF-6.0 achieve similar asymptotic performance, while C-BCF-2.0 leads to fewer failures.

Figure 11: Comparison of return (a) and number of fails (b) of BCF and C-BCF runs with different $\sigma^{\mathrm{prior}}$ parameters.

C.5 Return vs. Failure Comparison for All Trained Models

Figure 12 shows a scatter plot of the final return and the accumulated failures during training for all hybrid algorithms discussed in this paper. This final comparison shows that if prior methods are trained with the contextualized framework and tuned well (C-BCF-0.8, C-CORE-0.4, C-CORE-0.8), they achieve high returns while maintaining fewer failures than their non-contextualized counterparts. Our algorithm (CHEQ-UTD1, CHEQ-UTD20) achieves the highest return while maintaining the lowest number of cumulative failures.

Figure 12: Scatter plot of the return and number of fails for different hyperparameters of BCF, CORE, C-BCF, and C-CORE, with CHEQ-UTD20, CHEQ-UTD1, and the prior for comparison.