License: CC BY 4.0
arXiv:2403.14860v1 [eess.SY] 21 Mar 2024

Robust Model Based Reinforcement Learning Using $\mathcal{L}_1$ Adaptive Control

Minjun Sung*, Sambhu H. Karumanchi*, Aditya Gahlawat, Naira Hovakimyan
Department of Mechanical Science & Engineering
University of Illinois Urbana-Champaign
Urbana, IL, 61801, USA
{mjsung2,shk9,gahlawat,nhovakim}@illinois.edu
*The authors equally contributed to this work.
Abstract

We introduce $\mathcal{L}_1$-MBRL, a control-theoretic augmentation scheme for Model-Based Reinforcement Learning (MBRL) algorithms. Unlike model-free approaches, MBRL algorithms learn a model of the transition function using data and use it to design a control input. Our approach generates a series of approximate control-affine models of the learned transition function according to the proposed switching law. Using the approximate model, control input produced by the underlying MBRL is perturbed by the $\mathcal{L}_1$ adaptive control, which is designed to enhance the robustness of the system against uncertainties. Importantly, this approach is agnostic to the choice of MBRL algorithm, enabling the use of the scheme with various MBRL algorithms. MBRL algorithms with $\mathcal{L}_1$ augmentation exhibit enhanced performance and sample efficiency across multiple MuJoCo environments, outperforming the original MBRL algorithms, both with and without system noise.

1 Introduction

Reinforcement learning (RL) combines stochastic optimal control with data-driven learning. Recent progress in deep neural networks (NNs) has enabled RL algorithms to make decisions in complex and dynamic environments (Wang et al., 2022). Reinforcement learning algorithms have achieved remarkable performance in a wide range of applications, including robotics (Ibarz et al., 2021; Nguyen & La, 2019), natural language processing (Wang et al., 2018; Wu et al., 2018), autonomous driving (Milz et al., 2018; Li et al., 2020), and computer vision (Yun et al., 2017; Wu et al., 2017).

There are two main approaches to reinforcement learning: Model-Free RL (MFRL) and MBRL. MFRL algorithms directly learn a policy to maximize cumulative reward from data, while MBRL algorithms first learn a model of the transition function and then use it to obtain optimal policies (Moerland et al., 2023). While MFRL algorithms have demonstrated impressive asymptotic performance, they often suffer from poor sample complexity (Mnih et al., 2015; Lillicrap et al., 2016; Schulman et al., 2017). On the other hand, MBRL algorithms offer superior sample complexity and are agnostic to tasks or rewards (Kocijan et al., 2004; Deisenroth et al., 2013). While MBRL algorithms have traditionally lagged behind MFRL algorithms in terms of asymptotic performance, recent approaches, such as the one presented by Chua et al. (2018), aim to bridge this gap.

In MBRL, learning a model of the transition function can introduce model (or epistemic) uncertainties due to the lack of sufficient data or data with insufficient information. Moreover, real-world systems are also subject to inherently random aleatoric uncertainties. As a result, unless sufficient data—both in quantity and quality—is available, the learned policies will exhibit a gap between expected and actual performance, commonly referred to as the sim-to-real gap (Zhao et al., 2020).

The field of robust and adaptive control theory has a rich history and was born out of a need to design controllers that address the uncertainties discussed above. Given that both MBRL algorithms and classical control tools depend on models of the transition function, it is natural to consider consolidating robust and adaptive control with MBRL. However, such a consolidation is far from straightforward, primarily because the two fields consider robustness for different classes of models. To analyze systems and design controllers for them, conventional control methods often assume extensively modeled dynamics that are gain scheduled, linear, control affine, and/or true up to parametric uncertainties (Neal et al., 2004; Nichols et al., 1993). On the other hand, MBRL algorithms frequently update highly nonlinear models (e.g., NNs) to enhance their predictive accuracy. The combination of this iterative updating and the model's high nonlinearity creates a unique challenge in embedding robust and adaptive controllers within MBRL algorithms.

1.1 Statement of contributions

We propose the $\mathcal{L}_1$-MBRL framework as an add-on scheme to augment MBRL algorithms, which offers improved robustness against epistemic and aleatoric uncertainties. We affinize the learned model in the control space according to the switching law to construct a control-affine model based on which the $\mathcal{L}_1$ control input is designed. The switching law design provides a distinct capability to explicitly control the predictive performance bound of the state predictor within the $\mathcal{L}_1$ adaptive control architecture while harnessing the robustness advantages offered by the $\mathcal{L}_1$ adaptive control. The $\mathcal{L}_1$ add-on does not require any modifications to the underlying MBRL algorithm, making it agnostic to the choice of the baseline MBRL algorithm. To evaluate the effectiveness of the $\mathcal{L}_1$-MBRL scheme, we conduct extensive numerical simulations using two baseline MBRL algorithms across multiple environments, including scenarios with action or observation noise. The results unequivocally demonstrate that the $\mathcal{L}_1$-MBRL scheme enhances the performance of the underlying MBRL algorithms without any redesign or retuning of the $\mathcal{L}_1$ controller from one scenario to another.

1.2 Related Work

Control Augmentation of RL policies is of significant relevance to our research. Notable recent studies in this area, including Cheng et al. (2022) and Arevalo-Castiblanco et al. (2021), have investigated the augmentation of adaptive controllers to policies learned through MFRL algorithms. However, these approaches are limited by their assumption of known nominal models and their restriction to control-affine or nonlinear models with known basis functions, which restricts their applicability to specific system types. In contrast, our approach does not assume any specific structure or knowledge of the nominal dynamics. We instead provide a general framework to augment an $\mathcal{L}_1$ adaptive controller to the learned policy, while simultaneously learning the transition function.

Robust and adversarial RL methods aim to enhance the robustness of RL policies by utilizing minimax optimization with adversarial perturbation, as seen in various studies (Tobin et al., 2017; Peng et al., 2018; Loquercio et al., 2019; Pinto et al., 2017). However, existing methods often involve modifications to data or dynamics in order to handle worst-case scenarios, leading to poor general performance. In contrast, our method offers a distinct advantage by enhancing robustness without perturbing the underlying MBRL algorithm. This allows us to improve the robustness of the baseline algorithm without sacrificing its general performance.

Meta-(MB)RL methods train models across multiple tasks to facilitate rapid adaptation to dynamic variations as proposed in (Finn et al., 2017; Nagabandi et al., 2019a; b). These approaches employ hierarchical latent variable models to handle non-stationary environments. However, they lack explicit provisions for uncertainty estimation or rejection, which can result in significant performance degradation when faced with uncertain conditions (Chen et al., 2021). In contrast, the $\mathcal{L}_1$-MBRL framework is purposefully designed to address this limitation through uncertainty estimation and explicit rejection. Importantly, our $\mathcal{L}_1$-MBRL method offers the potential for effective integration with meta-RL approaches, allowing for the joint leveraging of both methods to achieve both environment adaptation and uncertainty rejection in a non-stationary and noisy environment.

2 Preliminaries

In this section, we provide a brief overview of the two main components of our $\mathcal{L}_1$-MBRL framework: MBRL and the $\mathcal{L}_1$ adaptive control methodology.

2.1 Model Based Reinforcement Learning

This paper assumes a discrete-time finite-horizon Markov Decision Process (MDP), defined by the tuple $\mathcal{M} = (\mathcal{X}, \mathcal{U}, f, r, \rho_0, \gamma, H)$. Here $\mathcal{X} \subset \mathbb{R}^n$ is the compact state space, $\mathcal{U} \subset \mathbb{R}^m$ is the compact action space, $f: \mathcal{X} \times \mathcal{U} \rightarrow \mathcal{X}$ is the deterministic transition function, and $r: \mathcal{X} \times \mathcal{U} \rightarrow \mathbb{R}$ is a bounded reward function. Let $\xi(\mathcal{X})$ be the set of probability distributions over $\mathcal{X}$ and $\rho_0 \in \xi(\mathcal{X})$ be the initial state distribution. $\gamma$ is the discount factor and $H \in \mathbb{N}$ is a known horizon of the problem.
For any time step $t < H$, if $x_t \notin \mathcal{X}$ or $u_t \notin \mathcal{U}$, then the episode is considered to have failed such that $r(x_{t'}, u_{t'}) = 0$ for all $t' = t, t+1, \ldots, H$. A policy is denoted by $\pi$ and is specified as $\pi := [\pi_1, \ldots, \pi_{H-1}]$, where $\pi_t: \mathcal{X} \rightarrow \xi(\mathcal{U})$ and $\xi(\mathcal{U})$ is the set of probability distributions over $\mathcal{U}$.
The goal of RL is to find a policy that maximizes the expected sum of the reward along a trajectory $\tau := (x_0, u_0, \cdots, x_{H-1}, u_{H-1}, x_H)$, or formally, to maximize $J(\pi) = \mathbb{E}_{x_0 \sim \rho_0, u_t \sim \pi_t}\left[\sum_{t=1}^{H} \gamma^t r(x_t, u_t)\right]$, where $x_{t+1} - x_t = f(x_t, u_t)$ (Nagabandi et al., 2018). The trained model can be utilized in various ways to obtain the policy, as detailed in Sec. 3.1.
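To make the objective concrete, the following is a minimal sketch of a single-rollout estimate of the discounted return under the increment dynamics $x_{t+1} - x_t = f(x_t, u_t)$. It is illustrative only: the scalar state, the deterministic per-step policies, the box-shaped sets standing in for $\mathcal{X}$ and $\mathcal{U}$, and the choice to index the discounted sum from $t = 0$ over the steps at which an action is applied are all assumptions of the sketch, and the expectation over $\rho_0$ and $\pi_t$ is not evaluated.

```python
from typing import Callable, Sequence, Tuple


def discounted_return(
    f: Callable[[float, float], float],          # increment model: x_{t+1} - x_t = f(x_t, u_t)
    r: Callable[[float, float], float],          # bounded reward r(x_t, u_t)
    policy: Sequence[Callable[[float], float]],  # one deterministic map per time step (assumption)
    x0: float,
    gamma: float,
    x_bounds: Tuple[float, float] = (-10.0, 10.0),  # hypothetical compact state space X
    u_bounds: Tuple[float, float] = (-1.0, 1.0),    # hypothetical compact action space U
) -> float:
    """Discounted sum of rewards along one rollout (indexed from t = 0 in this sketch)."""
    x, total = x0, 0.0
    for t, pi_t in enumerate(policy):
        u = pi_t(x)
        in_x = x_bounds[0] <= x <= x_bounds[1]
        in_u = u_bounds[0] <= u <= u_bounds[1]
        if not (in_x and in_u):
            # Leaving X or U marks the episode as failed: all remaining rewards are zero.
            break
        total += (gamma ** t) * r(x, u)
        x = x + f(x, u)  # the transition function returns the state increment
    return total


# Toy usage: scalar integrator, proportional policy, quadratic penalty on the state.
H = 5
ret = discounted_return(
    f=lambda x, u: u,
    r=lambda x, u: -(x ** 2),
    policy=[lambda x: -0.5 * x] * H,
    x0=1.0,
    gamma=0.9,
)
```

Averaging this quantity over initial states drawn from $\rho_0$ and actions drawn from a stochastic policy would give a Monte Carlo estimate of $J(\pi)$.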

2.2 $\mathcal{L}_1$ Adaptive Control

The $\mathcal{L}_1$ adaptive control theory provides a framework to counter the uncertainties with guaranteed transient and steady-state performance, alongside robustness margins (Hovakimyan & Cao, 2010). The performance and reliability of the $\mathcal{L}_1$ adaptive control have been extensively tested on systems including robotic platforms (Cheng et al., 2022; Pravitra et al., 2020; Wu et al., 2022), NASA AirSTAR sub-scale aircraft (Gregory et al., 2009; 2010), and Learjet (Ackerman et al., 2017). While we give a brief description of the $\mathcal{L}_1$ adaptive control in this subsection, we refer the interested reader to Appendix A for a detailed explanation.

Assume that the continuous-time dynamics of a system can be represented as

$$\dot{x}(t) = g(x(t)) + h(x(t))u(t) + d(t, x(t), u(t)), \quad x(0) = x_0, \tag{1}$$

where $x(t) \in \mathbb{R}^n$ is the system state, $u(t) \in \mathbb{R}^m$ is the control input, $g: \mathbb{R}^n \rightarrow \mathbb{R}^n$ and $h: \mathbb{R}^n \rightarrow \mathbb{R}^{n \times m}$ are known nonlinear functions, and $d(t, x(t), u(t)) \in \mathbb{R}^n$ represents the unknown residual containing both the model uncertainties and the disturbances affecting the system.

Consider a desired control input $u^\star(t) \in \mathbb{R}^m$ and the induced desired state trajectory $x^\star(t) \in \mathbb{R}^n$ based on the nominal (uncertainty-free) dynamics

$$\dot{x}^\star(t) = g(x^\star(t)) + h(x^\star(t))u^\star(t), \quad x^\star(0) = x_0. \tag{2}$$

If we directly apply $u^\star(t)$ to the true system in Equation (1), the presence of the uncertainty $d(t, x(t), u(t))$ can cause the actual trajectory to diverge unquantifiably from the nominal trajectory. The $\mathcal{L}_1$ adaptive controller computes an additive control input $u_a(t)$ to ensure that the augmented input $u(t) = u^\star(t) + u_a(t)$ keeps the actual trajectory $x(t)$ bounded around the nominal trajectory $x^\star(t)$ in a quantifiable and uniform manner.

The $\mathcal{L}_1$ adaptive controller has three components: the state predictor, the adaptive law, and a low-pass filter. The state predictor is given by

$$\dot{\hat{x}}(t) = g(x(t)) + h(x(t))\left(u^\star(t) + u_a(t)\right) + \hat{\sigma}(t) + A_s \tilde{x}(t), \tag{3}$$

with the initial condition $\hat{x}(0) = \hat{x}_0$, where $\hat{x}(t) \in \mathbb{R}^n$ is the state of the predictor, $\hat{\sigma}(t)$ is the estimate of $d(t, x(t), u(t))$, $\tilde{x}(t) = \hat{x}(t) - x(t)$ is the state prediction error, and $A_s \in \mathbb{R}^{n \times n}$ is a Hurwitz matrix chosen by the user. Furthermore, $\hat{\sigma}(t)$ can be decomposed as

$$\hat{\sigma}(t) = h(x(t))\hat{\sigma}_m(t) + h^\perp(x(t))\hat{\sigma}_{um}(t), \tag{4}$$

where $\hat{\sigma}_m(t)$ and $\hat{\sigma}_{um}(t)$ are the estimates of the matched and unmatched uncertainties. Here, $h^\perp(x(t)) \in \mathbb{R}^{n \times (n-m)}$ is a matrix satisfying $h(x(t))^\top h^\perp(x(t)) = 0$ and $\text{rank}\left(\begin{bmatrix} h(x(t)), ~ h^\perp(x(t)) \end{bmatrix}\right) = n$. The existence of $h^\perp(x(t))$ is guaranteed, given $h(x(t))$ is a full-rank matrix. The role of the predictor is to produce the state estimate $\hat{x}(t)$ induced by the uncertainty estimate $\hat{\sigma}(t)$.
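The decomposition in Equation (4) can be illustrated numerically. The sketch below is one possible construction, not necessarily the one used by the authors: it assumes $h$ has full column rank, builds $h^\perp$ from the trailing left singular vectors of $h$, and recovers $\hat{\sigma}_m$ and $\hat{\sigma}_{um}$ by solving the square linear system formed by $[h, ~h^\perp]$. The function names and the numerical values are hypothetical.

```python
import numpy as np


def orthogonal_complement(h: np.ndarray) -> np.ndarray:
    """Return h_perp (n x (n-m)) with h.T @ h_perp = 0, assuming h (n x m) has full column rank."""
    _, m = h.shape
    # Full SVD: the last n - m left singular vectors span the orthogonal complement of range(h).
    U, _, _ = np.linalg.svd(h, full_matrices=True)
    return U[:, m:]


def split_uncertainty(h: np.ndarray, sigma_hat: np.ndarray):
    """Solve sigma_hat = h @ sigma_m + h_perp @ sigma_um for the matched and unmatched parts."""
    _, m = h.shape
    h_perp = orthogonal_complement(h)
    H_full = np.hstack([h, h_perp])            # n x n and invertible when h has full column rank
    coeffs = np.linalg.solve(H_full, sigma_hat)
    return coeffs[:m], coeffs[m:], h_perp


# Toy example with n = 3 states and m = 1 input (values are illustrative only).
h = np.array([[0.0], [1.0], [2.0]])
sigma_hat = np.array([0.3, 1.0, 0.5])
sigma_m, sigma_um, h_perp = split_uncertainty(h, sigma_hat)
```

Because $h^\top h^\perp = 0$, the matched part obtained this way coincides with the least-squares projection of $\hat{\sigma}$ onto the range of $h$.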

The uncertainty estimate is updated using the piecewise constant adaptive law given by

$$\hat{\sigma}(t) = \hat{\sigma}(iT_s) = -\Phi^{-1}(T_s)\mu(iT_s), \quad t \in [iT_s, (i+1)T_s), \quad i \in \mathbb{N}, \quad \hat{\sigma}(0) = \hat{\sigma}_0, \tag{5}$$

where $\Phi(T_s) = A_s^{-1}\left(\exp(A_s T_s) - \mathbb{I}_n\right)$, $\mu(iT_s) = \exp(A_s T_s)\tilde{x}(iT_s)$, and $T_s$ is the sampling time.
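As an illustration of Equation (5), the following sketch evaluates the piecewise constant adaptive law at one sampling instant. To keep the matrix exponential trivial, it assumes a diagonal Hurwitz matrix $A_s$ (a simplifying assumption of this sketch; a general $A_s$ would need a matrix exponential routine). The function name and the numbers are hypothetical.

```python
import numpy as np


def piecewise_constant_sigma_hat(a_diag, x_tilde, Ts: float) -> np.ndarray:
    """sigma_hat(i*Ts) = -Phi(Ts)^{-1} mu(i*Ts) for A_s = diag(a_diag) with all entries negative."""
    a = np.asarray(a_diag, dtype=float)
    if np.any(a >= 0.0):
        raise ValueError("A_s must be Hurwitz: every diagonal entry must be negative.")
    A_s = np.diag(a)
    exp_AsTs = np.diag(np.exp(a * Ts))  # matrix exponential of a diagonal matrix
    Phi = np.linalg.inv(A_s) @ (exp_AsTs - np.eye(a.size))
    mu = exp_AsTs @ np.asarray(x_tilde, dtype=float)
    return -np.linalg.solve(Phi, mu)


# Toy usage: prediction error x_tilde = x_hat - x sampled at t = i*Ts.
a_diag = np.array([-10.0, -5.0])
x_tilde = np.array([0.01, -0.02])
Ts = 0.01
sigma_hat = piecewise_constant_sigma_hat(a_diag, x_tilde, Ts)
```

The estimate is held constant over $[iT_s, (i+1)T_s)$ and recomputed from the new prediction error at the next sampling instant.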

Finally, the control input is given by

$$u(t) = u^\star(t) + u_a(t), \quad u_a(s) = -C(s)\mathfrak{L}\left[\hat{\sigma}_m(t)\right], \tag{6}$$

where $\mathfrak{L}[\cdot]$ denotes the Laplace transform, and the $\mathcal{L}_1$ input $u_a(t)$ is the output of the low-pass filter $C(s)$ in response to the estimate of the matched component $\hat{\sigma}_m(t)$. The bandwidth of the low-pass filter is chosen to satisfy the small-gain stability condition (Wang & Hovakimyan, 2011).
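The filtering step in Equation (6) can be sketched in discrete time as follows. The text does not fix the structure of $C(s)$, so the sketch assumes a first-order filter $C(s) = \omega/(s + \omega)$ with bandwidth $\omega$, applied to a single input channel, and discretized exactly under the assumption that $\hat{\sigma}_m$ is held constant over each sampling interval. The names `omega` and `lowpass_l1_input` are hypothetical.

```python
import math
from typing import List, Sequence


def lowpass_l1_input(
    sigma_m_samples: Sequence[float],
    omega: float,
    Ts: float,
    u_a0: float = 0.0,
) -> List[float]:
    """Sampled response of u_a(s) = -C(s) sigma_m(s) with C(s) = omega / (s + omega)."""
    alpha = math.exp(-omega * Ts)  # pole of the exactly discretized first-order filter
    u_a, trajectory = u_a0, []
    for sigma_m in sigma_m_samples:
        # du_a/dt = -omega * u_a - omega * sigma_m, integrated exactly over one sample.
        u_a = alpha * u_a - (1.0 - alpha) * sigma_m
        trajectory.append(u_a)
    return trajectory


# Toy usage: a constant matched uncertainty estimate of 1.0 is cancelled asymptotically.
u_a_traj = lowpass_l1_input([1.0] * 500, omega=20.0, Ts=0.01)
```

With a constant estimate, $u_a$ converges to its negative, so the filtered input cancels the matched uncertainty only within the filter bandwidth; faster components are attenuated rather than fed back.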

3 The $\mathcal{L}_1$-MBRL Algorithm

In this section, we present the $\mathcal{L}_1$-MBRL algorithm, illustrated in Fig. 1. We first explain a standard MBRL algorithm and describe our method to integrate $\mathcal{L}_1$ adaptive control with it.

Figure 1: $\mathcal{L}_1$-MBRL Framework. The policy box $\pi_\phi(\cdot|x_t)$ includes policy update and control input sampling for each time step. Although this figure illustrates an on-policy MBRL setting with a parameterized $\pi_\phi$ to provide a simple visualization, the framework is not limited to this class and can also be applied to off-policy algorithms or to algorithms without a parameterized policy.

3.1 Standard MBRL

As our work aims to develop an add-on module that enhances the robustness of an existing MBRL algorithm, we provide a high-level overview of a standard MBRL algorithm and its popular complementary techniques.

The standard structure of MBRL algorithms involves the following steps: data collection, model updating, and policy updating using the updated model. To reduce model bias, many recent results consider an ensemble of models $\{\hat{f}_{\theta_i}\}_{i=1,2,\cdots,M}$, $M \in \mathbb{N}$, on the data set $\mathcal{D}$. The ensemble model $\hat{f}_{\theta_i}$ is trained by minimizing the following loss function for each $\theta_i$ (Nagabandi et al., 2018):

$$\frac{1}{|\mathcal{D}|}\sum_{(x_t, u_t, x_{t+1}) \in \mathcal{D}} \left\|(x_{t+1} - x_t) - \hat{f}_{\theta_i}(x_t, u_t)\right\|_2^2. \tag{7}$$

In this paper, we consider (7) as the loss function used to train the baseline MBRL algorithm among other possibilities. This is only for convenience in explaining the $\mathcal{L}_1$ augmentation in Section 3.2, and appropriate adjustments can be readily made for the augmentation upon the choice of different loss functions.
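For concreteness, the loss in Equation (7) can be evaluated as in the sketch below. It uses a hypothetical linear increment model in place of a neural network and synthetic transition data, both of which are assumptions made only to keep the example self-contained; it shows the loss computation and does not perform any training.

```python
import numpy as np


def increment_loss(model, X: np.ndarray, U: np.ndarray, X_next: np.ndarray) -> float:
    """Mean over the data set of ||(x_{t+1} - x_t) - f_hat(x_t, u_t)||_2^2, as in Eq. (7)."""
    residual = (X_next - X) - model(X, U)
    return float(np.mean(np.sum(residual ** 2, axis=1)))


def make_linear_model(A: np.ndarray, B: np.ndarray):
    """Hypothetical stand-in for one ensemble member: f_hat(x, u) = A x + B u (batched)."""
    return lambda X, U: X @ A.T + U @ B.T


# Synthetic data set D = {(x_t, u_t, x_{t+1})} generated by a known linear increment map.
rng = np.random.default_rng(0)
A_true = np.array([[0.0, 0.1], [-0.1, 0.0]])
B_true = np.array([[0.0], [0.1]])
X = rng.normal(size=(64, 2))
U = rng.normal(size=(64, 1))
X_next = X + X @ A_true.T + U @ B_true.T

exact_loss = increment_loss(make_linear_model(A_true, B_true), X, U, X_next)
biased_loss = increment_loss(make_linear_model(A_true + 0.05, B_true), X, U, X_next)
```

Each ensemble member would be fitted by minimizing this quantity over its own parameters $\theta_i$, typically with different initializations or mini-batch orderings.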

Besides the loss function, methods like random initialization of parameters, varying model architectures, and mini-batch shuffling are widely used to reduce the correlation among the outputs of different models in the ensemble. Further, various standard techniques including early stopping, input/output normalization, and weight normalization can be used to avoid overfitting.

Once the model is learned, control input can be computed by any of the following options: 1) using the learned dynamics as a simulator to generate fictitious samples (Kurutach et al., 2018; Clavera et al., 2018), 2) leveraging the derivative of the model for policy search (Levine & Koltun, 2013; Heess et al., 2015), or 3) applying the Model Predictive Controller (MPC) (Nagabandi et al., 2018; Chua et al., 2018). We highlight here that our proposed method is agnostic to the use of particular techniques or the choice of the policy optimizer.
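As an illustration of the third option, the sketch below implements a simple random-shooting MPC step on top of a learned increment model. Random shooting is only one possible planner and is chosen here as an assumption for brevity; the function name, the batched `model` and `reward` signatures, and the toy dynamics are hypothetical.

```python
import numpy as np


def random_shooting_mpc(model, reward, x0, horizon, n_candidates, u_low, u_high, rng):
    """Return the first action of the best sampled action sequence and all candidate returns."""
    m = u_low.shape[0]
    # Sample candidate action sequences uniformly from the box [u_low, u_high).
    U = rng.uniform(u_low, u_high, size=(n_candidates, horizon, m))
    x = np.repeat(x0[None, :], n_candidates, axis=0)
    returns = np.zeros(n_candidates)
    for t in range(horizon):
        returns += reward(x, U[:, t])
        x = x + model(x, U[:, t])  # the learned model predicts the state increment
    best = int(np.argmax(returns))
    return U[best, 0], returns


# Toy usage: scalar integrator model and a quadratic penalty on the state.
rng = np.random.default_rng(0)
u_low, u_high = np.array([-1.0]), np.array([1.0])
u0, returns = random_shooting_mpc(
    model=lambda x, u: u,
    reward=lambda x, u: -np.sum(x ** 2, axis=1),
    x0=np.array([1.0]),
    horizon=5,
    n_candidates=256,
    u_low=u_low,
    u_high=u_high,
    rng=rng,
)
```

Only the first action of the selected sequence is applied, and the optimization is repeated from the next observed state.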

3.2 $\mathcal{L}_1$ Augmentation

Let the true dynamics in discrete time be given by

$$\Delta x_{t+1} = f(x_t, u_t) + w(t, x_t, u_t), \quad \Delta x_{t+1} \triangleq x_{t+1} - x_t, \tag{8}$$

where the transition function $f$ and the system uncertainty $w$ are unknown. Let $\hat{f}_{\theta} := \frac{1}{M}\sum_{i=1}^{M} f_{\theta_i}$ be the mean of the ensemble model trained using the loss function in Equation (7). Then, we express

$$\Delta\bar{x}_{t+1} = \hat{f}_{\theta}(x_t, u_t), \quad \Delta\bar{x}_{t+1} \triangleq \bar{x}_{t+1} - x_t, \tag{9}$$

where $\bar{x}_{t+1}$ indicates the estimate of the next state evaluated with $\hat{f}_{\theta}(x_t, u_t)$. In MBRL, such transition functions are typically modeled using fully nonlinear function approximators like NNs. However, as discussed in Sec. 2.2, the nominal model must be represented in control-affine form to apply $\mathcal{L}_1$ adaptive control. A common approach to obtaining a control-affine model is to restrict the model structure to the control-affine class (Khojasteh et al., 2020; Taylor et al., 2019; Choi et al., 2020). For NN models, this involves training two NNs $g_{\theta}$ and $h_{\theta}$, such that Equation (9) becomes $\Delta\bar{x}_{t+1} = g_{\theta}(x_t) + h_{\theta}(x_t)u_t$.
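The structurally control-affine parameterization described above can be sketched as follows. This is a minimal illustration with randomly initialized (untrained) NumPy networks; the layer sizes, the `tanh` activation, and all function names are assumptions made for the example and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, hidden = 3, 2, 16  # state dim, input dim, hidden width (all illustrative)

def init_mlp(in_dim, out_dim):
    """Random weights for a one-hidden-layer network (hypothetical architecture)."""
    return {
        "W1": rng.normal(scale=0.5, size=(hidden, in_dim)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(scale=0.5, size=(out_dim, hidden)),
        "b2": np.zeros(out_dim),
    }

def mlp(params, x):
    z = np.tanh(params["W1"] @ x + params["b1"])
    return params["W2"] @ z + params["b2"]

g_params = init_mlp(n, n)      # g_theta: R^n -> R^n
h_params = init_mlp(n, n * m)  # h_theta: R^n -> R^{n x m}, flattened

def control_affine_delta(x, u):
    """State increment g_theta(x) + h_theta(x) u; affine in u by construction."""
    g = mlp(g_params, x)
    h = mlp(h_params, x).reshape(n, m)
    return g + h @ u

x = rng.normal(size=n)
u = rng.normal(size=m)
x_bar_next = x + control_affine_delta(x, u)  # next-state estimate, as in Eq. (9)
```

Because the input enters only through the product $h_{\theta}(x_t)u_t$, the model is affine in $u_t$ for any weights, which is exactly the restriction whose cost is examined next.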

While control-affine models are used for their tractability and direct connection to control-theoretic methods, they are inherently limited in their representational power compared to fully nonlinear models, and hence, their use in an MBRL algorithm can result in reduced performance.

Figure 2: Comparison of performance between fully nonlinear and control-affine models on the Halfcheetah environment using METRPO. The control-affine model failed to learn the Halfcheetah dynamics.

To study the level of compromise on the performance, we compare fully nonlinear models with control-affine models in the Halfcheetah environment for METRPO (Kurutach et al., 2018), where the size of each implicit layer of the control-affine model's $g_{\theta}$ and $h_{\theta}$ is chosen to match that of the fully nonlinear $\hat{f}_{\theta}$. The degraded performance of the control-affine model depicted in Fig. 2 can be primarily attributed to intricate nonlinearities in the environment.

Although using the above naive control-affine model can be convenient, it sacrifices the capabilities of the underlying MBRL algorithm. To avoid such limitations, we adopt an alternative approach inspired by Guided Policy Search (Levine & Koltun, 2013): we apply a control-affine transformation to the fully nonlinear dynamics multiple times according to a predefined switching law. Specifically, we apply the first-order Taylor series approximation around the operating input $\bar{u}$:

$$\begin{aligned}
\hat{f}_{\theta}(x_t, u_t) &\approx \hat{f}_{\theta}(x_t, \bar{u}) + \left(\left[\nabla_u \hat{f}_{\theta}(x_t, u)\right]_{u=\bar{u}}\right)^{\top}(u_t - \bar{u}) \\
&= \underbrace{\hat{f}_{\theta}(x_t, \bar{u}) - \left(\left[\nabla_u \hat{f}_{\theta}(x_t, u)\right]_{u=\bar{u}}\right)^{\top}\bar{u}}_{\triangleq~g_{\theta}(x_t)} + \underbrace{\left(\left[\nabla_u \hat{f}_{\theta}(x_t, u)\right]_{u=\bar{u}}\right)^{\top}}_{\triangleq~h_{\theta}(x_t)} u_t \triangleq \hat{f}^{a}_{\theta}(x_t, u_t; \bar{u}).
\end{aligned} \tag{10}$$

Here, the superscript $a$ indicates the affine approximation of $\hat{f}_{\theta}$. The term affinization in this paper is distinguished from linearization, which linearizes the function with respect to both $x_t$ and $u_t$ such that $\bar{x}_{t+1} \simeq Ax_t + Bu_t$ for some constant matrices $A$ and $B$. Since most controlled systems have more states than control inputs, the affinized model is a significantly more accurate approximation of the nominal dynamics compared to the linearized model.
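The affinization in Equation (10) can be illustrated numerically. The sketch below uses a central finite-difference Jacobian in place of automatic differentiation through a learned network, and a hand-written toy function `f_hat` in place of $\hat{f}_{\theta}$; both are assumptions made for the example, as are the function names. Here `h` is the $n \times m$ Jacobian of the model with respect to $u$, which corresponds to the transposed gradient in Equation (10).

```python
import numpy as np

def jacobian_u(f, x, u, eps=1e-6):
    """Central finite-difference Jacobian of f(x, u) w.r.t. u, shape (n, m)."""
    u = np.asarray(u, dtype=float)
    cols = []
    for j in range(u.size):
        du = np.zeros_like(u)
        du[j] = eps
        cols.append((f(x, u + du) - f(x, u - du)) / (2.0 * eps))
    return np.stack(cols, axis=1)

def affinize(f, x, u_bar):
    """Return (g, h) of Eq. (10) at state x around the operating input u_bar."""
    h = jacobian_u(f, x, u_bar)
    g = f(x, u_bar) - h @ u_bar
    return g, h

def f_hat(x, u):
    """Toy stand-in for the learned model; nonlinear in both x and u."""
    return np.array([np.sin(x[0]) + u[0] ** 2, x[1] * np.cos(u[1]) + u[0]])

x = np.array([0.3, -1.0])
u_bar = np.array([0.5, 0.2])
g, h = affinize(f_hat, x, u_bar)

u = np.array([0.55, 0.15])
affine_pred = g + h @ u                                # f_hat^a(x, u; u_bar)
approx_err = np.linalg.norm(affine_pred - f_hat(x, u)) # small near u_bar
```

Note that `g` and `h` depend on the current state, so they are re-evaluated at each $x_t$ while $\bar{u}$ is held fixed; only the dependence on $u_t$ is approximated.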

Indeed, the control-affine model $\hat{f}^{a}_{\theta}$ is only a good approximation of $\hat{f}_{\theta}$ around $\bar{u}$. When the control input deviates considerably from $\bar{u}$, the quality of the approximation deteriorates. To handle this, we produce the next approximation model when the following switching condition holds:

$$\|\hat{f}^{a}_{\theta}(x_t, u_t; \bar{u}) - \hat{f}_{\theta}(x_t, u_t)\| \geq \epsilon_a. \tag{11}$$

Here, $\|\cdot\|$ indicates the vector norm, and $\epsilon_a$ is the model tolerance hyperparameter chosen by the user. Note that as $\epsilon_a \rightarrow 0$, we make an affine approximation at every point in the input space and we retrieve the original non-affine function $\hat{f}_{\theta}$.

Remark 1.

Although a more intuitive choice for the switching condition would be $\|u_t - \bar{u}\| > \epsilon_a$, we adopt an implicit switching condition (Equation (11)) to explicitly control the acceptable level of prediction error between $\hat{f}^{a}_{\theta}$ and $\hat{f}_{\theta}$ by specifying the threshold $\epsilon_a$. This approach prevents significant deviation in the performance of the underlying MBRL algorithm, and it is instrumental in establishing the theoretical guarantees of the uncertainty estimation (see Section 3.3).

Given a locally valid control-affine model $\hat{f}^{a}_{\theta}$, we can proceed with the design of the $\mathcal{L}_1$ input by utilizing the discrete-time version of the controller presented in Sec. 2.2. In particular, the state predictor, the adaptation law, and the control law are given by

$$\hat{x}_{t+1} = \hat{x}_t + \Delta\hat{x}_t \triangleq \hat{x}_t + \hat{f}^{a}_{\theta}(x_t, u_t) + \hat{\sigma}_t + (A_s\tilde{x}_t)\Delta t, \quad \hat{x}_0 = x_0, \quad \tilde{x}_t = \hat{x}_t - x_t, \tag{12a}$$

$$\hat{\sigma}_t = -\Phi^{-1}\mu_t, \tag{12b}$$

$$q_t = q_{t-1} + (-Kq_{t-1} + K\hat{\sigma}_{m,t-1})\Delta t, \quad q_0 = 0, \quad u_{a,t} = -q_t, \tag{12c}$$

where $K \succ 0$ is an $m \times m$ symmetric matrix that characterizes the first-order low-pass filter $C(s) = K(s\mathbb{I}_m + K)^{-1}$, discretized in time. Note that Equations (3)-(6) can be considered as zero-order-hold continuous-time signals of the discrete signals produced by Equation (12). As such, $\hat{\sigma}_t$ and $\mu_t$ are defined analogously to their counterparts in the continuous-time definitions. In our setting, where prior information about the desired input signal frequency is unavailable, an unbiased choice is to set $K = \omega\mathbb{I}_m$, where $\omega$ is the chosen cutoff frequency. The sampling time $T_s$ is set to $\Delta t$, which corresponds to the time interval at which the baseline MBRL algorithm operates. The algorithm for the $\mathcal{L}_1$ adaptive control is presented in Algorithm 1.
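To make the filter recursion in Equation (12c) concrete, the sketch below iterates it with $K = \omega\mathbb{I}_m$ for a constant matched-uncertainty estimate. The numerical values and the function name `lowpass_step` are assumptions chosen for illustration. With a constant input, the recursion contracts the gap between $q_t$ and the input by a factor $(1 - \omega\Delta t)$ per step, so this forward-Euler form settles only when $\omega\Delta t < 2$.

```python
import numpy as np

def lowpass_step(q_prev, sigma_m_prev, omega, dt):
    """One step of Eq. (12c) with K = omega * I (forward-Euler discretization)."""
    return q_prev + (-omega * q_prev + omega * sigma_m_prev) * dt

omega, dt = 5.0, 0.05                # illustrative cutoff frequency and time step
sigma_m = np.array([1.0, -0.5])      # constant matched-uncertainty estimate (toy)

q = np.zeros(2)                      # q_0 = 0
history = []
for _ in range(200):
    q = lowpass_step(q, sigma_m, omega, dt)
    history.append(q.copy())

u_a = -q                             # L1 input cancels the filtered estimate
```

A larger $\omega$ lets $u_{a,t}$ react faster to changes in the estimate, at the price of passing more high-frequency content into the control channel.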

Data: Initialize $\{\hat{x}_t, \hat{\sigma}_t\} \leftarrow 0$
Set $\omega$ for $K$ in Equation (12c)
Function Control($u_{RL,t}$, $x_t$, $\hat{f}^{a}_{\theta}$):
       Prediction error update: $\tilde{x}_t \leftarrow \hat{x}_t - x_t$
       Uncertainty estimate $\hat{\sigma}_t$ update: Equation (12b)
       Compute $u_{a,t}$ (Equation (12c))
       $u_t \leftarrow u_{RL,t} + u_{a,t}$
       Update $\hat{x}_{t+1}$ (Equation (12a))
       return $u_t$
End Function
Algorithm 1 $\mathcal{L}_1$ Adaptive Control

As the underlying MBRL algorithm updates its model $\hat{f}_{\theta}$, the corresponding control-affine model $\hat{f}^{a}_{\theta}$ and $\mathcal{L}_1$ control input $u_a$ are updated sequentially (Algorithm 1). By incorporating the $\mathcal{L}_1$ control augmentation into the MBRL algorithm, we obtain the $\mathcal{L}_1$-MBRL algorithm, as outlined in Algorithm 2. Note that in this algorithm we add $u_{RL,t}$ instead of $u_t$ to the dataset. Intuitively, this is to learn the nominal dynamics that remain after uncertainties are compensated by $u_{a,t}$. A similar approach has been employed previously in (Wang & Ba, 2020, Appendix A.1).

Set $\mathcal{D} \leftarrow \emptyset$, $\{\hat{x}_t, \hat{\sigma}_t\} \leftarrow 0$
Initialize $\epsilon, \pi_{\phi}, \hat{f}_{\theta}, \omega, x_0$
repeat
       for Episodes $N_e \in \mathbb{N}$ do
             $\bar{u} \leftarrow \texttt{None}$
             for Horizon $H \in \mathbb{N}$ do
                   $u_{RL,t} \sim \pi_{\phi}(\cdot \mid x_t)$
                   if $\bar{u}$ is None or Equation (11) then
                         $\bar{u} \leftarrow u_{RL,t}$
                         compute $\hat{f}^{a}_{\theta}$ via Equation (10)
                   $u_t \leftarrow \texttt{Control}$ in Algo. 1
                   Execute $u_t$ and $\mathcal{D} \leftarrow (x_t, u_{RL,t}, x_{t+1})$

       Update model(s) $\hat{f}_{\theta}$ using $\mathcal{D}$
       Update policy $\pi_{\phi}$
until the average return converges
Algorithm 2 $\mathcal{L}_1$-MBRL Algorithm

Our $\mathcal{L}_1$-MBRL framework makes a control-affine approximation of the learned dynamics, which is itself an approximation of the ground truth dynamics. Such layers of approximations may amplify errors, which may degrade the effect of the $\mathcal{L}_1$ augmentation. In this section, we prove that the $\mathcal{L}_1$ prediction error is bounded, and subsequently, that the $\mathcal{L}_1$ controller effectively compensates for uncertainties. To this end, we conduct a continuous-time analysis of the system that is controlled via the $\mathcal{L}_1$-MBRL framework, which operates in sampled time (Åström & Wittenmark, 2013). It is important to note that our adaptation law (Equation (5)) holds the estimate over each time interval (zero-order hold), converting discrete estimates obtained from the MBRL algorithm into a continuous-time signal. This choice of adaptation law ensures that the $\mathcal{L}_1$ augmentation is compatible with the discrete MBRL setup, providing the basis for the following analysis.

3.3 Theoretical analysis

Consider the nonlinear (unknown) continuous-time counterpart of Equation (8)

$$\dot{x}(t) = F(x(t), u(t)) + W(t, x(t), u(t)), \tag{13}$$

where $F: \mathcal{X} \times \mathcal{U} \rightarrow \mathbb{R}^{n}$ is a fully nonlinear function defining the vector field. Note that unlike the system in Equation (1), we do not make any assumptions on $F(x(t), u(t))$ being control-affine. Furthermore, recall from Sec. 2.1 that the sets $\mathcal{X} \subset \mathbb{R}^{n}$ and $\mathcal{U} \subset \mathbb{R}^{m}$, over which the MBRL experiments take place, are compact. Additionally, $W(t, x(t), u(t)) \in \mathbb{R}^{n}$ represents the disturbance perturbing the system. As before, we denote by $\hat{F}_{\theta}(x(t), u(t))$ the approximation of $F(x(t), u(t))$, and its affine approximation as

$$\hat{F}^{a}_{\theta}(x(t), u(t)) = G_{\theta}(x(t)) + H_{\theta}(x(t))u(t). \tag{14}$$

Subsequently, we define the residual error $l(t, x(t), u(t))$ and an affinization error $a(x(t), u(t))$ as

$$\begin{aligned}
l(t, x(t), u(t)) &\triangleq F(x(t), u(t)) + W(t, x(t), u(t)) - \hat{F}_{\theta}(x(t), u(t)), \\
a(x(t), u(t)) &\triangleq \hat{F}_{\theta}(x(t), u(t)) - \hat{F}^{a}_{\theta}(x(t), u(t)).
\end{aligned}$$

Note that $\|a(x(t), u(t))\| \leq \epsilon_a$ in the $\mathcal{L}_1$-MBRL framework by Equation (11).

We pose the following assumptions.

Assumption 1.

  1. The functions $F(x(t), u(t))$ and $W(t, x(t), u(t))$ are Lipschitz continuous over $\mathcal{X} \times \mathcal{U}$ and $[0, t_{\max}) \times \mathcal{X} \times \mathcal{U}$, respectively, for $0 < t_{\max} \leq H$, where $H$ is a known finite time horizon of the episode. The learned model $\hat{F}_{\theta}(x(t), u(t))$ is Lipschitz continuous in $\mathcal{X}$, and continuously differentiable ($\mathcal{C}^{1}$) in $\mathcal{U}$ (more precisely, $\mathcal{C}^{1}$ everywhere except on finite sets of measure zero).

  2. The residual error of the learned model is uniformly bounded over $(t, x, u) \in [0, t_{\max}) \times \mathcal{X} \times \mathcal{U}$:

    $$\|F(x(t), u(t)) + W(t, x(t), u(t)) - \hat{F}_{\theta}(x(t), u(t))\| \leq \epsilon_l, \tag{15}$$

    where the bound $\epsilon_l$ is assumed to be known.

See Appendix B for remarks on this assumption.

Next, we set

\[ u(t)=u^{\star}(t)+u_{a}(t), \tag{16} \]

where $u^{\star}(t)$ is the continuous-time signal converted from the discrete control input produced by the underlying MBRL, and $u_{a}(t)$ is the $\mathcal{L}_{1}$ control input. As described in Section 2.2, the $\mathcal{L}_{1}$ controller estimates the uncertainty by following the piecewise constant adaptive law (Equation (5)). Now, we aim to evaluate the estimation error $e(t,x(t),u(t))$:

\begin{align}
e(t,x(t),u(t)) &\triangleq l(t,x(t),u(t))+a(x(t),u(t))-\hat{\sigma}(t) \notag\\
&= F(x(t),u(t))+W(t,x(t),u(t))-G_{\theta}(x(t))-H_{\theta}(x(t))u(t)-\hat{\sigma}(t). \tag{17}
\end{align}
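
As a term-by-term check of the second equality in Equation (17), the short derivation below assumes that $l$ is the learning error bounded in Equation (15) and that $a$ is the affinization error between the learned model and its control-affine representation. Both definitions appear earlier in the paper and are restated here only as assumptions; the time argument of $x$ and $u$ is dropped for brevity.

```latex
\begin{align*}
l(t,x,u) &= F(x,u) + W(t,x,u) - \hat{F}_{\theta}(x,u), \\
a(x,u) &= \hat{F}_{\theta}(x,u) - G_{\theta}(x) - H_{\theta}(x)u, \\
l(t,x,u) + a(x,u) &= F(x,u) + W(t,x,u) - G_{\theta}(x) - H_{\theta}(x)u .
\end{align*}
```

Under these assumptions the learned model $\hat{F}_{\theta}$ cancels in the sum, and subtracting $\hat{\sigma}(t)$ gives the right-hand side of Equation (17). The estimate $\hat{\sigma}(t)$ is therefore compared against the total mismatch between the true dynamics and the control-affine model.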

Our interest in evaluating the estimation error is articulated in Remark 2, Appendix C, where we also provide proof of the following theorem. We note here that the sets $\mathcal{X}$ and $\mathcal{U}$ in the following result are compact due to the inherent nature of the MBRL algorithm, as described in Sec. 2.1.

Theorem 1.

Consider the system described by Equation (13), and its learned control-affine representation in Equation (14). Additionally, assume that the system is operating under the augmented feedback control presented in Equation (16). Let $A_{s}=\texttt{diag}\{\lambda_{1},\dots,\lambda_{n}\}$ be the Hurwitz matrix that is used in the definition of the state predictor (Equation (3)). If Assumption 1 holds, then the estimation error defined in Equation (17) satisfies $\|e(t,x(t),u(t))\|\leq\epsilon_{l}+\epsilon_{a}$, $\forall t\in[0,T_{s})$, and

\[ \|e(t,x(t),u(t))\| = 2\epsilon_{a}+\mathcal{O}(T_{s}) \quad \forall t\in[T_{s},t_{\max}), \]

where $0<T_{s}<t_{\max}\leq H<\infty$, and $H$ is the known bounded horizon (see Sec. 2.1).
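
To build intuition for the $\mathcal{O}(T_{s})$ term in Theorem 1, the following Python sketch simulates a scalar toy problem. It assumes the standard piecewise-constant adaptation law from the $\mathcal{L}_{1}$ literature, $\hat{\sigma}(t)=-\Phi^{-1}(T_{s})e^{A_{s}T_{s}}\tilde{x}(iT_{s})$ with $\Phi(T_{s})=A_{s}^{-1}(e^{A_{s}T_{s}}-I)$, which may differ in detail from Equations (3) and (5) (not reproduced in this section). The uncertainty signal $\sin(t)$, the choice $A_{s}=-1$, the Euler integration, and the function name are hypothetical. The toy has no affinization error, so only the dependence on the sampling time is illustrated.

```python
import math


def max_estimation_error(Ts, a=-1.0, t_end=5.0, substeps=200):
    """Return max |sigma(t) - sigma_hat(t)| over t >= Ts for a scalar toy problem.

    Only the prediction error x_tilde = x_hat - x is simulated, because the
    known dynamics cancel between the plant and the state predictor.
    """
    sigma = math.sin                      # hypothetical uncertainty signal
    phi = (math.exp(a * Ts) - 1.0) / a    # Phi(Ts) = a^{-1} (e^{a Ts} - 1)
    gain = -math.exp(a * Ts) / phi        # sigma_hat = gain * x_tilde(i * Ts)
    dt = Ts / substeps                    # Euler step inside one sampling period
    x_tilde, worst = 0.0, 0.0
    n_periods = int(round(t_end / Ts))
    for i in range(n_periods):
        sigma_hat = gain * x_tilde        # held constant on [i*Ts, (i+1)*Ts)
        for k in range(substeps):
            t = i * Ts + k * dt
            if i >= 1:                    # the theorem's second interval, t >= Ts
                worst = max(worst, abs(sigma(t) - sigma_hat))
            # prediction-error dynamics: d/dt x_tilde = a*x_tilde + sigma_hat - sigma
            x_tilde += dt * (a * x_tilde + sigma_hat - sigma(t))
    return worst


if __name__ == "__main__":
    for Ts in (0.1, 0.01):
        print(f"Ts = {Ts}: max estimation error = {max_estimation_error(Ts):.4f}")
```

In this toy setting the worst-case error after the first sampling period shrinks as $T_{s}$ is reduced, which is the qualitative behavior the $\mathcal{O}(T_{s})$ term describes.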

4 Simulation Results

In this section, we demonstrate the efficacy of our proposed $\mathcal{L}_{1}$-MBRL framework using METRPO (Kurutach et al., 2018) and MBMF (Nagabandi et al., 2018) as the baseline MBRL methods (see footnote 3), and we defer the $\mathcal{L}_{1}$-MBMF derivations and details to Appendix D. We briefly note here that $\mathcal{L}_{1}$ augmentation of MBMF yielded a performance improvement similar to that observed for $\mathcal{L}_{1}$-METRPO.

Footnote 3: METRPO and MBMF conform to the standard MBRL framework but employ different strategies for control optimization (refer to Section 2.1). Selecting these baselines demonstrates that our framework is agnostic to various control optimization methods, illustrating its functionality as a versatile add-on module.

In our first experimental study, we evaluate the proposed $\mathcal{L}_{1}$-MBRL framework on five different OpenAI Gym environments (Brockman et al., 2016) with varying levels of state and action complexity. For each environment, we report the mean and standard deviation of the average reward per episode across multiple random seeds. Additionally, we incorporate noise into the observation ($\sigma_{o}=0.1$) or action ($\sigma_{a}=0.1$) by sampling from a uniform distribution (Wang et al., 2019). This enables us to evaluate the impact of noise on MBRL performance. The results are summarized in Table 1. Further details of the experiment setup are provided in Appendix D.
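
The noise injection and the per-seed reporting described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes that the noise level $\sigma$ is the half-width of a zero-mean uniform distribution on $[-\sigma,\sigma]$ and that the reported spread is the sample standard deviation across seeds. The function names and the example values are hypothetical.

```python
import random
import statistics


def add_uniform_noise(values, sigma, rng):
    """Perturb each component with noise drawn uniformly from [-sigma, sigma].

    Assumption: sigma is the half-width of the uniform distribution.
    """
    return [v + rng.uniform(-sigma, sigma) for v in values]


def summarize_over_seeds(per_seed_rewards):
    """Return (mean, sample standard deviation) of per-seed average rewards."""
    return statistics.mean(per_seed_rewards), statistics.stdev(per_seed_rewards)


if __name__ == "__main__":
    rng = random.Random(0)                      # fixed seed for reproducibility
    observation = [0.5, -1.2, 0.0]              # hypothetical observation vector
    noisy = add_uniform_noise(observation, sigma=0.1, rng=rng)
    print("noisy observation:", noisy)
    print("summary:", summarize_over_seeds([1.0, 2.0, 3.0]))
```

The same helper can be applied to the action vector to emulate action noise with $\sigma_{a}=0.1$.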

Table 1: Performance comparison between METRPO and $\mathcal{L}_{1}$-METRPO (Ours). The average performance and standard deviation over multiple seeds are evaluated for a window size of 3000 timesteps at the end of the training. Higher performance cases are marked in bold.

| Env. | Noise-free: METRPO | Noise-free: $\mathcal{L}_{1}$-METRPO | $\sigma_{a}=0.1$: METRPO | $\sigma_{a}=0.1$: $\mathcal{L}_{1}$-METRPO | $\sigma_{o}=0.1$: METRPO | $\sigma_{o}=0.1$: $\mathcal{L}_{1}$-METRPO |
| Inv. P. | $-51.3 \pm 67.8$ | $\mathbf{-0.0 \pm 0.0}$ | $-105.2 \pm 81.6$ | $\mathbf{-0.0 \pm 0.0}$ | $-74.22 \pm 74.5$ | $\mathbf{-21.3 \pm 20.7}$ |
| Swimmer | $309.5 \pm 49.3$ | $\mathbf{313.8 \pm 18.7}$ | $258.7 \pm 113.7$ | $\mathbf{322.7 \pm 5.3}$ | $30.7 \pm 56.1$ | $\mathbf{79.2 \pm 85.0}$ |
| Hopper | $1140.1 \pm 552.4$ | $\mathbf{1491.4 \pm 623.8}$ | $609.0 \pm 793.5$ | $\mathbf{868.7 \pm 735.8}$ | $-1391.2 \pm 266.5$ | $\mathbf{-486.6 \pm 459.9}$ |
| Walker | $\mathbf{-6.6 \pm 0.3}$ | $-6.9 \pm 0.5$ | $-9.8 \pm 2.2$ | $\mathbf{-5.9 \pm 0.3}$ | $-30.3 \pm 28.2$ | $\mathbf{-6.3 \pm 0.3}$ |
| Halfcheetah | $2367.3 \pm 1274.5$ | $\mathbf{2588.6 \pm 955.1}$ | $1920.3 \pm 932.4$ | $\mathbf{2515.9 \pm 1216.4}$ | $1419.0 \pm 517.2$ | $\mathbf{1906.3 \pm 972.7}$ |

The experimental results demonstrate that our proposed $\mathcal{L}_{1}$-MBRL framework outperforms the baseline METRPO in 14 of the 15 environment and noise combinations in Table 1, the exception being noise-free Walker. Notably, the advantages of the $\mathcal{L}_{1}$ augmentation become more apparent under noisy conditions.
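
The comparison in Table 1 can be tallied with a few lines of Python. The sketch below copies only the mean values from the table (standard deviations are ignored, so it says nothing about statistical significance); the variable names are hypothetical.

```python
# (METRPO mean, L1-METRPO mean) for each noise setting, copied from Table 1.
NOISE_SETTINGS = ["noise-free", "sigma_a=0.1", "sigma_o=0.1"]
TABLE1_MEANS = {
    "Inv. P.": [(-51.3, -0.0), (-105.2, -0.0), (-74.22, -21.3)],
    "Swimmer": [(309.5, 313.8), (258.7, 322.7), (30.7, 79.2)],
    "Hopper": [(1140.1, 1491.4), (609.0, 868.7), (-1391.2, -486.6)],
    "Walker": [(-6.6, -6.9), (-9.8, -5.9), (-30.3, -6.3)],
    "Halfcheetah": [(2367.3, 2588.6), (1920.3, 2515.9), (1419.0, 1906.3)],
}

wins, total, exceptions = 0, 0, []
for env, pairs in TABLE1_MEANS.items():
    for setting, (baseline, augmented) in zip(NOISE_SETTINGS, pairs):
        total += 1
        if augmented > baseline:          # higher average reward is better
            wins += 1
        else:
            exceptions.append((env, setting))

print(f"L1-METRPO has the higher mean in {wins} of {total} cases")
print("exceptions:", exceptions)
```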

4.1 Ablation Study

We conduct an ablation study to separate the contributions of $\mathcal{L}_{1}$ control during training and testing. During testing, $\mathcal{L}_{1}$ control explicitly rejects system uncertainties and improves performance. During training, on the other hand, $\mathcal{L}_{1}$ additionally influences the learning by shifting the training dataset distribution. To evaluate the effect of $\mathcal{L}_{1}$ at each phase, we compare four scenarios: 1) no $\mathcal{L}_{1}$ during training or testing, 2) $\mathcal{L}_{1}$ applied only during testing, 3) $\mathcal{L}_{1}$ applied only during training, and 4) $\mathcal{L}_{1}$ applied during both training and testing. The results are summarized in Fig. 3.

Figure 3: Contribution of $\mathcal{L}_{1}$ in the training and testing phases. The notation $\mathcal{L}_{1}$ on (off)-on (off) indicates that $\mathcal{L}_{1}$ is applied (not applied) during training-testing, respectively. The error bars span one standard deviation of the performance. On-on and off-off correspond to our main result in Table 1. As expected, the on-on case achieved the highest performance in most scenarios.

The results depicted in Fig. 3 demonstrate that the influence of $\mathcal{L}_{1}$ control during training and testing varies across different environments and noise types. However, as anticipated, the highest performance is achieved in most scenarios when $\mathcal{L}_{1}$ is applied in both the training and testing phases.
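
The four ablation scenarios form a two-by-two grid over the training and testing phases. The following sketch enumerates that grid using the training-testing label convention of Fig. 3; the function name is hypothetical and the snippet does not run any experiment.

```python
from itertools import product


def ablation_scenarios():
    """Map scenario numbers 1-4 to (label, l1_in_training, l1_in_testing)."""
    scenarios = {}
    flags = product([False, True], repeat=2)   # (training, testing) on/off flags
    for number, (train_on, test_on) in enumerate(flags, start=1):
        label = f"{'on' if train_on else 'off'}-{'on' if test_on else 'off'}"
        scenarios[number] = (label, train_on, test_on)
    return scenarios


if __name__ == "__main__":
    for number, (label, train_on, test_on) in ablation_scenarios().items():
        print(f"scenario {number}: {label} (train={train_on}, test={test_on})")
```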

To evaluate the effectiveness of the $\mathcal{L}_{1}$-MBRL framework in addressing the sim-to-real gap, we conducted a secondary ablation study. First, we trained the model without $\mathcal{L}_{1}$ in a noise-free environment and subsequently tested the model with and without $\mathcal{L}_{1}$ in a noisy environment. The results are reported in Table 2. They indicate that our $\mathcal{L}_{1}$-MBRL framework effectively addresses the sim-to-real gap and suggest the potential for directly extending our framework to the offline MBRL setting, which presents promising opportunities for future research.

Table 2: Addressing the sim-to-real gap with $\mathcal{L}_{1}$ augmentation: The original METRPO was initially trained on an environment without uncertainty. Subsequently, the policy was deployed in a noisy environment that emulates real-world conditions, with and without $\mathcal{L}_{1}$ augmentation.

| Env. | $\sigma_{a}=0.1$: METRPO | $\sigma_{a}=0.1$: $\mathcal{L}_{1}$-METRPO | $\sigma_{o}=0.1$: METRPO | $\sigma_{o}=0.1$: $\mathcal{L}_{1}$-METRPO | $\sigma_{a}=0.1$ & $\sigma_{o}=0.1$: METRPO | $\sigma_{a}=0.1$ & $\sigma_{o}=0.1$: $\mathcal{L}_{1}$-METRPO |
| Inv. P. | $30.2 \pm 45.1$ | $\mathbf{-0.0 \pm 0.0}$ | $-74.1 \pm 53.1$ | $\mathbf{-3.1 \pm 2.0}$ | $-107.0 \pm 72.4$ | $\mathbf{-6.1 \pm 4.6}$ |
| Swimmer | $250.8 \pm 130.2$ | $\mathbf{330.5 \pm 5.7}$ | $\mathbf{337.8 \pm 2.9}$ | $331.2 \pm 8.34$ | $248.2 \pm 133.6$ | $\mathbf{327.3 \pm 6.8}$ |
| Hopper | $198.9 \pm 617.8$ | $\mathbf{623.4 \pm 405.6}$ | $-84.5 \pm 1035.8$ | $\mathbf{157.1 \pm 379.7}$ | $87.5 \pm 510.2$ | $\mathbf{309.8 \pm 477.8}$ |
| Walker | $\mathbf{-6.0 \pm 0.8}$ | $-6.3 \pm 0.7$ | $-6.4 \pm 0.4$ | $\mathbf{-6.08 \pm 0.6}$ | $-6.3 \pm 0.4$ | $\mathbf{-5.2 \pm 1.5}$ |
| Halfcheetah | $1845.8 \pm 600.9$ | $\mathbf{1965.3 \pm 839.5}$ | $1265.0 \pm 440.8$ | $\mathbf{1861.6 \pm 605.5}$ | $1355.0 \pm 335.6$ | $\mathbf{1643.6 \pm 712.5}$ |

5 Limitations

(Performance of the base MBRL) Our $\mathcal{L}_{1}$-MBRL scheme rejects uncertainty estimates derived from the learned nominal dynamics. As a result, the performance of $\mathcal{L}_{1}$-MBRL is inherently tied to the baseline MBRL algorithm, and $\mathcal{L}_{1}$ augmentation cannot independently guarantee good performance. This relates to the role of $\epsilon_{l}$ in Equation (15). Empirical evidence in Fig. 3 illustrates this point: despite meaningful improvements, the performance of scenarios augmented with $\mathcal{L}_{1}$ is closely correlated with that of METRPO without $\mathcal{L}_{1}$ augmentation.

(Trade-off in choosing $\epsilon_{a}$) In Sec. 3.2, we mentioned that as $\epsilon_{a}$ approaches zero, the baseline MBRL is recovered. This implies that for small values of $\epsilon_{a}$, the robustness properties exhibited by the $\mathcal{L}_{1}$ control are compromised. Conversely, if $\epsilon_{a}$ is increased excessively, it permits significant deviations between the control-affine and nonlinear models, potentially allowing for larger errors in the state predictor (see Section 3.3). Our heuristic observation from the experiments is to select an $\epsilon_{a}$ that results in approximately 0-100 affinization switches per 1000 time steps for systems with low complexity ($n<5$) and 200-500 switches for more complex systems.
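
To illustrate how the number of affinization switches depends on $\epsilon_{a}$, the sketch below counts switches over 1000 time steps for a scalar toy model. It is not the switching law of Sec. 3.2, which is not reproduced here: it assumes a first-order Taylor affinization about the most recent switching input and a switch whenever the affinization error exceeds $\epsilon_{a}$. The stand-in model $\sin(u)$, the input trajectory, and the function name are hypothetical.

```python
import math


def count_affinization_switches(eps_a, n_steps=1000):
    """Count re-affinizations of a toy scalar model along an input trajectory."""
    model = math.sin        # hypothetical stand-in for the learned model (input only)
    d_model = math.cos      # derivative of the stand-in model w.r.t. the input
    u_bar = 0.0             # input about which the current affine model is built
    switches = 0
    for k in range(n_steps):
        u = 2.0 * math.sin(0.01 * k)                          # hypothetical input
        affine = model(u_bar) + d_model(u_bar) * (u - u_bar)  # first-order Taylor model
        if abs(model(u) - affine) > eps_a:                    # affinization error too large
            u_bar = u                                         # re-affinize about current input
            switches += 1
    return switches


if __name__ == "__main__":
    for eps_a in (0.001, 0.01, 0.1, 10.0):
        print(f"eps_a = {eps_a}: {count_affinization_switches(eps_a)} switches per 1000 steps")
```

In this toy setting a smaller $\epsilon_{a}$ produces more switches, so a switch count per 1000 steps can serve as a practical handle when tuning $\epsilon_{a}$ toward the ranges suggested above.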

6 Conclusion

In this paper, we proposed $\mathcal{L}_{1}$-MBRL, a control-theoretic add-on scheme that robustifies MBRL algorithms against model and environment uncertainties. We affinize the trained nonlinear model according to a switching rule along the input trajectory, enabling the use of $\mathcal{L}_{1}$ adaptive control. Without perturbing the underlying MBRL algorithm, we were able to improve the overall performance in almost all scenarios, with and without aleatoric uncertainties.

The results open up several research directions. First, we wish to test the applicability of $\mathcal{L}_{1}$-MBRL to offline MBRL algorithms to address the sim-to-real gap (Kidambi et al., 2020; Yu et al., 2020). Moreover, its consolidation with a distributionally robust optimization problem to address the distribution shift is of interest. Finally, we will also research the $\mathcal{L}_{1}$-MBRL design for MBRL algorithms with probabilistic models (Chua et al., 2018; Wang & Ba, 2020) to explore a method that utilizes the covariance information in addition to the mean dynamics.

Acknowledgments

This work is financially supported by the National Aeronautics and Space Administration (NASA) ULI (80NSSC22M0070), NASA USRC (NNH21ZEA001N-USRC), the Air Force Office of Scientific Research (FA9550-21-1-0411), the National Science Foundation (NSF) AoF Robust Intelligence (2133656), NSF CMMI (2135925), and NSF SLES (2331878).

References

  • Ackerman et al. (2017) Kasey A Ackerman, Enric Xargay, Ronald Choe, Naira Hovakimyan, M Christopher Cotting, Robert B Jeffrey, Margaret P Blackstun, T Paul Fulkerson, Timothy R Lau, and Shawn S Stephens. Evaluation of an $\mathcal{L}_{1}$ Adaptive Flight Control Law on Calspan’s variable-stability Learjet. AIAA Journal of Guidance, Control, and Dynamics, 40(4):1051–1060, 2017.
  • Arevalo-Castiblanco et al. (2021) Miguel F Arevalo-Castiblanco, César A Uribe, and Eduardo Mojica-Nava. Model Reference Adaptive Control for Online Policy Adaptation and Network Synchronization. In IEEE Conference on Decision and Control, pp.  4071–4076, 2021.
  • Åström & Wittenmark (2013) Karl J Åström and Björn Wittenmark. Computer-Controlled Systems: Theory and Design. Courier Corporation, 2013.
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI gym. arXiv preprint arXiv:1606.01540, 2016.
  • Chen et al. (2021) Baiming Chen, Zuxin Liu, Jiacheng Zhu, Mengdi Xu, Wenhao Ding, Liang Li, and Ding Zhao. Context-Aware Safe Reinforcement Learning for Non-stationary Environments. In IEEE International Conference on Robotics and Automation, pp.  10689–10695, 2021.
  • Cheng et al. (2022) Yikun Cheng, Pan Zhao, Fanxin Wang, Daniel J Block, and Naira Hovakimyan. Improving the Robustness of Reinforcement Learning Policies with $\mathcal{L}_{1}$ Adaptive Control. IEEE Robotics and Automation Letters, 7(3):6574–6581, 2022.
  • Choi et al. (2020) Jason Choi, Fernando Castañeda, Claire J Tomlin, and Koushil Sreenath. Reinforcement Learning for Safety-Critical Control under Model Uncertainty, using Control Lyapunov Functions and Control Barrier Functions. In Robotics: Science and Systems, 2020.
  • Chua et al. (2018) Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep Reinforcement Learning in a Handful of Trials Using Probabilistic Dynamics Models. Advances in Neural Information Processing Systems, 31, 2018.
  • Clavera et al. (2018) Ignasi Clavera, Jonas Rothfuss, John Schulman, Yasuhiro Fujita, Tamim Asfour, and Pieter Abbeel. Model-Based Reinforcement Learning via Meta-Policy Optimization. In Conference on Robot Learning, pp.  617–629, 2018.
  • Deisenroth et al. (2013) Marc Peter Deisenroth, Dieter Fox, and Carl Edward Rasmussen. Gaussian Processes for Data-Efficient Learning in Robotics and Control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, 2013.
  • Finn et al. (2017) Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic Meta-Learning for Fast Adaptation of Deep Networks. In International Conference on Machine Learning, pp. 1126–1135, 2017.
  • Gregory et al. (2009) Irene Gregory, Chengyu Cao, Enric Xargay, Naira Hovakimyan, and Xiaotian Zou. $\mathcal{L}_{1}$ Adaptive Control Design for NASA AirSTAR Flight Test Vehicle. In AIAA Guidance, Navigation, and Control Conference, pp. 5738, 2009.
  • Gregory et al. (2010) Irene Gregory, Enric Xargay, Chengyu Cao, and Naira Hovakimyan. Flight Test of an $\mathcal{L}_{1}$ Adaptive Controller on the NASA AirSTAR Flight Test Vehicle. In AIAA Guidance, Navigation, and Control Conference, pp. 8015, 2010.
  • Heess et al. (2015) Nicolas Heess, Greg Wayne, David Silver, Timothy Lillicrap, Yuval Tassa, and Tom Erez. Learning Continuous Control Policies by Stochastic Value Gradients. arXiv preprint arXiv:1510.09142, 2015.
  • Hovakimyan & Cao (2010) Naira Hovakimyan and Chengyu Cao. $\mathcal{L}_{1}$ Adaptive Control Theory: Guaranteed Robustness with Fast Adaptation. SIAM, 2010.
  • Ibarz et al. (2021) Julian Ibarz, Jie Tan, Chelsea Finn, Mrinal Kalakrishnan, Peter Pastor, and Sergey Levine. How to train your robot with deep reinforcement learning: Lessons we have learned. The International Journal of Robotics Research, 40(4-5):698–721, 2021.
  • Khalil (2002) Hassan K Khalil. Nonlinear Systems, Third Edition. Prentice Hall, 2002.
  • Kharisov (2013) Evgeny Kharisov. $\mathcal{L}_{1}$ Adaptive Output-Feedback Control Architectures. PhD thesis, 2013.
  • Khojasteh et al. (2020) Mohammad Javad Khojasteh, Vikas Dhiman, Massimo Franceschetti, and Nikolay Atanasov. Probabilistic Safety Constraints for Learned High Relative Degree System Dynamics. In Learning for Dynamics and Control, pp.  781–792, 2020.
  • Kidambi et al. (2020) Rahul Kidambi, Aravind Rajeswaran, Praneeth Netrapalli, and Thorsten Joachims. MOREL: Model-Based Offline Reinforcement Learning. Advances in Neural Information Processing Systems, 33:21810–21823, 2020.
  • Knuth et al. (2021) Craig Knuth, Glen Chou, Necmiye Ozay, and Dmitry Berenson. Planning with Learned Dynamics: Probabilistic Guarantees on Safety and Reachability via Lipschitz Constants. IEEE Robotics and Automation Letters, 6(3):5129–5136, 2021.
  • Kocijan et al. (2004) Juš Kocijan, Roderick Murray-Smith, Carl Edward Rasmussen, and Agathe Girard. Gaussian Process Model Based Predictive Control. In IEEE American Control Conference, volume 3, pp. 2214–2219, 2004.
  • Koller et al. (2018) Torsten Koller, Felix Berkenkamp, Matteo Turchetta, and Andreas Krause. Learning-Based Model Predictive Control for Safe Exploration. In IEEE Conference on Decision and Control, pp.  6059–6066, 2018.
  • Kurutach et al. (2018) Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, and Pieter Abbeel. Model-Ensemble Trust-Region Policy Optimization. In International Conference on Learning Representations, 2018.
  • Lakshmanan et al. (2020) Arun Lakshmanan, Aditya Gahlawat, and Naira Hovakimyan. Safe Feedback Motion Planning: A Contraction Theory and $\mathcal{L}_{1}$ Adaptive Control Based Approach. In IEEE Conference on Decision and Control, pp.  1578–1583, 2020.
  • Levine & Koltun (2013) Sergey Levine and Vladlen Koltun. Guided Policy Search. In International Conference on Machine Learning, pp.  1–9, 2013.
  • Li et al. (2020) Junxiang Li, Liang Yao, Xin Xu, Bang Cheng, and Junkai Ren. Deep Reinforcement Learning for Pedestrian Collision Avoidance and Human-Machine Cooperative Driving. Information Sciences, 532:110–124, 2020.
  • Lillicrap et al. (2016) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous Control with Deep Reinforcement Learning. In International Conference on Learning Representations, 2016.
  • Loquercio et al. (2019) Antonio Loquercio, Elia Kaufmann, René Ranftl, Alexey Dosovitskiy, Vladlen Koltun, and Davide Scaramuzza. Deep Drone Racing: From Simulation to Reality with Domain Randomization. IEEE Transactions on Robotics, 36(1):1–14, 2019.
  • Manzano et al. (2020) José María Manzano, Daniel Limon, David Muñoz de la Peña, and Jan-Peter Calliess. Robust Learning-Based MPC for Nonlinear Constrained Systems. Automatica, 117:108948, 2020.
  • Mendelson (2022) Elliott Mendelson. Schaum’s Outline of Calculus. McGraw-Hill Education, 2022.
  • Milz et al. (2018) Stefan Milz, Georg Arbeiter, Christian Witt, Bassam Abdallah, and Senthil Yogamani. Visual SLAM for automated driving: Exploring the applications of deep learning. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp.  247–257, 2018.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level Control through Deep Reinforcement Learning. Nature, 518(7540):529–533, 2015.
  • Moerland et al. (2023) Thomas M Moerland, Joost Broekens, Aske Plaat, Catholijn M Jonker, et al. Model-Based Reinforcement Learning: A survey. Foundations and Trends in Machine Learning, 16(1):1–118, 2023.
  • Nagabandi et al. (2018) Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning. In IEEE International Conference on Robotics and Automation, pp.  7559–7566, 2018.
  • Nagabandi et al. (2019a) Anusha Nagabandi, Ignasi Clavera, Simin Liu, Ronald S Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Learning to Adapt in Dynamic, Real-World Environments through Meta-Reinforcement Learning. In International Conference on Learning Representations, 2019a.
  • Nagabandi et al. (2019b) Anusha Nagabandi, Chelsea Finn, and Sergey Levine. Deep Online Learning Via Meta-Learning: Continual Adaptation for Model-Based RL. In International Conference on Learning Representations, 2019b.
  • Neal et al. (2004) David Neal, Matthew Good, Christopher Johnston, Harry Robertshaw, William Mason, and Daniel Inman. Design and Wind-Tunnel Analysis of a Fully Adaptive Aircraft Configuration. In AIAA Structures, Structural Dynamics & Materials Conference, pp.  1727, 2004.
  • Nguyen & La (2019) Hai Nguyen and Hung La. Review of Deep Reinforcement Learning for Robot Manipulation. In IEEE International Conference on Robotic Computing, pp. 590–595, 2019.
  • Nichols et al. (1993) Robert A Nichols, Robert T Reichert, and Wilson J Rugh. Gain Scheduling for H-infinity Controllers: A Flight Control Example. IEEE Transactions on Control Systems Technology, 1(2):69–79, 1993.
  • Peng et al. (2018) Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-Real Transfer of Robotic Control with Dynamics Randomization. In IEEE International Conference on Robotics and Automation, pp.  3803–3810, 2018.
  • Pinto et al. (2017) Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust Adversarial Reinforcement Learning. In International Conference on Machine Learning, pp. 2817–2826, 2017.
  • Pravitra et al. (2020) Jintasit Pravitra, Kasey A Ackerman, Chengyu Cao, Naira Hovakimyan, and Evangelos A Theodorou. $\mathcal{L}_{1}$-Adaptive MPPI Architecture for Robust and Agile Control of Multirotors. In IEEE International Conference on Intelligent Robots and Systems, pp.  7661–7666, 2020.
  • Quinonero-Candela et al. (2008) Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D Lawrence. Dataset Shift in Machine Learning. MIT Press, 2008.
  • Ross et al. (2011) Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp.  627–635, 2011.
  • Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust Region Policy Optimization. In International Conference on Machine Learning, pp. 1889–1897, 2015.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Taylor et al. (2019) Andrew J Taylor, Victor D Dorobantu, Hoang M Le, Yisong Yue, and Aaron D Ames. Episodic Learning with Control Lyapunov Functions for Uncertain Robotic Systems. In IEEE International Conference on Intelligent Robots and Systems, pp.  6878–6884, 2019.
  • Tobin et al. (2017) Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. In IEEE International Conference on Intelligent Robots and Systems, pp.  23–30, 2017.
  • Wang & Ba (2020) Tingwu Wang and Jimmy Ba. Exploring Model-based Planning with Policy Networks. In International Conference on Learning Representations, 2020.
  • Wang et al. (2019) Tingwu Wang, Xuchan Bao, Ignasi Clavera, Jerrick Hoang, Yeming Wen, Eric Langlois, Shunshi Zhang, Guodong Zhang, Pieter Abbeel, and Jimmy Ba. Benchmarking Model-Based Reinforcement Learning. arXiv preprint arXiv:1907.02057, 2019.
  • Wang & Hovakimyan (2011) Xiaofeng Wang and Naira Hovakimyan. $\mathcal{L}_{1}$ Adaptive Controller for Nonlinear Reference Systems. In IEEE American Control Conference, pp.  594–599, 2011.
  • Wang et al. (2018) Xin Wang, Wenhu Chen, Jiawei Wu, Yuan-Fang Wang, and William Yang Wang. Video Captioning via Hierarchical Reinforcement Learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp.  4213–4222, 2018.
  • Wang et al. (2022) Xu Wang, Sen Wang, Xingxing Liang, Dawei Zhao, Jincai Huang, Xin Xu, Bin Dai, and Qiguang Miao. Deep Reinforcement Learning: A Survey. IEEE Transactions on Neural Networks and Learning Systems, 2022.
  • Wu et al. (2017) Jiajun Wu, Erika Lu, Pushmeet Kohli, Bill Freeman, and Josh Tenenbaum. Learning to See Physics via Visual De-animation. Advances in Neural Information Processing Systems, 30, 2017.
  • Wu et al. (2018) Lijun Wu, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. A Study of Reinforcement Learning for Neural Machine Translation. In Conference on Empirical Methods in Natural Language Processing, pp.  3612–3621, 2018.
  • Wu et al. (2022) Zhuohuan Wu, Sheng Cheng, Kasey A Ackerman, Aditya Gahlawat, Arun Lakshmanan, Pan Zhao, and Naira Hovakimyan. $\mathcal{L}_{1}$ Adaptive Augmentation for Geometric Tracking Control of Quadrotors. In International Conference on Robotics and Automation, pp. 1329–1336, 2022.
  • Yu et al. (2020) Tianhe Yu, Garrett Thomas, Lantao Yu, Stefano Ermon, James Y Zou, Sergey Levine, Chelsea Finn, and Tengyu Ma. MOPO: Model-Based Offline Policy Optimization. Advances in Neural Information Processing Systems, 33:14129–14142, 2020.
  • Yun et al. (2017) Sangdoo Yun, Jongwon Choi, Youngjoon Yoo, Kimin Yun, and Jin Young Choi. Action-Decision Networks for Visual Tracking with Deep Reinforcement Learning. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 2711–2720, 2017.
  • Zhao et al. (2020) Wenshuai Zhao, Jorge Peña Queralta, and Tomi Westerlund. Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: A Survey. In IEEE Symposium Series on Computational Intelligence, pp. 737–744, 2020.
  • Zheng et al. (2022) Ruijie Zheng, Xiyao Wang, Huazhe Xu, and Furong Huang. Is Model Ensemble Necessary? Model-based RL via a Single Model with Lipschitz Regularized Value Function. In The International Conference on Learning Representations, 2022.

Appendix

Appendix A Extended Description of $\mathcal{L}_1$ Adaptive Control

In this section, we provide a detailed explanation of $\mathcal{L}_1$ adaptive control. Consider the following system dynamics:

\[
\dot{x}(t) = g(x(t)) + h(x(t))u(t) + d(t,x(t),u(t)), \quad x(0) = x_0,
\tag{18}
\]

where $x(t)\in\mathbb{R}^{n}$ is the system state vector, $u(t)\in\mathbb{R}^{m}$ is the control signal, $g:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ and $h:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n\times m}$ are known functions that define the desired dynamics, both of which are locally Lipschitz continuous functions. Furthermore, $d(t,x(t),u(t))\in\mathbb{R}^{n}$ represents the unknown nonlinearities and is continuous in its arguments. We now decompose $d(t,x(t),u(t))$ with respect to the range and kernel of $h(x(t))$ to obtain

\[
\dot{x}(t) = g(x(t)) + h(x(t))\left(u(t) + \sigma_{m}(t,x(t),u(t))\right) + h^{\perp}(x(t))\sigma_{um}(t,x(t),u(t)), \quad x(0) = x_0,
\tag{19}
\]

where $h(x(t))\sigma_{m}(t,x(t),u(t)) + h^{\perp}(x(t))\sigma_{um}(t,x(t),u(t)) = d(t,x(t),u(t))$. Moreover, $h^{\perp}(x(t))\in\mathbb{R}^{n\times(n-m)}$ denotes a matrix whose columns are perpendicular to those of $h(x(t))\in\mathbb{R}^{n\times m}$, such that $h(x(t))^{\top}h^{\perp}(x(t)) = 0$ for any $x(t)\in\mathbb{R}^{n}$. The existence of $h^{\perp}(x(t))$ is guaranteed if $h(x(t))$ is a full-rank matrix.
The terms $\sigma_{m}(t,x(t),u(t))$ and $\sigma_{um}(t,x(t),u(t))$ are commonly referred to as matched and unmatched uncertainties, respectively.
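The decomposition can be illustrated on a toy case. The following is a minimal sketch, assuming $n=2$ and $m=1$ so that $h$ is a single nonzero column and $h^{\perp}$ can be written by hand; the function name `decompose` and the numerical values are hypothetical and not taken from the paper.

```python
def decompose(h, d):
    """Split d in R^2 into matched and unmatched parts for one input column h.

    Assumption (illustrative only): n = 2, m = 1, so h_perp is obtained by
    rotating h by 90 degrees, which guarantees h^T h_perp = 0.
    """
    h1, h2 = h
    norm_sq = h1 * h1 + h2 * h2
    if norm_sq == 0.0:
        raise ValueError("h must be nonzero (full column rank)")
    h_perp = (-h2, h1)
    # Project d onto h and onto h_perp; both columns have squared norm norm_sq.
    sigma_m = (h1 * d[0] + h2 * d[1]) / norm_sq
    sigma_um = (h_perp[0] * d[0] + h_perp[1] * d[1]) / norm_sq
    return sigma_m, sigma_um, h_perp


h = (1.0, 2.0)   # hypothetical input column h(x)
d = (3.0, 1.0)   # hypothetical uncertainty sample d(t, x, u)
sigma_m, sigma_um, h_perp = decompose(h, d)

# Reassemble d = h * sigma_m + h_perp * sigma_um to check the decomposition.
reconstructed = tuple(h[k] * sigma_m + h_perp[k] * sigma_um for k in range(2))
```

Only the component `sigma_m` lies in the range of `h` and can therefore be cancelled directly through the input channel; `sigma_um` cannot.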

Consider the nominal system in the absence of uncertainties

\[
\dot{x}^{\star}(t) = g(x^{\star}(t)) + h(x^{\star}(t))u^{\star}(t), \quad x^{\star}(0) = x_0,
\]
Figure 4: The architecture of $\mathcal{L}_1$ adaptive controller.

where $u^{\star}(t)$ is the baseline input designed so that the desired performance and safety requirements are satisfied. If we pass the baseline input to the true system in Equation (18), the actual state trajectory $x(t)$ can diverge from the desired state trajectory $x^{\star}(t)$ in an unquantifiable manner due to the presence of the uncertainties $d(t,x(t),u(t))$. To avoid this behavior, we employ $\mathcal{L}_1$ adaptive control, which aims to compute an input $u_{a}(t)$ that, when combined with the nominal input $u^{\star}(t)$, forms the augmented input

\[
u(t) = u^{\star}(t) + u_{a}(t).
\tag{20}
\]

The objective of this approach is to ensure that the true state $x(t)$ remains uniformly and quantifiably bounded around the nominal trajectory $x^{\star}(t)$ under certain conditions.

Next, we explain how the $\mathcal{L}_1$ adaptive controller achieves this goal. The $\mathcal{L}_1$ adaptive control has three components: the state predictor, the adaptation law, and a low-pass filter. The state predictor for Equation (18) is given by

\[
\dot{\hat{x}}(t) = g(x(t)) + h(x(t))u(t) + \hat{\sigma}(t) + A_{s}\tilde{x}(t),
\tag{21}
\]

where $\hat{\sigma}(t) \triangleq h(x(t))\hat{\sigma}_{m}(t) + h^{\perp}(x(t))\hat{\sigma}_{um}(t)$. Moreover, $\hat{\sigma}_{m}(t)$ and $\hat{\sigma}_{um}(t)$ are the estimates of the matched and unmatched uncertainties $\sigma_{m}(t,x(t),u(t))$ and $\sigma_{um}(t,x(t),u(t))$, respectively. The initial conditions are given by $\hat{x}(0) = x_0$, $\hat{\sigma}_{m}(0) = 0$, and $\hat{\sigma}_{um}(0) = 0$.
Here, $\hat{x}(t)\in\mathbb{R}^{n}$ is the state of the predictor, $u(t)\in\mathbb{R}^{m}$ is the augmented control input in Equation (20), $\tilde{x}(t) = \hat{x}(t) - x(t)$ denotes the prediction error, and $A_{s}\in\mathbb{S}^{n}$ is a chosen Hurwitz matrix. The state predictor produces the state estimate $\hat{x}(t)$ induced by the adaptive estimates $\hat{\sigma}_{m}(t)$ and $\hat{\sigma}_{um}(t)$.

Following the true dynamics in Equation (18) and the state predictor in Equation (21), the dynamics of the prediction error $\tilde{x}(t) = \hat{x}(t) - x(t)$ are given by

\[
\dot{\tilde{x}}(t) = A_{s}\tilde{x}(t) + \left[\hat{\sigma}(t) - d(t,x(t),u(t))\right], \quad \tilde{x}(0) = 0.
\tag{22}
\]

Here, we refer to $\hat{\sigma}(t) - d(t,x(t),u(t))$ as the uncertainty estimation error. Moreover, since $A_{s}$ is Hurwitz, the system in Equation (22) governing the evolution of $\tilde{x}(t)$ is exponentially stable in the absence of the exogenous input $\hat{\sigma}(t) - d(t,x(t),u(t))$. Therefore, $\tilde{x}(t)$ serves as a learning signal for the adaptive law, which we describe next.
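As a brief check, the error dynamics follow by subtracting the true dynamics in Equation (18) from the state predictor in Equation (21). Because the predictor evaluates $g$ and $h$ at the measured state $x(t)$ and uses the same input $u(t)$, those terms cancel:

\[
\dot{\tilde{x}}(t) = \dot{\hat{x}}(t) - \dot{x}(t)
= \left[g(x(t)) + h(x(t))u(t) + \hat{\sigma}(t) + A_{s}\tilde{x}(t)\right] - \left[g(x(t)) + h(x(t))u(t) + d(t,x(t),u(t))\right]
= A_{s}\tilde{x}(t) + \hat{\sigma}(t) - d(t,x(t),u(t)).
\]

The initial condition $\tilde{x}(0) = 0$ follows from $\hat{x}(0) = x(0) = x_0$.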

We employ a piecewise constant estimation scheme (Hovakimyan & Cao, 2010) based on the solution of Equation (22), which can be expressed using the following equation:

\[
\tilde{x}(t) = \exp(A_{s}t)\tilde{x}(0) + \int_{0}^{t}\exp\left(A_{s}(t-\tau)\right)\left(\hat{\sigma}(\tau) - d(\tau,x(\tau),u(\tau))\right)d\tau.
\tag{23}
\]

For a given sampling time $T_{s} > 0$, we use a piecewise constant estimate defined as

\[
\hat{\sigma}(t) = \hat{\sigma}(iT_{s}), \quad t\in[iT_{s},(i+1)T_{s}), \quad i\in\mathbb{N}\cup\{0\}.
\]

When the system is initialized ($i=0$), we set $\tilde{x}(t) = 0$, which implies $\hat{\sigma}(t) = 0$ for $t\in[0,T_{s})$. Now consider a time index $i\in\mathbb{N}$ and the time interval $[iT_{s},(i+1)T_{s})$. Over this time interval, the solution of Equation (23), obtained by applying the piecewise constant representation, can be written as

\[
\tilde{x}(t) = \exp\left(A_{s}(t-iT_{s})\right)\tilde{x}(iT_{s}) + \int_{iT_{s}}^{t}\exp\left(A_{s}(t-\tau)\right)\left(\hat{\sigma}(\tau) - d(\tau,x(\tau),u(\tau))\right)d\tau, \quad t\in[iT_{s},(i+1)T_{s}).
\]

Now, note that over the previous interval $t\in[(i-1)T_{s},iT_{s})$ the system was affected by uncertainty $d(t,x(t),u(t))$, which resulted in $\tilde{x}(iT_{s})\neq 0$. At the end of the time interval $[iT_{s},(i+1)T_{s})$, we obtain

\[
\begin{aligned}
\tilde{x}((i+1)T_{s}) &= \exp(A_{s}T_{s})\tilde{x}(iT_{s}) + \int_{iT_{s}}^{(i+1)T_{s}}\exp\left(A_{s}((i+1)T_{s}-\tau)\right)\hat{\sigma}(iT_{s})\,d\tau + R_{(i+1)T_{s}} \\
&= \exp(A_{s}T_{s})\tilde{x}(iT_{s}) + A_{s}^{-1}\left(\exp(A_{s}T_{s}) - \mathbb{I}_{n}\right)\hat{\sigma}(iT_{s}) + R_{(i+1)T_{s}},
\end{aligned}
\tag{24}
\]

where

\[
R_{(i+1)T_{s}} \triangleq -\int_{iT_{s}}^{(i+1)T_{s}}\exp\left(A_{s}((i+1)T_{s}-\tau)\right)d(\tau,x(\tau),u(\tau))\,d\tau
\]

is the residual term that captures the uncertainty that entered during $[iT_{s},(i+1)T_{s})$.
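For readers who want the intermediate step, the second equality in Equation (24) uses the fact that $\hat{\sigma}(iT_{s})$ is constant over the interval, together with the change of variables $s = (i+1)T_{s} - \tau$. Since $A_{s}$ is Hurwitz, it has no zero eigenvalue and is therefore invertible, which gives

\[
\int_{iT_{s}}^{(i+1)T_{s}}\exp\left(A_{s}((i+1)T_{s}-\tau)\right)d\tau
= \int_{0}^{T_{s}}\exp(A_{s}s)\,ds
= A_{s}^{-1}\left(\exp(A_{s}T_{s}) - \mathbb{I}_{n}\right).
\]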

Let

\[
\hat{\sigma}(iT_{s}) = -\Phi^{-1}(T_{s})\mu(iT_{s}), \quad \hat{\sigma}(0) = 0,
\tag{25}
\]

where $\Phi(T_{s}) = A_{s}^{-1}\left(\exp(A_{s}T_{s}) - \mathbb{I}\right)$ and $\mu(iT_{s}) = \exp(A_{s}T_{s})\tilde{x}(iT_{s})$ for $i = 0,1,2,\cdots$. Substituting this into Equation (24) removes the first two terms, leaving us only with the residual term, which will appear as the initial condition of the next time interval. In other words, the adaptation law attempts to remove the effect of the uncertainty introduced in the current time interval by addressing it at the start of the subsequent interval. Interested readers can refer to (Kharisov, 2013, Ch. 2) for further details on the piecewise constant adaptive law.
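The behavior of the piecewise constant law can be sketched numerically. The example below is a minimal illustration under strong simplifying assumptions that are not from the paper: a scalar system ($n=1$), a scalar Hurwitz gain `a_s < 0`, and a constant uncertainty `d`, so that the integrals in Equation (24) have closed forms. The function names and parameter values are hypothetical.

```python
import math


def phi(a_s, T_s):
    """Scalar version of Phi(T_s) = A_s^{-1} (exp(A_s T_s) - I)."""
    return (math.exp(a_s * T_s) - 1.0) / a_s


def adaptation_law(x_tilde, a_s, T_s):
    """Scalar version of Equation (25): sigma_hat = -Phi^{-1} * mu."""
    mu = math.exp(a_s * T_s) * x_tilde
    return -mu / phi(a_s, T_s)


def propagate_error(x_tilde, sigma_hat, d, a_s, T_s):
    """Scalar version of Equation (24) for a constant uncertainty d.

    With d constant, the residual term equals -Phi(T_s) * d.
    """
    return math.exp(a_s * T_s) * x_tilde + phi(a_s, T_s) * (sigma_hat - d)


def run(d=2.0, a_s=-10.0, T_s=0.01, steps=5):
    x_tilde = 0.0            # prediction error starts at zero
    estimates = []
    for _ in range(steps):
        sigma_hat = adaptation_law(x_tilde, a_s, T_s)   # held over the interval
        estimates.append(sigma_hat)
        x_tilde = propagate_error(x_tilde, sigma_hat, d, a_s, T_s)
    return estimates


estimates = run()
```

In this constant-uncertainty setting, the estimate is zero on the first interval and equals $\exp(a_{s}T_{s})\,d$ on every later interval, so it approaches $d$ as $T_{s}$ shrinks, consistent with the one-interval delay described above.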

Note that a small sampling time $T_{s}$ results in a small prediction error $\|\tilde{x}(iT_{s})\|$ for each $i = 1,2,\cdots$. Therefore, it is desirable to keep $T_{s}$ as small as the hardware permits. However, setting a small $T_{s}$ and/or having large eigenvalues of $A_{s}$ can lead to a high adaptation gain ($\Phi^{-1}$ in Equation (25)). This can result in high-frequency uncertainty estimation, which can reduce the robustness of the controlled system if we directly apply $u_{a}(t) = -\hat{\sigma}_{m}(t)$ to cancel the estimated matched uncertainty. Therefore, we use a low-pass filter in the controller to decouple the fast estimation from the control loop, allowing us to employ an arbitrarily fast adaptation while maintaining the desired robustness. To be specific, the input $u_{a}(t)$ is given by

\[
u_{a}(s) = -C(s)\mathfrak{L}\left[\hat{\sigma}_{m}(t)\right],
\]

where $\mathfrak{L}[\cdot]$ denotes the Laplace transform, and $C(s)$ is the low-pass filter. The cutoff frequency of the low-pass filter is chosen to satisfy small-gain stability conditions, examples of which can be found in (Wang & Hovakimyan, 2011; Lakshmanan et al., 2020). We refer the interested reader to (Pravitra et al., 2020; Wu et al., 2022; Cheng et al., 2022) for further reading on the design process of the $\mathcal{L}_1$ adaptive control.
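One possible discrete-time realization of the filtered input is sketched below. The paper does not fix the structure of $C(s)$ in this appendix, so the sketch assumes a first-order filter $C(s) = \omega/(s+\omega)$ with a hypothetical cutoff `omega`, a scalar matched estimate, and an estimate that is held constant over each sampling interval; all names and values are illustrative.

```python
import math


def filter_step(u_a, sigma_m_hat, omega, T_s):
    """One exact zero-order-hold step of du_a/dt = -omega * (u_a + sigma_m_hat).

    This is the time-domain form of u_a(s) = -C(s) * sigma_m_hat(s) for the
    assumed first-order filter C(s) = omega / (s + omega).
    """
    decay = math.exp(-omega * T_s)
    return decay * u_a - (1.0 - decay) * sigma_m_hat


def filtered_input(sigma_m_hat_samples, omega=5.0, T_s=0.01):
    """Run the filter over a sequence of piecewise constant estimates."""
    u_a = 0.0                 # filter state starts at rest
    history = []
    for sigma_m_hat in sigma_m_hat_samples:
        u_a = filter_step(u_a, sigma_m_hat, omega, T_s)
        history.append(u_a)
    return history


# Constant matched estimate of 1.0 over 2000 sampling intervals.
u_hist = filtered_input([1.0] * 2000)
```

With a constant estimate, the filtered input moves smoothly from zero toward the negative of the estimate at a rate set by `omega`, rather than jumping there in a single step as the unfiltered choice $u_{a}(t) = -\hat{\sigma}_{m}(t)$ would.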

Appendix B Remarks on Assumption 1

It is evident from Equation (10) that our method relies on the continuous differentiability of $\hat{F}_{\theta}(x(t),u(t))$ with respect to $u(t)$ to ensure the continuity of $\hat{F}_{\theta}^{a}(x(t),u(t))$. Such a requirement is readily satisfied when using $\mathcal{C}^{1}$ (or higher-order continuously differentiable) activation functions for $\hat{F}_{\theta}(x(t),u(t))$, such as sigmoid, tanh, or swish. For MBRL algorithms that use activation functions that are not $\mathcal{C}^{1}$, we can skip the switching law (Equation (11)) to avoid making affine approximations at non-differentiable points (e.g., the origin for ReLU).

The first part of Assumption 1 on the regularity of $F(x(t),u(t))$ and $W(t,x(t))$ is standard to ensure the well-posedness (uniqueness and existence of solutions) for Equation (13) (Khalil, 2002, Theorem 3.1). Furthermore, as stated above, since $\hat{F}_{\theta}(x(t),u(t))$ is $\mathcal{C}^{1}$ in its arguments, its derivative is continuous, and hence $\hat{F}_{\theta}(x(t),u(t))$ trivially satisfies the local Lipschitz continuity over the compact sets $\mathcal{X}$ and $\mathcal{U}$.

The assumption in Equation (15) is satisfied owing to the Lipschitz continuity of the function in its respective arguments. If the bound is unknown, it is possible to collect data by interacting with the environment under a specific policy and initial condition, and then compute a probabilistic bound. However, such a bound is applicable only to the chosen data set and may not hold for other choices of samples, which is the well-known issue of distribution shift (Quinonero-Candela et al., 2008). This assumption, which explicitly bounds the unknown component of the dynamics, cannot be guaranteed in the real environment, but it is commonly made in assessing theoretical guarantees of error propagation when using learned models, as seen in previous papers (Knuth et al., 2021; Manzano et al., 2020; Koller et al., 2018).

Appendix C Proof of Theorem 1

Proof of Theorem 1.

We begin by applying the triangle inequality for the estimation error:

\[
\begin{aligned}
\|e(t,x(t),u(t))\|
&= \|F(x(t),u(t)) + W(t,x(t),u(t)) - G_{\theta}(x(t)) - H_{\theta}(x(t))u(t) - \hat{\sigma}(t)\| \\
&= \|\hat{F}_{\theta}(x(t),u(t)) + l(t,x(t),u(t)) - \hat{F}^{a}_{\theta}(x(t),u(t)) - \hat{\sigma}(t)\| \\
&\leq \|\hat{F}_{\theta}(x(t),u(t)) - \hat{F}^{a}_{\theta}(x(t),u(t))\| + \|l(t,x(t),u(t)) - \hat{\sigma}(t)\| \\
&\leq \epsilon_{a} + \|l(t,x(t),u(t)) - \hat{\sigma}(t)\|, \quad i\in\{0\}\cup\mathbb{N},
\end{aligned}
\tag{26}
\]

where we used Equation (11).

For the case when $i=0$, due to the initial conditions $\tilde{x}(0)=0$, $\hat{\sigma}(0)=0$, and the assumption in Equation (15), we get that

$$\|l(t,x(t),u(t)) - \hat{\sigma}(0)\| = \|l(t,x(t),u(t))\| \leq \epsilon_l, \quad \forall t \in [0,T_s).$$

Substituting this expression into Equation (26) proves the stated result for $t \in [0,T_s)$.

Next, we bound the term $\|l(t,x(t),u(t)) - \hat{\sigma}(iT_s)\|$ in Equation (26) for all $t \in [T_s,t_{\max})$. Consider an $i \in \mathbb{N}$ that corresponds to $t \in [T_s,t_{\max})$, i.e., $i \in \{1,\dots,\lfloor t_{\max}/T_s \rfloor\} \triangleq \mathcal{I}$. For any such $i$, substituting the adaptation law from Equation (25) into Equation (24) for the interval $[(i-1)T_s, iT_s)$ produces the following expression

$$\tilde{x}(iT_s) = -\int_{(i-1)T_s}^{iT_s} \exp(A_s(iT_s-\tau))\, d(\tau,x(\tau),u(\tau))\, d\tau, \quad \forall i \in \mathcal{I}. \tag{27}$$

Replacing $d(\tau,x(\tau),u(\tau)) = \begin{bmatrix} d_1(\tau,x(\tau),u(\tau)) & \dots & d_n(\tau,x(\tau),u(\tau)) \end{bmatrix}^{\top}$ in Equation (27) and using the definition of $A_s$ in the theorem statement, we obtain

$$\tilde{x}(iT_s) = -\int_{(i-1)T_s}^{iT_s} \begin{bmatrix} \exp(\lambda_1(iT_s-\tau))\, d_1(\tau,x(\tau),u(\tau)) \\ \vdots \\ \exp(\lambda_n(iT_s-\tau))\, d_n(\tau,x(\tau),u(\tau)) \end{bmatrix} d\tau,$$

for all $i \in \mathcal{I}$. For brevity, we denote $d_j(\tau) = d_j(\tau,x(\tau),u(\tau))$ for $j = 1,2,\ldots,n$.

Since $d_j(t)$ is continuous due to Assumption 1 and $\exp(A_s(iT_s-\tau))$ is positive semi-definite, we invoke the Mean Value Theorem (Mendelson, 2022, Sec. 24.1) element-wise. We conclude that there exist $t_{c_j} \in [(i-1)T_s, iT_s)$ for each $j \in \{1,\dots,n\}$ such that

\begin{align}
\tilde{x}(iT_s) &= -\begin{bmatrix} \int_{(i-1)T_s}^{iT_s} \exp(\lambda_1(iT_s-\tau))\, d_1(t_{c_1})\, d\tau \\ \vdots \\ \int_{(i-1)T_s}^{iT_s} \exp(\lambda_n(iT_s-\tau))\, d_n(t_{c_n})\, d\tau \end{bmatrix} \tag{28} \\
&= \begin{bmatrix} \frac{1}{\lambda_1}(1-\exp(\lambda_1 T_s))\, d_1(t_{c_1}) \\ \vdots \\ \frac{1}{\lambda_n}(1-\exp(\lambda_n T_s))\, d_n(t_{c_n}) \end{bmatrix}. \nonumber
\end{align}

Substituting Equation (28) into Equation (25) gives

$$\hat{\sigma}(t) = \hat{\sigma}(iT_s) = \begin{bmatrix} \exp(\lambda_1 T_s)\, d_1(t_{c_1}) \\ \vdots \\ \exp(\lambda_n T_s)\, d_n(t_{c_n}) \end{bmatrix}, \tag{29}$$

which is the piece-wise constant estimate of the uncertainty for $t \in [iT_s,(i+1)T_s)$.
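The step from Equation (28) to Equation (29) can be checked numerically. The sketch below assumes that Equation (25), which lies outside this section, has the standard piecewise-constant form $\hat{\sigma}(iT_s) = -\Phi^{-1}(T_s)\exp(A_sT_s)\tilde{x}(iT_s)$ with $\Phi(T_s) = A_s^{-1}(\exp(A_sT_s)-\mathbb{I}_n)$, and that the uncertainty $d$ is constant over one sampling interval. The function names and numerical values are illustrative only.

```python
import numpy as np

def prediction_error(lams, d, Ts, steps=20000):
    """Evaluate Eq. (27) for diagonal A_s and constant d with the midpoint rule."""
    # Substitute s = i*Ts - tau, which runs over (0, Ts]; use interval midpoints.
    s = (np.arange(steps) + 0.5) * (Ts / steps)
    integral = np.exp(np.outer(lams, s)).sum(axis=1) * (Ts / steps)
    return -integral * d

def sigma_hat(lams, x_tilde, Ts):
    """ASSUMED form of the piecewise-constant law (Eq. (25) is not shown here)."""
    phi = (np.exp(lams * Ts) - 1.0) / lams   # diagonal of A_s^{-1}(exp(A_s Ts) - I)
    return -(np.exp(lams * Ts) / phi) * x_tilde

lams = np.array([-1.0, -2.0, -5.0])   # diagonal of a Hurwitz A_s (illustrative)
d = np.array([0.3, -0.7, 1.1])        # uncertainty held constant on the interval
Ts = 0.05

x_tilde = prediction_error(lams, d, Ts)
closed_form = (1.0 - np.exp(lams * Ts)) / lams * d   # second line of Eq. (28)
estimate = sigma_hat(lams, x_tilde, Ts)              # compare with Eq. (29)

print(np.max(np.abs(x_tilde - closed_form)))
print(np.max(np.abs(estimate - np.exp(lams * Ts) * d)))
```

Both printed discrepancies should be at the level of quadrature error, which supports reading Equation (29) as the uncertainty scaled element-wise by $\exp(\lambda_j T_s)$.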

Next, using the piece-wise constant adaptation law from Equation (29), we bound $\|l(t,x(t),u(t)) - \hat{\sigma}(t)\|$, for $t \in [iT_s,(i+1)T_s)$. Since $A_s$ is Hurwitz and diagonal, its diagonal elements satisfy $\lambda_j < 0$, for $j = 1,\ldots,n$. Thus, we have

\begin{align}
\|l(t,x(t),u(t)) - \hat{\sigma}(t)\| &= \left\| l(t,x(t),u(t)) - \begin{bmatrix} \exp(\lambda_1 T_s)\, d_1(t_{c_1}) \\ \vdots \\ \exp(\lambda_n T_s)\, d_n(t_{c_n}) \end{bmatrix} \right\| \tag{33} \\
&= \|l(t,x(t),u(t)) - d(t_c) + (\mathbb{I}_n - \exp(A_sT_s))\, d(t_c)\| \nonumber \\
&\leq \|l(t,x(t),u(t)) - d(t_c)\| + \|\mathbb{I}_n - \exp(A_sT_s)\| \|d(t_c)\| \nonumber \\
&\leq \|l(t,x(t),u(t)) - d(t_c)\| + (1-\exp(\lambda_{\min}T_s)) \|d(t_c)\|, \tag{34}
\end{align}

where $d(t_c) := \begin{bmatrix} d_1(t_{c_1}) & \dots & d_n(t_{c_n}) \end{bmatrix}^{\top}$, $d(\tau) = l(\tau,x(\tau),u(\tau)) + a(x(\tau),u(\tau))$, and $\lambda_{\min}$ denotes the eigenvalue of $A_s$ that has the minimum absolute value. Using the triangle inequality, we get $\|d(t_c)\| = \|l(t_c,x(t_c),u(t_c)) + a(x(t_c),u(t_c))\| \leq \epsilon_l + \epsilon_a$. Thus, Equation (33) can be written as

$$\|l(t,x(t),u(t)) - \hat{\sigma}(t)\| \leq \|l(t,x(t),u(t)) - d(t_c)\| + (1-\exp(\lambda_{\min}T_s))(\epsilon_l + \epsilon_a). \tag{35}$$

Next, we obtain the upper bound for $\|l(t,x(t),u(t)) - d(t_c)\|$ as

\begin{align}
\|l(t,x(t),u(t)) - d(t_c)\| &= \|l(t,x(t),u(t)) - l(t_c,x(t_c),u(t_c)) - a(x(t_c),u(t_c))\| \nonumber \\
&\leq \|l(t,x(t),u(t)) - l(t_c,x(t_c),u(t_c))\| + \|a(x(t_c),u(t_c))\| \nonumber \\
&\leq \|l(t,x(t),u(t)) - l(t_c,x(t_c),u(t_c))\| + \epsilon_a, \tag{36}
\end{align}

where we used the triangle inequality and Equation (11). By Assumption 1, $l(t,x(t),u(t))$ is Lipschitz over the domain of its arguments. Hence, there exist positive scalars $L_{l,t}, L_{l,x}, L_{l,u}$ such that

$$\|l(t,x(t),u(t)) - l(t_c,x(t_c),u(t_c))\| \leq L_{l,t}|t-t_c| + L_{l,x}\|x(t)-x(t_c)\| + L_{l,u}\|u(t)-u(t_c)\|. \tag{37}$$

Furthermore, due to the compactness of $\mathcal{X}$ and $\mathcal{U}$, there exist $L_{x,t}, L_{u,t}$ such that the following inequalities hold:

$$\|x(t)-x(t_c)\| \leq L_{x,t}|t-t_c|, \quad \|u(t)-u(t_c)\| \leq L_{u,t}|t-t_c|.$$

Substituting these bounds into Equation (37), we get

$$\|l(t,x(t),u(t)) - l(t_c,x(t_c),u(t_c))\| \leq L|t-t_c| \leq LT_s, \tag{38}$$

where $L \triangleq L_{l,t} + L_{l,x}L_{x,t} + L_{l,u}L_{u,t} < \infty$. We now apply the derived bounds in sequence: substituting Equation (38) into Equation (36), and then using the resulting bound in Equation (35). The proof is concluded by incorporating the final bound into Equation (26) and noting that

$$(1-\exp(\lambda_{\min}T_s))(\epsilon_l + \epsilon_a) + LT_s \in \mathcal{O}(T_s).$$
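The membership in $\mathcal{O}(T_s)$ can be made explicit with the elementary inequality $1 - e^{-z} \leq z$ for $z \geq 0$. The short derivation below is a sketch added for completeness; it uses only the constants already defined above together with $\lambda_{\min} < 0$:

```latex
\begin{align*}
(1-\exp(\lambda_{\min}T_s))(\epsilon_l+\epsilon_a) + LT_s
&\leq |\lambda_{\min}|\,T_s\,(\epsilon_l+\epsilon_a) + LT_s \\
&= \big(|\lambda_{\min}|(\epsilon_l+\epsilon_a) + L\big)\,T_s .
\end{align*}
```

The factor multiplying $T_s$ does not depend on $T_s$, so the expression vanishes at least linearly as $T_s \to 0$.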

Remark 2.

We provide some insights into the interpretation of this theorem. The theorem serves to quantify the predictive quality of the state predictor in the $\mathcal{L}_1$ add-on scheme in terms of the model approximation errors $\epsilon_l$ and $\epsilon_a$, and the parameters governing the $\mathcal{L}_1$ add-on scheme ($T_s$, $\lambda_{\min}$). As the control input is computed by low-pass filtering the uncertainty estimate, the performance of the $\mathcal{L}_1$ augmentation is inherently tied to its predictive quality. Theorem 1 establishes that the error in state prediction, induced by estimated uncertainty, can be reduced down to $2\epsilon_a$ by reducing the sampling time $T_s$. In other words, we can accurately estimate the learning error $l(t,x(t),u(t))$, with the predictive accuracy being bounded only by the tunable parameter $\epsilon_a$.
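To illustrate the remark, the sketch below evaluates the bound obtained by chaining Equations (26), (35), (36), and (38), namely $2\epsilon_a + (1-\exp(\lambda_{\min}T_s))(\epsilon_l+\epsilon_a) + LT_s$, for decreasing sampling times. The values chosen for $\epsilon_a$, $\epsilon_l$, $L$, and $\lambda_{\min}$ are hypothetical and serve only to show the trend.

```python
import math

def prediction_error_bound(Ts, eps_a, eps_l, L, lam_min):
    """Bound from chaining Eqs. (26), (35), (36), and (38), valid for t >= Ts."""
    return 2.0 * eps_a + (1.0 - math.exp(lam_min * Ts)) * (eps_l + eps_a) + L * Ts

# Hypothetical constants, for illustration only.
eps_a, eps_l, L, lam_min = 0.05, 0.5, 10.0, -1.0

for Ts in (1e-1, 1e-2, 1e-3, 1e-4):
    bound = prediction_error_bound(Ts, eps_a, eps_l, L, lam_min)
    excess = bound - 2.0 * eps_a   # the part that vanishes as Ts -> 0
    print(f"Ts = {Ts:.0e}  bound = {bound:.6f}  excess over 2*eps_a = {excess:.6f}")
```

Under these assumed constants the excess over $2\epsilon_a$ shrinks roughly in proportion to $T_s$, which is the behavior the remark describes.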

Appendix D Extended Simulation Results

D.1 Experiment setup

We provide the dimensionality of the selected environments for our simulation analysis in Table 3. For $\mathcal{L}_1$-METRPO, the number of iterations for each environment was chosen to obtain asymptotic performance, whereas for $\mathcal{L}_1$-MBMF we fixed the number of iterations to 200K. This setup reflects the structure of MBMF, which uses MBRL only to initialize the policy; MFRL is then executed until the performance reaches its asymptotic level. See Appendix D.2.2 for a detailed explanation of MBMF.

Table 3: Dimensions of the state and action spaces of the environments used in the simulations.

| Environment Name | State space dimension $(n)$ | Action space dimension $(m)$ |
|---|---|---|
| Inverted Pendulum | 4 | 1 |
| Swimmer | 8 | 2 |
| Hopper | 11 | 3 |
| Walker | 17 | 6 |
| Halfcheetah | 17 | 6 |

We adopted the hyperparameters that have been reported to be effective by the baseline MBRL, which in our case are METRPO and MBMF (Wang et al., 2019, Appendix B.4, B.5). Additional hyperparameters introduced by the $\mathcal{L}_1$-MBRL scheme are the affinization threshold $\epsilon$, the cutoff frequency $\omega$, and the Hurwitz matrix $A_s$. Throughout all experiments, we fixed $A_s$ as the negative identity matrix $-\mathbb{I}_n$. For the Inverted Pendulum environment, we set $\epsilon = 1$ and for Halfcheetah $\epsilon = 3$, while for the other environments, we chose $\epsilon = 0.3$. Additionally, we selected a cutoff frequency of $\omega = 0.35/T_s$, where $T_s$ represents the sampling time interval of the environment. It is important to note that the $\mathcal{L}_1$ controller was not redesigned or retuned across the experiments.
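The settings above can be collected in a small configuration helper. This is only a sketch: the function name, dictionary keys, and environment identifiers are hypothetical, and the sampling time `Ts` must be supplied by the user because its value per environment is not stated here.

```python
import numpy as np

# Affinization threshold per environment, as listed in the text above.
EPSILON = {
    "InvertedPendulum": 1.0,
    "Halfcheetah": 3.0,
    "Swimmer": 0.3,
    "Hopper": 0.3,
    "Walker": 0.3,
}

# State dimension n per environment, taken from Table 3.
STATE_DIM = {
    "InvertedPendulum": 4,
    "Swimmer": 8,
    "Hopper": 11,
    "Walker": 17,
    "Halfcheetah": 17,
}

def l1_hyperparameters(env_name, Ts):
    """Return the L1 add-on hyperparameters for one environment (hypothetical helper)."""
    n = STATE_DIM[env_name]
    return {
        "epsilon": EPSILON[env_name],   # affinization threshold
        "omega": 0.35 / Ts,             # low-pass filter cutoff frequency
        "A_s": -np.eye(n),              # Hurwitz matrix, fixed to -I_n
    }

params = l1_hyperparameters("Hopper", Ts=0.01)  # Ts = 0.01 is a placeholder value
```

Keeping the three quantities in one place makes it explicit that only $\epsilon$ varies between environments, while $\omega$ follows from $T_s$ and $A_s$ follows from the state dimension.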

D.2 Technical Remarks

If the baseline algorithm employs any data processing techniques such as input/output normalization, as discussed briefly in Section 3.1, our state predictor and controller (Equations (12a) and (12c)) must apply the same processing.

D.2.1 METRPO

METRPO trains an ensemble model, from which fictitious samples are generated. Then, the policy network is updated following TRPO (Schulman et al., 2015) in the policy improvement step. The input and output of the neural network are normalized during the training step, and consequently, the Jacobian computed in $\mathcal{L}_1$-METRPO must be unnormalized. Specifically, this is done by applying the chain rule, which scales the normalized Jacobian matrix ($J'$) by the standard deviations of the outputs and the inverse standard deviations of the inputs, as given by Equation (39):

$$J = D_{\Delta x} J' D_{x,u}^{-1}, \tag{39}$$

where $D_{\Delta x} = \texttt{diag}\{\sigma_{\Delta x_1},\ldots,\sigma_{\Delta x_n}\}$ and $D_{x,u} = \texttt{diag}\{\sigma_{x_1},\ldots,\sigma_{x_n},\sigma_{u_1},\ldots,\sigma_{u_m}\}$.

This unnormalized Jacobian ($J$) is subsequently used to generate the $\mathcal{L}_1$ adaptive control output.
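The following is a minimal sketch of Equation (39), not the authors' implementation. It assumes the network is trained on standardized variables, so that the physical prediction is $\Delta x = D_{\Delta x}\, g\big(D_{x,u}^{-1}(z - \mu_{x,u})\big) + \mu_{\Delta x}$ with $z = (x, u)$ and $g$ the normalized network; the chain rule then yields $J = D_{\Delta x} J' D_{x,u}^{-1}$, which is an elementwise rescaling because both matrices are diagonal. All names (`unnormalize_jacobian`, `normalized_model`, the toy weights `W`, and the statistics) are hypothetical and chosen only for illustration.

```python
# Sketch of Equation (39): J = D_dx @ J' @ D_xu^{-1} with diagonal scalings.
# All names and numbers here are illustrative assumptions.


def unnormalize_jacobian(j_norm, sigma_out, sigma_in):
    """Rescale a Jacobian taken in normalized coordinates to physical units.

    j_norm    : list of rows, d(normalized output) / d(normalized input)
    sigma_out : standard deviations of the outputs (diagonal of D_dx)
    sigma_in  : standard deviations of the inputs  (diagonal of D_xu)
    """
    return [
        [s_out * entry / s_in for entry, s_in in zip(row, sigma_in)]
        for row, s_out in zip(j_norm, sigma_out)
    ]


# Toy stand-in for a learned model: linear in normalized coordinates, so its
# normalized Jacobian J' is the weight matrix W (n = 2 states, m = 1 input).
W = [[0.5, -1.0, 2.0],
     [1.5, 0.25, -0.75]]


def normalized_model(z_norm):
    return [sum(w * z for w, z in zip(row, z_norm)) for row in W]


def physical_model(z, mu_in, sigma_in, mu_out, sigma_out):
    """Normalize the input, apply the model, then unnormalize the output."""
    z_norm = [(zi - m) / s for zi, m, s in zip(z, mu_in, sigma_in)]
    y_norm = normalized_model(z_norm)
    return [s * y + m for y, m, s in zip(y_norm, mu_out, sigma_out)]


def finite_difference_jacobian(f, z, eps=1e-6):
    """Central-difference Jacobian of f at z, used only as a sanity check."""
    out_dim = len(f(z))
    jac = [[0.0] * len(z) for _ in range(out_dim)]
    for j in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[j] += eps
        z_minus[j] -= eps
        f_plus, f_minus = f(z_plus), f(z_minus)
        for i in range(out_dim):
            jac[i][j] = (f_plus[i] - f_minus[i]) / (2.0 * eps)
    return jac


mu_in, sigma_in = [0.1, -0.2, 0.0], [4.0, 0.5, 2.0]
mu_out, sigma_out = [0.05, -0.3], [2.0, 0.5]

# Analytic unnormalization versus a numerical derivative of the full pipeline.
J = unnormalize_jacobian(W, sigma_out, sigma_in)
J_fd = finite_difference_jacobian(
    lambda z: physical_model(z, mu_in, sigma_in, mu_out, sigma_out),
    [1.0, 0.5, -0.25],
)
```

The means $\mu_{x,u}$ and $\mu_{\Delta x}$ shift the inputs and outputs but do not enter the Jacobian, which is why only the standard deviations appear in Equation (39).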

D.2.2 MBMF

In the MBMF algorithm (Nagabandi et al., 2018), the authors begin by training a Random Shooting (RS) controller. This controller is then distilled into a neural network policy using the supervised framework DAgger (Ross et al., 2011), which minimizes the KL divergence loss between the neural network policy and the RS controller. Then, the policy is fine-tuned using standard model-free algorithms such as TRPO (Schulman et al., 2015) or PPO (Schulman et al., 2017). We adopt an approach similar to the one used for METRPO: the Jacobian matrix of the neural network is unnormalized based on Equation (39), and the RS controller is augmented with the adaptive controller, which uses the most recently trained model.
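To make the augmentation of the RS controller concrete, the sketch below shows a generic random shooting controller whose output is summed with an additive correction term. It is an illustration under stated assumptions rather than the MBMF or $\mathcal{L}_1$-MBMF implementation: the action is scalar, the dynamics `toy_model` and reward `toy_reward` are hypothetical, and `adaptive_term` is a placeholder callable standing in for the $\mathcal{L}_1$ adaptive control input, which is not implemented here.

```python
import random


def random_shooting_action(state, model, reward, horizon=3, n_candidates=500,
                           u_min=-1.0, u_max=1.0, rng=None):
    """Return the first action of the best randomly sampled action sequence."""
    if rng is None:
        rng = random.Random(0)
    best_return, best_first_action = float("-inf"), 0.0
    for _ in range(n_candidates):
        actions = [rng.uniform(u_min, u_max) for _ in range(horizon)]
        x, total = state, 0.0
        for u in actions:
            x_next = model(x, u)          # roll out through the learned model
            total += reward(x, u, x_next)
            x = x_next
        if total > best_return:
            best_return, best_first_action = total, actions[0]
    return best_first_action


def augmented_action(state, model, reward, adaptive_term, **rs_kwargs):
    """Sum the RS action and a user-supplied additive correction.

    `adaptive_term` is a hypothetical placeholder for the adaptive control
    input; it receives the state and the RS action and returns a scalar.
    """
    u_rs = random_shooting_action(state, model, reward, **rs_kwargs)
    return u_rs + adaptive_term(state, u_rs)


# Toy scalar example (hypothetical dynamics and reward, for illustration only).
def toy_model(x, u):
    return x + u


def toy_reward(x, u, x_next):
    return -(x_next ** 2)


def zero_correction(x, u):
    return 0.0


u_rs = random_shooting_action(1.0, toy_model, toy_reward, rng=random.Random(0))
u_total = augmented_action(1.0, toy_model, toy_reward, zero_correction,
                           rng=random.Random(0))
```

With a zero correction the augmented action coincides with the RS action, so any difference observed in practice comes entirely from the added term.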

D.3 Experiment results

In this section, we first present the results of $\mathcal{L}_1$-MBMF in comparison to MBMF without $\mathcal{L}_1$ augmentation. The corresponding tabular results are summarized in Table 4. Notably, $\mathcal{L}_1$ augmentation improves the MBMF algorithm in every environment and noise setting.

Table 4: Performance comparison between MBMF and $\mathcal{L}_1$-MBMF (Ours). The performance is averaged across multiple random seeds with a window size of 5000 timesteps at the end of the training. Higher performance is written in bold.

| Env. | MB-MF (noise-free) | $\mathcal{L}_1$-MB-MF (noise-free) | MB-MF ($\sigma_a = 0.1$) | $\mathcal{L}_1$-MB-MF ($\sigma_a = 0.1$) | MB-MF ($\sigma_o = 0.1$) | $\mathcal{L}_1$-MB-MF ($\sigma_o = 0.1$) |
|---|---|---|---|---|---|---|
| Inv. P. | $-100.5 \pm 4.3$ | $\mathbf{-10.5 \pm 3.7}$ | $-7.4 \pm 1.5$ | $\mathbf{-4.8 \pm 1.9}$ | $-10.2 \pm 2.4$ | $\mathbf{-5.09 \pm 1.6}$ |
| Swimmer | $284.9 \pm 25.1$ | $\mathbf{314.3 \pm 3.3}$ | $304.8 \pm 1.9$ | $\mathbf{314.5 \pm 0.6}$ | $292.8 \pm 1.3$ | $\mathbf{294.3 \pm 4.3}$ |
| Hopper | $-1047.4 \pm 1098.7$ | $\mathbf{350.1 \pm 465.2}$ | $-877.9 \pm 383.4$ | $\mathbf{-285.4 \pm 65.3}$ | $-996.9 \pm 206.0$ | $\mathbf{-171.5 \pm 317.3}$ |
| Walker | $-1743.7 \pm 233.3$ | $\mathbf{-1481.7 \pm 322.9}$ | $-2962.2 \pm 178.6$ | $\mathbf{-2447.4 \pm 329.7}$ | $-3348.8 \pm 210.1$ | $\mathbf{-2261.4 \pm 381}$ |
| Halfcheetah | $126.9 \pm 72.7$ | $\mathbf{304.5 \pm 56.0}$ | $184.0 \pm 148.9$ | $\mathbf{299.8 \pm 61.0}$ | $146.1 \pm 87.8$ | $\mathbf{235.2 \pm 19.2}$ |
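The table caption does not spell out the exact aggregation procedure, so the sketch below shows only one plausible reading of it: for each seed, average the returns logged within the final 5000-timestep window, then report the mean and standard deviation of those per-seed averages. The function name `final_window_score`, the logging layout, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def final_window_score(returns_per_seed, timesteps, window=5000):
    """Summarize end-of-training performance across seeds.

    returns_per_seed : one list of logged returns per random seed
    timesteps        : timestep at which each return was logged (shared by seeds)
    window           : width of the final averaging window, in timesteps

    Returns (mean, std) of the per-seed final-window averages. Using the
    population standard deviation is an assumption, not taken from the paper.
    """
    t_end = timesteps[-1]
    per_seed = []
    for returns in returns_per_seed:
        # Keep only the returns logged inside the last `window` timesteps.
        tail = [r for t, r in zip(timesteps, returns) if t > t_end - window]
        per_seed.append(mean(tail))
    return mean(per_seed), pstdev(per_seed)


# Toy usage with two seeds logged every 1000 timesteps (made-up numbers).
steps = [1000 * k for k in range(1, 11)]
seed_a = [0.0] * 5 + [10.0] * 5
seed_b = [0.0] * 5 + [20.0] * 5
avg, std = final_window_score([seed_a, seed_b], steps)
```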

Additionally, we provide detailed tabular values corresponding to the results shown in Fig. 3. Table 5 summarizes the scenarios in which $\mathcal{L}_1$ control is applied only during training or only during testing. The application of $\mathcal{L}_1$ control during the testing phase clearly benefits from the explicit rejection of system uncertainty, leading to performance improvement. In contrast, when $\mathcal{L}_1$ control is applied during the training phase, it not only mitigates uncertainty along the trajectory but also implicitly affects the training process by inducing a shift in the distribution of the training dataset. This study compares these two types of impact brought about by the $\mathcal{L}_1$ augmentation.

Table 5: Comparison of $\mathcal{L}_1$ augmentation effects during training and testing. $\mathcal{L}_1$-METRPO (Train) refers to the application of $\mathcal{L}_1$ augmentation solely during training, whereas $\mathcal{L}_1$-METRPO (Test) indicates training without $\mathcal{L}_1$ augmentation and the application of $\mathcal{L}_1$ only during testing.

| Env. | $\mathcal{L}_1$-METRPO (Train), noise-free | $\mathcal{L}_1$-METRPO (Test), noise-free | $\mathcal{L}_1$-METRPO (Train), $\sigma_a = 0.1$ | $\mathcal{L}_1$-METRPO (Test), $\sigma_a = 0.1$ | $\mathcal{L}_1$-METRPO (Train), $\sigma_o = 0.1$ | $\mathcal{L}_1$-METRPO (Test), $\sigma_o = 0.1$ |
|---|---|---|---|---|---|---|
| Inv. P. | $\mathbf{-8.50 \pm 20.75}$ | $-19.36 \pm 22.3$ | $\mathbf{-3.52 \pm 8.08}$ | $-49.72 \pm 47.34$ | $-41.63 \pm 19.11$ | $\mathbf{-37.00 \pm 39.07}$ |
| Swimmer | $332.6 \pm 1.3$ | $332.6 \pm 1.6$ | $\mathbf{321.8 \pm 1.0}$ | $298.9 \pm 3.1$ | $32.9 \pm 1.5$ | $\mathbf{52.0 \pm 8.7}$ |
| Hopper | $1201.2 \pm 90.8$ | $\mathbf{1269.9 \pm 752.9}$ | $771.1 \pm 49.8$ | $\mathbf{818.1 \pm 394.2}$ | $\mathbf{-931.7 \pm 15.4}$ | $-976.8 \pm 73.1$ |
| Walker | $-7.0 \pm 0.1$ | $\mathbf{-5.9 \pm 0.0}$ | $\mathbf{-6.5 \pm 0.3}$ | $-7.5 \pm 0.2$ | $\mathbf{-6.3 \pm 0.0}$ | $-10.4 \pm 0.2$ |
| Halfcheetah | $\mathbf{2706.2 \pm 1170.4}$ | $1921.56 \pm 821.34$ | $1834 \pm 434.87$ | $\mathbf{1957.5 \pm 581.6}$ | $987.90 \pm 435.90$ | $\mathbf{1022.1 \pm 619.8}$ |

Notably, there is no consistent trend regarding whether $\mathcal{L}_1$ control has a greater impact during the testing phase or the training phase. The primary conclusion drawn from this ablation study, in conjunction with Fig. 3, is that $\mathcal{L}_1$ augmentation yields the greatest benefits when applied to both training and testing. One possible explanation for this observation is that such consistent augmentation avoids a shift in the policy distribution, leading to the desired performance.

Next, in Fig. 5, we report the learning curves of the main result.

Figure 5: Plots of $\mathcal{L}_1$-METRPO learning curves as a function of episodic steps. The performance is averaged across multiple random seeds such that the solid lines indicate the average return at the corresponding timestep, and the shaded regions indicate one standard deviation.

Figure 6: Plots of $\mathcal{L}_1$-MBMF learning curves as a function of episodic steps. The evaluation of the performance is identical to that of $\mathcal{L}_1$-METRPO.

D.4 Comparison with Probabilistic Models

Figure 7: Plots of $\mathcal{L}_1$-DETS vs. PETS learning curves as a function of episodic steps.

Probabilistic models, as discussed in (Chua et al., 2018; Wang & Ba, 2020), offer a common approach in Reinforcement Learning (RL) to tackle model uncertainty. Our approach, centered on a robust controller, shares a similar spirit but differs in architecture. While previous works integrate uncertainty directly into decision-making, for example, through sampling-based Model Predictive Control (MPC) (Chua et al., 2018), our approach decouples the two steps: we address uncertainty by explicitly estimating and mitigating it based on the learned deterministic nominal dynamics, allowing the MBRL algorithm to operate as intended.

Recently, the authors in (Zheng et al., 2022) emphasized that the empirical success of probabilistic dynamic model ensembles is attributed to their Lipschitz-regularizing effect on the value functions. This observation led to the hypothesis that the key function of the ensemble is to regularize the Lipschitz constant of the value function, rather than to provide a probabilistic formulation. The authors showed that the predictive quality of deterministic models differs little from that of probabilistic (ensemble) models, leading to the conclusion that deterministic models can offer computational efficiency and practicality for many MBRL scenarios. In this context, our work exploits the practical advantages of using deterministic models, while the $\mathcal{L}_1$ adaptive controller accounts for the randomness present in the environment.

In Fig. 7, we report supplementary experiments comparing PETS and its deterministic counterpart, DETS, with $\mathcal{L}_1$ augmentation. The results were obtained using multiple random seeds and 200,000 timesteps in the Inverted Pendulum environment with an action noise of $\sigma_a = 0.3$, demonstrating that the deterministic model with $\mathcal{L}_1$ augmentation ($\mathcal{L}_1$-DETS) can outperform the probabilistic model approach (PETS). However, it is important to note that this comparison is specific to one environment. We refrain from making broad claims regarding the superiority of DETS over PETS without further in-depth analysis and experimentation. In conclusion, we express our intent to explore the development of $\mathcal{L}_1$-MBRL that can effectively work alongside probabilistic models, recognizing the potential advantages of both approaches.