
When to Sense and Control?
A Time-adaptive Approach for Continuous-Time RL

Lenart Treven, Bhavya Sukhija, Yarden As, Florian Dörfler, Andreas Krause
ETH Zurich, Switzerland
Correspondence to [email protected]
Abstract

Reinforcement learning (RL) excels in optimizing policies for discrete-time Markov decision processes (MDPs). However, various systems are inherently continuous in time, making discrete-time MDPs an inexact modeling choice. In many applications, such as greenhouse control or medical treatments, each interaction (measurement or switching of action) involves manual intervention and thus is inherently costly. Therefore, we generally prefer a time-adaptive approach with fewer interactions with the system. In this work, we formalize an RL framework, Time-adaptive Control & Sensing (TaCoS), that tackles this challenge by optimizing over policies that, besides the control, also predict the duration of its application. Our formulation results in an extended MDP that any standard RL algorithm can solve. We demonstrate that state-of-the-art RL algorithms trained on TaCoS drastically reduce the number of interactions compared to their discrete-time counterparts while retaining the same or improved performance and exhibiting robustness to the discretization frequency. Finally, we propose OTaCoS, an efficient model-based algorithm for our setting. We show that OTaCoS enjoys sublinear regret for systems with sufficiently smooth dynamics and empirically results in further sample-efficiency gains.

1 Introduction

Nearly all state-of-the-art RL algorithms (Schulman et al., 2017; Haarnoja et al., 2018; Lillicrap et al., 2015; Schulman et al., 2015) were developed for discrete-time MDPs. Nevertheless, continuous-time systems are ubiquitous in nature, appearing in robotics, biology, medicine, environment, and sustainability (cf. Spong et al., 2006; Jones et al., 2009; Lenhart and Workman, 2007; Panetta and Fister, 2003; Turchetta et al., 2022). Such systems can be naturally modeled with stochastic differential equations (SDEs), but computational approaches necessitate discretization. Furthermore, in many applications, obtaining measurements and switching actions is expensive. For instance, consider a fruit greenhouse or medical treatment recommendations. In both cases, each measurement (crop inspection, medical exam) or switching of actions (climate control, treatment adjustment) typically involves costly human intervention. Hence, minimizing such interactions with the underlying system is desirable. This underlying challenge is rarely addressed in the RL literature.

In practice, a time-equidistant discretization frequency is set, often manually, and adjusted to the underlying system's characteristic time scale. This is challenging, however, especially for unknown or uncertain systems and for systems with multiple dominant time scales (Engquist et al., 2007). Therefore, for many real-world applications, a single global control frequency is inadequate and wasteful. For example, in medicine, patient monitoring often requires higher-frequency interaction during the onset of illness and lower-frequency interaction as the patient recovers (Kaandorp and Koole, 2007).

In this work, we address this limitation of standard RL methods and propose a novel RL framework, Time-adaptive Control & Sensing (TaCoS). TaCoS reduces a general continuous-time RL problem with underlying SDE dynamics to an equivalent discrete-time MDP that can be solved with any RL algorithm, including standard policy gradient methods like PPO and SAC (Schulman et al., 2017; Haarnoja et al., 2018). We summarize our contributions below.

Contributions
  1. We reformulate the problem of time-adaptive continuous-time RL as an equivalent discrete-time MDP that can be solved with standard RL algorithms.

  2. Using our formulation, we extend standard policy gradient techniques (Haarnoja et al., 2018; Schulman et al., 2017) to the time-adaptive setting. Our empirical results on standard RL benchmarks (Freeman et al., 2021) show that TaCoS outperforms its discrete-time counterpart in terms of policy performance, computational cost, and sample efficiency.

  3. To further improve sample efficiency, we propose a model-based RL algorithm, OTaCoS. OTaCoS uses well-calibrated probabilistic models to capture epistemic uncertainty and, similar to Curi et al. (2020) and Treven et al. (2023), leverages the principle of optimism in the face of uncertainty to guide exploration during learning. We theoretically prove that OTaCoS is no-regret, i.e., its regret grows sublinearly, and empirically demonstrate its sample efficiency.

2 Problem statement

We consider a general nonlinear continuous-time dynamical system with continuous state space $\mathcal{X} \subset \mathbb{R}^{d_{\bm{x}}}$ and action space $\mathcal{U} \subset \mathbb{R}^{d_{\bm{u}}}$. The underlying dynamics are governed by a (controllable) SDE:

$$d\bm{x}_t = \bm{f}^*(\bm{x}_t, \bm{u}_t)\,dt + \bm{g}^*(\bm{x}_t, \bm{u}_t)\,d\bm{B}_t. \tag{1}$$

Here $\bm{x}_t \in \mathcal{X}$ is the state at time $t$, $\bm{u}_t \in \mathcal{U}$ the control input, $\bm{f}^*, \bm{g}^*$ are unknown drift and diffusion functions, and $\bm{B}_t$ is a standard Brownian motion in $\mathbb{R}^{d_{\bm{B}}}$. Our goal is to find a control policy $\bm{\pi}$ which maximizes an unknown reward $b^*(\bm{x}_t, \bm{u}_t)$ over a fixed horizon $\mathcal{T} = [0, T]$, i.e.,

$$\max_{\bm{\pi} \in \Pi} \mathbb{E}\left[\int_{t \in \mathcal{T}} b^*(\bm{x}_t, \bm{\pi}(\bm{x}_t))\,dt\right],$$

where the expectation is taken w.r.t. the policy and stochastic dynamics, and $\Pi$ is the class of policies over which we search. (Footnote 1: We assume that $\Pi$ is the set of $L_{\bm{\pi}}$-Lipschitz policies.)

In practice, we can only measure the system state and execute control policies at discrete points in time. In this work, we focus on problems where state measurement and control are synchronized in time. We refer to these synchronized time points as interactions in the remainder of this paper. Synchronizing state measurement and control contrasts with standard time-adaptive approaches such as event-triggered control (Heemels et al., 2021), where the state is measured at an arbitrarily high frequency and control inputs are changed only as often as needed to ensure stability. It also contrasts with the complementary setting, where control inputs change at an arbitrarily high frequency but measurements are collected adaptively in time (Treven et al., 2023). An adaptive control approach as in Heemels et al. (2021) is very important for many real-world applications; similarly, an adaptive measurement strategy is crucial for efficient learning in RL (Treven et al., 2023). Our approach treats both of these requirements jointly.

We consider two different scenarios for continuous-time control: (i) Penalizing interactions with some cost, (ii) bounded number of interactions, i.e., hard constraint on control/measurement steps.

Interaction cost

We consider the setting where every interaction has an inherent cost $c(\bm{x}_t, \bm{u}_t) > 0$. We consider this cost structure for its simplicity; TaCoS also works for more general cost functions that depend on the duration of application of the action $\bm{u}_t$ or on the previous action $\bm{u}_{t-1}$, and thus captures many practical real-world settings. We define this task more formally below:

$$\max_{\bm{\pi} \in \Pi,\, \pi_{\mathcal{T}}} \mathbb{E}\left[\sum_{i=0}^{K-1} \int_{t_{i-1}}^{t_i} b^*(\bm{x}_t, \bm{\pi}(\bm{x}_{t_{i-1}}))\,dt - c(\bm{x}_{t_{i-1}}, \bm{\pi}(\bm{x}_{t_{i-1}}))\right], \tag{2}$$
$$t_i = \pi_{\mathcal{T}}(\bm{x}_{t_{i-1}}) + t_{i-1}, \quad t_0 = 0, \quad t_K = T, \quad \forall \bm{x} \in \mathcal{X};\ \pi_{\mathcal{T}}(\bm{x}) \in [t_{\min}, t_{\max}].$$

Here $t_{\min} > 0$ is the minimal duration for which we have to apply the control, $t_{\max} \in [t_{\min}, T]$ the maximum duration, and $\pi_{\mathcal{T}}$ is a policy that predicts the duration of applying the action.
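The recursion for the interaction times in the constraint above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `duration_policy` and `state_at` are hypothetical stand-ins for the duration policy and for the observed state at an interaction time, and truncating the final duration so that the last interaction time equals the horizon is one possible convention (it may make the last interval shorter than the minimal duration).

```python
from typing import Callable, List


def interaction_times(
    duration_policy: Callable[[float], float],  # hypothetical: state -> proposed duration
    state_at: Callable[[float], float],         # hypothetical: time -> observed state
    horizon: float,
    t_min: float,
    t_max: float,
) -> List[float]:
    """Build t_0 = 0 < t_1 < ... < t_K = horizon from a duration policy."""
    times = [0.0]
    while times[-1] < horizon:
        t_prev = times[-1]
        tau = duration_policy(state_at(t_prev))
        tau = min(max(tau, t_min), t_max)  # enforce tau in [t_min, t_max]
        remaining = horizon - t_prev
        if tau >= remaining:
            # Assumed convention: cut the last interval so that t_K = horizon.
            times.append(horizon)
        else:
            times.append(t_prev + tau)
    return times
```

The number of interactions is then `len(times) - 1`, which is the quantity that the interaction cost penalizes.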

Bounded number of interactions

In this setting, the number of interactions with the system is limited by a known amount $K$. Intuitively, this represents a scenario where we have a finite budget for the inputs that we can apply and have to decide on the best strategy to space these $K$ inputs over the full horizon. A formal definition of this task is given below:

$$\max_{\bm{\pi} \in \Pi,\, \pi_{\mathcal{T}}} \mathbb{E}\left[\sum_{i=0}^{K-1} \int_{t_{i-1}}^{t_i} b^*(\bm{x}_t, \bm{\pi}(\bm{x}_{t_{i-1}}))\,dt\right], \tag{3}$$
$$t_i = \pi_{\mathcal{T}}(\bm{x}_{t_{i-1}}) + t_{i-1}, \quad t_0 = 0, \quad t_K = T, \quad \forall \bm{x} \in \mathcal{X};\ \pi_{\mathcal{T}}(\bm{x}) \in [t_{\min}, t_{\max}].$$

In the absence of the interaction costs or the bound on the number of interactions, intuitively the policy would propose to interact with the system as frequently as possible, i.e., every $t_{\min}$ seconds. The additional costs/constraints ensure that we do not converge to this trivial (but unrealistic) solution.

3 TaCoS: Time Adaptive Control or Sensing

In the following, we reformulate the continuous-time problem as an equivalent discrete-time MDP. We first define the state and running reward flows of Equation 1. The state flow obtained by applying action $\bm{u}_k$ for a duration $t_k$ reads:

$$\bm{x}_{k+1} = \bm{\Xi}(\bm{x}_k, \bm{u}_k, t_k),$$
$$\bm{\Xi}(\bm{x}, \bm{u}, t) \overset{\text{def}}{=} \bm{x} + \int_0^t \bm{f}^*(\bm{x}_s, \bm{u})\,ds + \int_0^t \bm{g}^*(\bm{x}_s, \bm{u})\,d\bm{B}_s.$$

We assume that every time we interact with the system, we also obtain the integrated reward and define the reward flow as

$$\Xi_{b^*}(\bm{x}, \bm{u}, t) = \int_0^t b^*\left(\bm{\Xi}(\bm{x}, \bm{u}, s), \bm{u}\right) ds. \tag{4}$$
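For intuition, one sample of the state flow and the reward flow can be approximated numerically by holding the action constant and integrating with an Euler–Maruyama scheme. The sketch below is an assumption-laden illustration for a scalar system, not the simulator used in the experiments: `drift`, `diffusion`, and `reward` are hypothetical callables standing in for the unknown functions, and the step size `dt` is an arbitrary choice.

```python
import math
import random
from typing import Callable, Optional, Tuple


def flows(
    x: float,
    u: float,
    t: float,
    drift: Callable[[float, float], float],      # stand-in for f*
    diffusion: Callable[[float, float], float],  # stand-in for g*
    reward: Callable[[float, float], float],     # stand-in for b*
    dt: float = 1e-3,
    rng: Optional[random.Random] = None,
) -> Tuple[float, float]:
    """Return one sample of (state flow, reward flow) for a constant action u held for time t."""
    if rng is None:
        rng = random.Random(0)
    n_steps = max(1, math.ceil(t / dt))
    h = t / n_steps  # actual step size, so that n_steps * h == t
    integrated_reward = 0.0
    for _ in range(n_steps):
        integrated_reward += reward(x, u) * h  # left Riemann sum of the running reward
        dB = rng.gauss(0.0, math.sqrt(h))      # Brownian increment over a step of length h
        x = x + drift(x, u) * h + diffusion(x, u) * dB
    return x, integrated_reward
```

With zero diffusion the scheme reduces to forward Euler, which gives a simple way to check it against a closed-form solution.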

Due to the stochasticity of $(\bm{B}_t)_{t \in \mathcal{T}}$, the state flow $\bm{\Xi}(\bm{x}, \bm{u}, t)$ and the reward flow $\Xi_{b^*}(\bm{x}, \bm{u}, t)$ are stochastic. For ease of notation, we denote

$$\bm{\Phi}_{\bm{f}^*}(\bm{x}_k, \bm{u}_k, t_k) \overset{\text{def}}{=} \mathbb{E}\left[\bm{\Xi}(\bm{x}_k, \bm{u}_k, t_k)\right], \quad \Phi_{b^*}(\bm{x}_k, \bm{u}_k, t_k) \overset{\text{def}}{=} \mathbb{E}\left[\Xi_{b^*}(\bm{x}_k, \bm{u}_k, t_k)\right]$$
$$\bm{w}_k^{\bm{x}} \overset{\text{def}}{=} \bm{\Xi}(\bm{x}_k, \bm{u}_k, t_k) - \bm{\Phi}_{\bm{f}^*}(\bm{x}_k, \bm{u}_k, t_k), \quad w_k^{b^*} \overset{\text{def}}{=} \Xi_{b^*}(\bm{x}_k, \bm{u}_k, t_k) - \Phi_{b^*}(\bm{x}_k, \bm{u}_k, t_k),$$

and the concatenated state and reward flow function, and noise as:

$$\bm{\Phi}^*(\bm{x}_k, \bm{u}_k, t_k) = \begin{pmatrix} \bm{\Phi}_{\bm{f}^*}(\bm{x}_k, \bm{u}_k, t_k) \\ \Phi_{b^*}(\bm{x}_k, \bm{u}_k, t_k) \end{pmatrix}, \quad \bm{w}_k = \begin{pmatrix} \bm{w}_k^{\bm{x}} \\ w_k^{b^*} \end{pmatrix}. \tag{5}$$
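To make the split into a mean flow and a noise term concrete, consider a toy scalar linear SDE. This system is an assumption chosen for illustration and does not appear in the paper; the parameters $a \neq 0$ and $\sigma$ are hypothetical. Holding the action $u$ constant for a duration $t$, the state flow, its mean, and the resulting noise can be written in closed form:

```latex
% Assumed toy system: dx_s = (a x_s + u) ds + \sigma dB_s, with a \neq 0 and u held constant on [0, t].
\begin{align*}
\Xi(x, u, t) &= e^{a t} x + \frac{e^{a t} - 1}{a}\, u + \sigma \int_0^t e^{a (t - s)}\, dB_s, \\
\Phi_{f^*}(x, u, t) &= \mathbb{E}\left[\Xi(x, u, t)\right] = e^{a t} x + \frac{e^{a t} - 1}{a}\, u, \\
w^{x} &= \Xi(x, u, t) - \Phi_{f^*}(x, u, t) = \sigma \int_0^t e^{a (t - s)}\, dB_s
  \sim \mathcal{N}\!\left(0,\; \sigma^2\, \frac{e^{2 a t} - 1}{2 a}\right).
\end{align*}
```

In this example the noise has zero mean, but its variance depends on the chosen duration $t$, so the noise of the discrete-time transition is in general not identically distributed across interactions.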

In this work, we search for policies that return the next control we apply and also the time for how long to apply the control.

3.1 Reformulation of Interaction Cost setting to Discrete-time MDPs

(a) We add a constant switch cost of 0.1 and significantly reduce the number of interactions from 200 to 24. Initially, the policy applies maximal bang-bang torque for longer times, until the pendulum reaches the top. On the top, we measure and change the controller at a higher frequency in order to keep the pendulum stable, at the position with the highest reward.
(b) We set a tight bound of $K = 5$ for the number of interactions and observe that we can still solve the task.
Figure 1: Experiment on the Pendulum environment for the average cost and a bounded number of switches setting.

We convert the problem with interaction costs to a standard MDP which any RL algorithm for continuous state-action spaces can solve. To this end, we restrict ourselves to a policy class:

$$\Pi_{TC} = \left\{ \bm{\pi}: \mathcal{X} \times \mathcal{T} \to \mathcal{U} \times \mathcal{T} \mid \pi_{\mathcal{T}}(\cdot, t) \in [t_{\min}, t_{\max}],\ \bm{\pi} \text{ is } L_{\bm{\pi}}\text{-Lipschitz} \right\}.$$

For simplicity, we denote by $\pi_{\mathcal{T}}$ the component of the policy that predicts the duration of applying the action. The policies we consider map the state $\bm{x}$ and time-to-go $t$ to a control $\bm{u}$ and the time $\tau$ for how long we apply the action $\bm{u}$. We define the augmented state $\bm{s} = (\bm{x}, b, t)$, where $\bm{x}$ is the state, $b$ the integrated reward, and $t$ the time-to-go. With the introduced notation, we arrive at a discrete-time MDP problem formulation

$$\max_{\bm{\pi} \in \Pi_{TC}} V_{\bm{\pi}, \bm{\Phi}^*}(\bm{x}_0, T) = \max_{\bm{\pi} \in \Pi_{TC}} \mathbb{E}\left[\sum_{k=0}^{K-1} r(\bm{s}_k, \bm{\pi}(\bm{s}_k))\right] \tag{6}$$
$$\text{s.t.} \quad \bm{s}_{k+1} = \bm{\Psi}_{\bm{\Phi}^*}(\bm{s}_k, \bm{\pi}(\bm{s}_k), \bm{w}_k), \quad \bm{s}_0 = (\bm{x}_0, 0, T), \quad \sum_{k=0}^{K-1} \pi_{\mathcal{T}}(\bm{x}_k, t_k) = T,$$

where we have

$$\bm{\Psi}_{\bm{\Phi}^*}(\bm{s}_k, \bm{\pi}(\bm{s}_k), \bm{w}_k) = \left(\bm{\Phi}^*(\bm{x}_k, \bm{\pi}(\bm{x}_k, t_k)) + \bm{w}_k,\ t_k - \pi_{\mathcal{T}}(\bm{x}_k, t_k)\right)$$
$$r(\bm{s}_k, \bm{\pi}(\bm{s}_k)) = \Xi_{b^*}(\bm{x}_k, \bm{u}_k, \pi_{\mathcal{T}}(\bm{x}_k, t_k)) - c(\bm{x}_k, \bm{\pi}(\bm{x}_k, t_k)).$$
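The extended MDP can be sketched as a small environment whose state is the augmented triple (state, integrated reward, time-to-go) and whose action is the pair (control, duration). The code below is a minimal illustration under several assumptions and is not the authors' implementation: the system is scalar, the interaction cost is a constant, the flows are approximated with Euler–Maruyama, the final duration is truncated so that the durations sum to the horizon, and the class name `TimeAdaptiveEnv` with its `reset`/`step` interface is hypothetical.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, Tuple

State = Tuple[float, float, float]  # (x, integrated reward b, time-to-go t)


@dataclass
class TimeAdaptiveEnv:
    drift: Callable[[float, float], float]      # stand-in for f*
    diffusion: Callable[[float, float], float]  # stand-in for g*
    reward: Callable[[float, float], float]     # stand-in for b*
    cost: float                                 # constant interaction cost (assumption)
    horizon: float
    t_min: float
    t_max: float
    dt: float = 1e-3
    seed: int = 0

    def __post_init__(self) -> None:
        self.rng = random.Random(self.seed)
        self.state: State = (0.0, 0.0, self.horizon)

    def reset(self, x0: float) -> State:
        self.state = (x0, 0.0, self.horizon)
        return self.state

    def step(self, action: Tuple[float, float]) -> Tuple[State, float, bool]:
        x, _, time_to_go = self.state
        u, tau = action
        tau = min(max(tau, self.t_min), self.t_max)  # duration in [t_min, t_max]
        tau = min(tau, time_to_go)                   # assumed: truncate at the horizon
        n_steps = max(1, math.ceil(tau / self.dt))
        h = tau / n_steps
        b = 0.0
        for _ in range(n_steps):                     # hold u constant for tau
            b += self.reward(x, u) * h
            noise = self.rng.gauss(0.0, math.sqrt(h))
            x += self.drift(x, u) * h + self.diffusion(x, u) * noise
        time_to_go -= tau
        done = time_to_go <= 1e-12
        self.state = (x, b, time_to_go)
        # Reward of the extended MDP: integrated reward minus interaction cost.
        return self.state, b - self.cost, done
```

Any off-the-shelf RL algorithm for continuous state-action spaces could in principle be trained on such an environment by treating the duration as one more action dimension; the summed per-step rewards then equal the integrated reward minus the cost times the number of interactions.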

3.2 Reformulation of Bounded Number of Interactions to Discrete-time MDPs

The second setting is similar to the one studied by Ni and Jang (2022). In this case, we consider the following class of policies:

$$
\Pi_{BT} = \left\{\bm{\pi}: \mathcal{X} \times \mathcal{T} \times \mathbb{N} \to \mathcal{U} \times \mathcal{T} \mid \forall k \in [K]: \bm{\pi}(\cdot, \cdot, k) \text{ is } L_{\bm{\pi}}\text{-Lipschitz}\right\}.
$$

For an augmented state $\bm{s} = (\bm{x}, b, t, k)$, our policies map states $\bm{x}$, time-to-go $t$, and number of past interactions $k$ to a controller $\bm{u}$ and the time duration $\tau$ for applying the action. Here the optimal control problem reads

$$
\begin{aligned}
\max_{\bm{\pi} \in \Pi_{BT}} V_{\bm{\pi}, \bm{\Phi}^*}(\bm{x}_0, T) = \max_{\bm{\pi} \in \Pi_{BT}} \;\; &\mathbb{E}\left[\sum_{k=0}^{K-1} r(\bm{s}_k, \bm{\pi}(\bm{s}_k))\right] \\
\text{s.t.} \quad &\bm{s}_{k+1} = \bm{\Psi}_{\bm{\Phi}^*}(\bm{s}_k, \bm{\pi}(\bm{s}_k), \bm{w}_k), \; \bm{s}_0 = (\bm{x}_0, 0, T, 0),
\end{aligned}
\tag{7}
$$

where,

$$
\begin{aligned}
\bm{\Psi}_{\bm{\Phi}^*}(\bm{s}_k, \bm{\pi}(\bm{s}_k), \bm{w}_k) &= \left(\bm{\Phi}^*(\bm{x}_k, \bm{\pi}(\bm{x}_k, t_k, k)) + \bm{w}_k,\; t_k - \pi_{\mathcal{T}}(\bm{x}_k, t_k, k),\; k + 1\right) \\
r(\bm{s}_k, \bm{\pi}(\bm{s}_k)) &= \Xi_{b^*}(\bm{x}_k, \bm{u}_k, \pi_{\mathcal{T}}(\bm{x}_k, t_k, k)).
\end{aligned}
$$
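The transition of the extended MDP can be sketched in code. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the scalar dynamics, the quadratic running reward, the Euler-Maruyama integrator, and the names `flow_step` and `augmented_step` are all hypothetical, and the unknown flow $\bm{\Phi}^*$ is replaced by numerical integration of a known drift.

```python
import math
import random


def flow_step(x, u, tau, drift, diffusion=0.0, rng=None, dt=1e-3):
    """Hold action u for duration tau; return (next state, integrated reward).

    Stand-in for the flow: Euler-Maruyama on a scalar SDE
    dx = drift(x, u) dt + diffusion dB. The running reward -(x**2)
    is a hypothetical choice made only for this example.
    """
    n_steps = max(1, round(tau / dt))
    h = tau / n_steps
    reward = 0.0
    for _ in range(n_steps):
        reward += -(x ** 2) * h
        noise = rng.gauss(0.0, 1.0) if rng is not None else 0.0
        x += drift(x, u) * h + diffusion * math.sqrt(h) * noise
    return x, reward


def augmented_step(s, policy, drift, K, diffusion=0.0, rng=None):
    """One transition of the extended MDP with state s = (x, b, t, k)."""
    x, _, t, k = s
    if k >= K:
        raise ValueError("interaction budget K exhausted")
    u, tau = policy(x, t, k)      # policy returns control and hold duration
    tau = min(tau, t)             # assumption: never run past the horizon
    x_next, b_next = flow_step(x, u, tau, drift, diffusion, rng)
    # New state: measurement, integrated reward, time-to-go, interaction count.
    return (x_next, b_next, t - tau, k + 1), b_next
```

Because the agent only sees the state at the end of each hold interval, any standard discrete-time RL algorithm can be run on `augmented_step` without modification; passing `rng=random.Random(seed)` together with a nonzero `diffusion` turns on the stochastic term.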

In the following, we provide a simple proposition which shows that our reformulated problem is equivalent to its continuous-time counterpart from Section 2.

Proposition 1.

The problems in Equations 2 and 3 are equivalent to those in Equations 6 and 7, respectively.

Figure 1 depicts the influence of interaction cost and $K$ on the controller’s performance for the pendulum environment.

4 TaCoS with Model-free RL Algorithms

We now illustrate the performance of TaCoS on several well-studied robotic RL tasks. We consider the RC car (Kabzan et al., 2020), Greenhouse (Tap, 2000), Pendulum, Reacher, Halfcheetah, and Humanoid environments from Brax (Freeman et al., 2021). Thus our experiments range from environments necessitating time-adaptive control, like the Greenhouse, through a realistic and highly dynamic race car simulation, to a very high-dimensional RL task like the Humanoid (with $\mathcal{X} \subset \mathbb{R}^{244}$, $\mathcal{U} \subset \mathbb{R}^{17}$). We provide our implementation at https://github.com/lasgroup/TaCoS.

We investigate both the bounded number of interactions and interaction cost settings in our experiments. In particular, we study how the bound $K$ affects the performance of TaCoS and compare it to the standard equidistant baseline. We further study the interplay between the stochasticity of the environments (magnitude of $\bm{g}^*$) and interaction costs, and the influence of $t_{\min}$ on TaCoS. For all experiments in this section, we combine SAC with TaCoS (SAC-TaCoS).

How does the bound on the number of interactions $K$ affect TaCoS?
Figure 2: We study the effects of the bound on interactions $K$ on the performance of the agent. TaCoS performs significantly better than equidistant discretization, especially for small values of $K$.

We analyze the bounded number of interactions setting (cf. Section 3.2) of TaCoS, studying the relationship between the number of interactions and the achieved episode reward. We compare our algorithm with the standard equidistant time discretization approach, which splits the whole horizon $T$ into $K$ discrete time steps of length $T/K$, at each of which an interaction takes place. We evaluate the two methods in the greenhouse and pendulum environments. For the pendulum, we consider the swing-up and swing-down tasks. The results are reported in Figure 2. The time-adaptive approach performs significantly better than the standard equidistant time discretization. This is particularly the case for the greenhouse and pendulum swing-down tasks. Both tasks involve driving the system to a stable equilibrium and thus, while high-frequency interaction might be necessary at the initial stages, a fairly low interaction frequency can be maintained once the system has reached the equilibrium state. This demonstrates the practical benefits of time-adaptive control.
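To make the comparison concrete, the sketch below contrasts the equidistant baseline with a time-adaptive schedule that uses the same interaction budget. It is only an illustration: the function names and the particular adaptive schedule are hypothetical and not taken from the experiments.

```python
def equidistant_durations(T, K):
    """Baseline: K interactions, each action held for T / K."""
    return [T / K] * K


def interaction_times(durations):
    """Start time of each interaction, given the hold durations."""
    times, t = [], 0.0
    for tau in durations:
        times.append(t)
        t += tau
    return times


# Hypothetical adaptive schedule for horizon T = 10 and budget K = 5:
# frequent interaction early on, long holds once the system has settled.
adaptive_durations = [0.5, 0.5, 1.0, 2.0, 6.0]
```

Both schedules cover the same horizon with five interactions, but the adaptive one concentrates four of them in the first four time units, which mirrors the behavior described for tasks that converge to a stable equilibrium.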

How does the interaction cost magnitude influence TaCoS?
Figure 3: Effect of interaction cost (first row) and environment stochasticity (second row) on the number of interactions and episode reward for the Pendulum and Greenhouse tasks.

We investigate the setting from Section 3.1 with interaction costs. In our experiments, we always pick a constant cost, i.e., $c(\bm{x}, \bm{u}) = C$. We study the influence of $C$ on the episode reward and on the number of interactions that the policy has with the system within an episode. We again evaluate this on the greenhouse and pendulum environments. For the pendulum, we consider the swing-up task. The results are presented in the first row of Figure 3. Noticeably, increasing $C$ reduces the number of interactions. The decrease is drastic for the greenhouse environment since it can be controlled with considerably fewer interactions without any effect on the performance. Generally, we observe that decreasing the number of interactions, that is, increasing $C$, also results in a slight decline in episode reward.
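The trade-off can be read off directly from the reward of the extended MDP. As a short worked step, with the constant cost $c(\bm{x}, \bm{u}) = C$ and writing $K_{\bm{\pi}}$ for the number of interactions in an episode (a symbol introduced here only for illustration), summing the per-step reward from Section 3.1 gives

$$
\sum_{k=0}^{K_{\bm{\pi}}-1} r(\bm{s}_k, \bm{\pi}(\bm{s}_k)) = \sum_{k=0}^{K_{\bm{\pi}}-1} \Xi_{b^*}(\bm{x}_k, \bm{u}_k, \pi_{\mathcal{T}}(\bm{x}_k, t_k)) - C \, K_{\bm{\pi}}.
$$

An additional interaction is therefore worthwhile only if it increases the first sum by more than $C$, which is consistent with the number of interactions shrinking as $C$ grows.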

How does environment stochasticity influence the number of interactions?

We analyze the influence of the environment’s stochasticity, i.e., the magnitude of the diffusion term $\bm{g}^*$, on the episode reward and the number of interactions of TaCoS. Intuitively, the more stochastic the environment, the more interactions we would require to stabilize the system. We again evaluate our method on the greenhouse and pendulum swing-up tasks. The results are reported in the second row of Figure 3. The results verify our intuition that more stochasticity in the environment generally leads to more interactions. However, we observe that the policy is still able to achieve high rewards for a wide range of magnitudes of $\bm{g}^*$. This showcases the robustness and adaptability of TaCoS to stochastic environments.

How does $t_{\min}$ influence TaCoS?
Figure 4: We compare the performance of TaCoS in combination with SAC and PPO with the standard SAC algorithm and SAC with more compute (SAC-MC) over a range of values for $t_{\min}$ (first row). In the second row, we plot the episode reward versus the physical time in seconds spent in the environment for SAC-TaCoS, SAC, and SAC-MC for a specific evaluation frequency $1/t_{\text{eval}}$. We exclude PPO-TaCoS in this plot as it, being on-policy, requires significantly more samples than the off-policy methods. While all methods perform equally well for standard discretization (denoted with $1/t^*$), our method is robust to interaction frequency and does not suffer a performance drop when we decrease $t_{\min}$.

As highlighted in Section 1, picking the right discretization for interactions is a challenging task. We show that TaCoS can naturally alleviate this issue and adaptively pick the frequency of interaction while also being more computationally and data-efficient. Moreover, we show that TaCoS is robust to the choice of $t_{\min}$, which represents the minimal duration an action has to be applied, i.e., its inverse is the highest frequency at which we can control the system. In this experiment, we consider SAC-TaCoS and compare it to the standard SAC algorithm. TaCoS adaptively picks the number of interactions and therefore, during an episode of time $T$, it effectively collects less data than the standard discrete-time RL algorithm (a standard RL algorithm would collect $T/t_{\min}$ data points per episode). This makes comparison to the discrete-time setting challenging since environment interactions and physical time on the environment are not linearly related for TaCoS, as opposed to the standard discrete-time setting. Nevertheless, to be fair to the discrete-time method, we give SAC more physical time on the system for all environments, effectively resulting in the collection of more data for learning. Since the standard SAC algorithm updates the policy in proportion to the amount of data collected, we consider a version of SAC, SAC-MC (SAC more compute), which leverages the additional data it collects to perform more gradient updates. This version essentially performs more policy updates than SAC-TaCoS and thus is computationally more expensive. Furthermore, to demonstrate the generality of our framework, we also combine TaCoS with PPO (PPO-TaCoS).
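The difference in collected data can be illustrated with a small calculation. The sketch below is an assumption-laden illustration with hypothetical function names and made-up numbers; it only counts transitions per episode and does not reproduce any experiment.

```python
def transitions_fixed_rate(T, t_min):
    """Transitions per episode when acting every t_min (standard discrete-time RL)."""
    return round(T / t_min)


def transitions_adaptive(durations, T, t_min):
    """Transitions per episode when each action is held for a chosen duration."""
    if any(tau < t_min for tau in durations):
        raise ValueError("every hold duration must be at least t_min")
    if abs(sum(durations) - T) > 1e-9:
        raise ValueError("hold durations must cover the horizon T")
    return len(durations)
```

For example, with $T = 10$ and $t_{\min} = 0.05$ the fixed-rate agent collects 200 transitions, whereas a hypothetical adaptive episode with 20 holds of length 0.05 followed by 9 holds of length 1 collects only 29 over the same physical time.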

We report the performance after convergence across different $t_{\min}$ in the first row of Figure 4. From our experiment, we conclude that SAC-TaCoS and PPO-TaCoS are robust to the choice of $t_{\min}$ and perform equally well when $t_{\min}$ is decreased, i.e., frequency is increased. This is in contrast to the standard RL methods, which have a significant drop in performance at high frequencies. This observation is also made in prior work (Hafner et al., 2019). Crucially, this highlights the sensitivity of the standard RL methods to the frequency of interaction. In the second row of Figure 4 we show the learning curve of the methods for a specific frequency $1/t_{\text{eval}}$. From the curve, we conclude that SAC-TaCoS achieves higher rewards with significantly less physical time on the environment. We believe this is because our method explores more efficiently (akin to Dabney et al., 2020; Eberhard et al., 2022), and also learns a much stronger/continuous-time representation of the underlying MDP.

Interestingly, at the default frequency used in the benchmarks, $1/t^*$, all methods perform similarly. However, slightly decreasing the frequency already leads to a drastic drop in performance for all methods. Intuitively, decreasing the frequency prevents us from performing the necessary fine-grained control and obtaining the highest performance.

While we have access to the optimal frequency $1/t^*$ for these benchmarks, for a general and unknown system it is very difficult to estimate this frequency. Furthermore, as we observe in our experiments, picking a very high frequency is also not an option when using standard RL algorithms. We believe this is where TaCoS excels as it adaptively picks the frequency of interaction, thereby relieving the problem designer of this decision.

5 Efficient Exploration for TaCoS via Model-Based RL

In this section, we propose a novel model-based RL algorithm for TaCoS called Optimistic TaCoS (OTaCoS). We analyze the episodic setting, where we interact with the system in episodes $n = 1, \ldots, N$. In episode $n$, we execute the policy $\bm{\pi}_n$, collect measurements and integrated rewards $(\bm{x}_{n,0}, b_{n,0}), \ldots, (\bm{x}_{n,k_n}, b_{n,k_n})$, and prepare the data $\mathcal{D}_n = \left\{(\bm{z}_{n,1}, \bm{y}_{n,1}), \ldots, (\bm{z}_{n,k_n}, \bm{y}_{n,k_n})\right\}$, where $\bm{z}_{n,i} = (\bm{x}_{n,i-1}, \bm{\pi}_n(\bm{x}_{n,i-1}))$ and $\bm{y}_{n,i} = (\bm{x}_{n,i}, b_{n,i})$. From the dataset $\mathcal{D}_{1:n} \overset{\text{def}}{=} \cup_{i \leq n} \mathcal{D}_i$ we build a model $\mathcal{M}_n$ for the unknown function $\bm{\Phi}^*$ such that it is well-calibrated in the sense of the following definition.

Definition 1 (Well-calibrated statistical model of $\bm{\Phi}^*$, Rothfuss et al. (2023)).

Let $\mathcal{Z} \overset{\text{def}}{=} \mathcal{X} \times \mathcal{U} \times \mathcal{T}$. We assume $\bm{\Phi}^* \in \bigcap_{n \geq 0} \mathcal{M}_n$ with probability at least $1 - \delta$, where the statistical model $\mathcal{M}_n$ is defined as

$$
\mathcal{M}_n \overset{\text{def}}{=} \left\{\bm{f}: \mathcal{Z} \to \mathbb{R}^{d_x + 1} \mid \forall \bm{z} \in \mathcal{Z}, \forall j \in \left\{1, \ldots, d_x + 1\right\}: \left\lvert \mu_{n,j}(\bm{z}) - f_j(\bm{z}) \right\rvert \leq \beta_n(\delta) \sigma_{n,j}(\bm{z})\right\}.
$$

Here, $\mu_{n,j}$ and $\sigma_{n,j}$ denote the $j$-th element in the vector-valued mean and standard deviation functions $\bm{\mu}_n$ and $\bm{\sigma}_n$ respectively, and $\beta_n(\delta) \in \mathbb{R}_{\geq 0}$ is a scalar function that depends on the confidence level $\delta \in (0, 1]$ and which is monotonically increasing in $n$.
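The membership condition in the definition can be checked numerically on a finite set of inputs. The sketch below is a hedged illustration: `in_confidence_set` is a hypothetical helper, the functions `f`, `mu`, and `sigma` are assumed to return lists of $d_x + 1$ floats, and checking finitely many points is weaker than the definition, which quantifies over all of $\mathcal{Z}$.

```python
def in_confidence_set(f, mu, sigma, beta, points):
    """Check |mu_j(z) - f_j(z)| <= beta * sigma_j(z) for all listed z and all j.

    f, mu, and sigma each map an input z to a list of d_x + 1 floats.
    Only the finite collection `points` is tested, so a True result is a
    necessary condition for membership in the model set, not a proof of it.
    """
    for z in points:
        for f_j, mu_j, sigma_j in zip(f(z), mu(z), sigma(z)):
            if abs(mu_j - f_j) > beta * sigma_j:
                return False
    return True
```

A larger `beta` widens the band around the mean and can only enlarge the set of functions that pass the check, which matches the role of $\beta_n(\delta)$ as a confidence scaling.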

Similar to model-based RL algorithms for the discrete-time setting (Kakade et al., 2020; Curi et al., 2020; Sukhija et al., 2024), we follow the principle of optimism in the face of uncertainty and select the policy $\bm{\pi}_n$ for both settings of TaCoS (cf. Sections 3.1 and 3.2) by solving:

$$
\bm{\pi}_n \overset{\text{def}}{=} \operatorname*{arg\,max}_{\bm{\pi} \in \Pi_{\square}} \max_{\bm{\Phi} \in \mathcal{M}_{n-1}} V_{\bm{\pi}, \bm{\Phi}}(\bm{x}_0, T),
\tag{8}
$$

where $\square \in \left\{TC, BT\right\}$ is the appropriate policy class from Section 3. Running OTaCoS for $N$ episodes, we measure the performance via the regret:

$$
R_N = \sum_{n=1}^{N} \bigl(V_{\bm{\pi}^*, \bm{\Phi}^*}(\bm{x}_0, T) - V_{\bm{\pi}_n, \bm{\Phi}^*}(\bm{x}_0, T)\bigr).
$$

Here $\bm{\pi}^*$ is the optimal policy from the class of policies we optimize over. Any kind of regret bound requires certain assumptions on the regularity of the underlying dynamics (1).
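Optimistic policy selection and the regret can be illustrated on a toy problem. The sketch below makes strong simplifying assumptions that are not part of OTaCoS: the policy class and the set of plausible models are finite, and the value $V_{\bm{\pi}, \bm{\Phi}}(\bm{x}_0, T)$ is available as a callable `value`. Since the control problem is a maximization, optimism here means maximizing jointly over policies and plausible models. All names are hypothetical.

```python
def optimistic_policy(policies, models, value):
    """Pick the policy whose best-case value over the plausible models is largest."""
    return max(policies, key=lambda pi: max(value(pi, phi) for phi in models))


def cumulative_regret(played, all_policies, true_model, value):
    """Sum over episodes of the gap between the best policy in the class
    and the policy actually played, both evaluated on the true model."""
    v_star = max(value(pi, true_model) for pi in all_policies)
    return sum(v_star - value(pi, true_model) for pi in played)
```

In this toy form, a policy that looks good under at least one plausible model gets played; if the true model disagrees, the episode contributes to the regret and, in the full algorithm, the collected data would shrink the model set for the next episode.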

Assumption 1 (Dynamics model).

Given any norm $\left\lVert \cdot \right\rVert$, we assume that the drift $\bm{f}^*$ and diffusion $\bm{g}^*$ are $L_{\bm{f}^*}$- and $L_{\bm{g}^*}$-Lipschitz continuous, respectively, with respect to the induced metric. We further assume $\sup_{\bm{z} \in \mathcal{Z}} \left\lVert \bm{g}^*(\bm{z}) \right\rVert_F \leq A$.

Assumption 1 ensures the existence of a solution to the SDE (1) under policy $\bm{\pi}_n$. To provide bounds on the performance of OTaCoS for the settings of Sections 3.1 and 3.2, we also need some assumptions on the noise and reward model.

Assumption 2 (Reward and noise model for Section 3.1 Setting).

Given any norm $\left\lVert \cdot \right\rVert$, we assume that the running reward $b$ is $L_b$-Lipschitz continuous with respect to the induced metric. We further assume boundedness of the reward, $0 \leq b^*(\bm{x}, \bm{u}) \leq B$, and of the interaction cost, $0 \leq c(\bm{x}, \bm{u}) \leq C$. The dynamics noise is independent and follows $\bm{w}_k^{\bm{x}} \sim \mathcal{N}\left(0, \sigma^2(\bm{x}_k, \bm{u}_k, t_k) I_{d_x}\right)$.

Assumption 3 (Reward and noise model for Section 3.2 Setting).

Given any norm $\left\lVert \cdot \right\rVert$, we assume that the running reward $b$ is $L_b$-Lipschitz continuous with respect to the induced metric. The measurement noise $\bm{w}_k^{\bm{x}}$ is independent and $\sigma$-sub-Gaussian.

Finally, we assume that we learn a well-calibrated model of the unknown flow $\bm{\Phi}^*$.

Assumption 4 (Well calibration assumption).

Our learned model is an all-time-calibrated statistical model of $\bm{\Phi}^*$, i.e., there exists an increasing sequence $\left(\beta_n(\delta)\right)_{n \geq 0}$ such that our model satisfies the well-calibration condition (cf. Definition 1).

Analogous assumptions are made for model-based RL algorithms in the discrete-time setting (Curi et al., 2020; Sukhija et al., 2024). This calibration assumption is satisfied if $\bm{\Phi}^*$ can be represented with Gaussian Process (GP) models (Williams and Rasmussen, 2006; Kirschner and Krause, 2018).

Theorem 2.

Consider the setting from Section 3.1 and let Assumptions 1, 2, and 4 hold. Then we have with probability at least $1 - \delta$:

$$
R_N \leq \mathcal{O}\left(\beta_{N-1} T^{3/2} \sqrt{N \mathcal{I}_N}\right).
$$

Now consider the setting with a bounded number of switches $K$, and let Assumptions 1, 3, and 4 hold. Then we get with probability at least $1 - \delta$:

$$
R_N \leq \mathcal{O}\left(\beta_{N-1}^{K} K e^{D (L_{\bm{f}^*} + L_{\bm{g}^*}^2)(1 + L_{\bm{\pi}}) T K} \sqrt{N \mathcal{I}_N}\right),
$$

where $D$ is a constant. Here, $\mathcal{I}_N$ denotes the model complexity after observing $N$ points (Curi et al., 2020), which quantifies the difficulty of learning $\bm{\Phi}^*$. For GPs, it behaves similarly to the maximum information gain $\gamma_N$ (Srinivas et al., 2009), implying sublinear regret for several common kernels (Vakili et al., 2021).
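A short calculation, added here for illustration, shows when the first bound yields sublinear regret. Dividing by $N$ gives

$$
\frac{R_N}{N} \leq \mathcal{O}\left(\beta_{N-1} T^{3/2} \sqrt{\frac{\mathcal{I}_N}{N}}\right),
$$

so the bound on the average regret vanishes as $N \to \infty$ whenever $\beta_{N-1}^2 \mathcal{I}_N = o(N)$. For example, under the assumption (ours, not a statement of the theorem) that both $\beta_{N-1}$ and $\mathcal{I}_N$ grow at most polylogarithmically in $N$, the right-hand side decays as $\mathrm{polylog}(N) / \sqrt{N}$. The same argument applies to the second bound for fixed $K$, with $\beta_{N-1}^{2K} \mathcal{I}_N = o(N)$ as the corresponding condition.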

As a proof of concept, we evaluate OTaCoS on the pendulum and RC car environments for the interaction cost setting (the code is available at https://github.com/lasgroup/model-based-rl). As baselines, we adapt common model-based RL methods such as PETS (Chua et al., 2018) and planning with the mean to TaCoS. We call them PETS-TaCoS and Mean-TaCoS, respectively. The results are reported in Figure 5. From the figure, we conclude that OTaCoS is more sample efficient than the other model-based baselines and SAC-TaCoS (SAC-TaCoS requires circa $6000$ episodes for the pendulum and $2000$ for the RC car).

Figure 5: We run OTaCoS on the pendulum and RC car environments. We report the achieved reward averaged over five different seeds with one standard error.

6 Related Work

Similar to this work, Holt et al. (2023), Ni and Jang (2022), and Karimi (2023) consider continuous-time deterministic dynamical systems where measurements or control input changes can only happen at discrete time steps. Moreover, Holt et al. (2023) propose a problem similar to ours from Section 3.1, where they specify a cost on the number of interactions. However, their solution is based on a heuristic, where a measurement is taken when the variance of the potential reward surpasses a prespecified threshold. In contrast, we tackle the problem directly and propose a general framework for time-adaptive control that does not rely on any heuristics. Karimi (2023) adapts SAC (Haarnoja et al., 2018) to include a regularization term, which effectively adds a cost for every discrete interaction. Ni and Jang (2022) induce a soft constraint on the duration $\tau$ of each action in the environment. However, all the aforementioned works propose heuristic techniques to minimize interactions, whereas we formalize the problem systematically for the more general case of SDEs and show that it has an underlying MDP structure that any RL algorithm can leverage. In addition, we propose a no-regret model-based RL algorithm for this setting and analyze its sample complexity.

Temporal abstractions are also considered in the framework of options (Sutton et al., 1999; Mankowitz et al., 2014; Mann and Mannor, 2014; Harb et al., 2018). However, a key difference to TaCoS is that in the options framework, the agent measures the state even between controller switches.

Learning to repeat actions

Several works observe that repeating actions in discrete-time MDP problems such as Atari (Mnih et al., 2013; Braylan et al., 2015) or Cartpole (Hafner et al., 2019) significantly increases the speed of learning. However, the action repeat is fixed throughout the entire rollout and treated as a hyperparameter. Durugkar et al. (2016); Vezhnevets et al. (2016); Srinivas et al. (2017); Sharma et al. (2017); Lee et al. (2020); Grigsby et al. (2021); Chen et al. (2021); Yu et al. (2021); Biedenkapp et al. (2021); Krale et al. (2023) automate the selection of the action repeat and show superior performance over the fixed-repeat setting. Dabney et al. (2020) empirically show that repeating actions helps with exploration, with an effect similar to the one colored-noise exploration has over standard white-noise exploration (Eberhard et al., 2022).

Continuous-time RL

Following the seminal work of Doya (2000) and the advances in Neural ODEs of Chen et al. (2018), continuous-time RL has regained interest (Cranmer et al., 2020; Greydanus et al., 2019; Yildiz et al., 2021; Lutter et al., 2021). Moreover, modeling in continuous time is found to be particularly useful when learning from different data sources, each collected at a different frequency (Burns et al., 2023; Zheng et al., 2023). An important line of work models continuous-time dynamics for the case when states and actions are discrete, called Markov Jump Processes (Kallianpur and Sundar, 2014; Berger, 1993; Huang et al., 2019; Seifner and Sanchez, 2023). Another line of work that is close to ours is event- and self-triggered control (Astrom and Bernhardsson, 2002; Anta and Tabuada, 2010; Heemels et al., 2012, 2021), which models continuous-time control systems by changing the input only when stability is at risk, ensuring efficient and timely interventions. Treven et al. (2023) propose a no-regret continuous-time model-based RL algorithm which, akin to OTaCoS, performs optimistic exploration. They study the problem where controls can be executed continuously in time and propose adaptive measurement selection strategies. Similarly, we propose a novel model-based RL algorithm, OTaCoS, based on the principle of optimism in the face of uncertainty. We show that OTaCoS has no regret for sufficiently smooth dynamics and has considerable sample-efficiency gains over its model-free counterpart.

7 Conclusion and discussion

We study the problem of time-adaptive RL for continuous-time systems with continuous state and action spaces. We investigate two practical settings: one where each interaction has an inherent cost, and one where we have a hard constraint on the number of interactions. We propose a novel RL framework, TaCoS, and show that both of these settings result in extended MDPs which can be solved with standard RL algorithms. In our experiments, we show that, in the interaction cost setting, combining standard RL algorithms with TaCoS results in a significant reduction in the number of interactions without any loss in performance. Furthermore, for the second setting, TaCoS achieves considerably better control performance despite having a small budget for the number of interactions. Moreover, we show that TaCoS improves robustness to a large range of interaction frequencies and generally improves the sample complexity of learning. Finally, we propose OTaCoS, a no-regret model-based RL algorithm for TaCoS, and show that it yields further sample-efficiency gains.

Acknowledgments and Disclosure of Funding

This project has received funding from the Swiss National Science Foundation under NCCR Automation, grant agreement 51NF40 180545, and the Microsoft Swiss Joint Research Center.

References

  • Anta and Tabuada, (2010) Anta, A. and Tabuada, P. (2010). To sample or not to sample: Self-triggered control for nonlinear systems. IEEE Transactions on automatic control, 55(9):2030–2042.
  • Astrom and Bernhardsson, (2002) Astrom, K. J. and Bernhardsson, B. M. (2002). Comparison of riemann and lebesgue sampling for first order stochastic systems. In Proceedings of the 41st IEEE Conference on Decision and Control, 2002., volume 2, pages 2011–2016. IEEE.
  • Berger, (1993) Berger, M. A. (1993). Markov Jump Processes, pages 121–138. Springer New York, New York, NY.
  • Biedenkapp et al., (2021) Biedenkapp, A., Rajan, R., Hutter, F., and Lindauer, M. (2021). Temporl: Learning when to act. In International Conference on Machine Learning, pages 914–924. PMLR.
  • Bobkov and Götze, (1999) Bobkov, S. G. and Götze, F. (1999). Exponential integrability and transportation cost related to logarithmic sobolev inequalities. Journal of Functional Analysis, 163(1):1–28.
  • Braylan et al., (2015) Braylan, A., Hollenbeck, M., Meyerson, E., and Miikkulainen, R. (2015). Frame skip is a powerful parameter for learning to play atari. In Workshops at the twenty-ninth AAAI conference on artificial intelligence.
  • Burns et al., (2023) Burns, K., Yu, T., Finn, C., and Hausman, K. (2023). Offline reinforcement learning at multiple frequencies. In Conference on Robot Learning, pages 2041–2051. PMLR.
  • Chen et al., (2021) Chen, C., Tang, H., Hao, J., Liu, W., and Meng, Z. (2021). Addressing action oscillations through learning policy inertia. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7020–7027.
  • Chen et al., (2018) Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. (2018). Neural ordinary differential equations. Advances in neural information processing systems, 31.
  • Chua et al., (2018) Chua, K., Calandra, R., McAllister, R., and Levine, S. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In NeurIPS.
  • Cranmer et al., (2020) Cranmer, M., Greydanus, S., Hoyer, S., Battaglia, P., Spergel, D., and Ho, S. (2020). Lagrangian neural networks. arXiv preprint arXiv:2003.04630.
  • Curi et al., (2020) Curi, S., Berkenkamp, F., and Krause, A. (2020). Efficient model-based reinforcement learning through optimistic policy search and planning. Advances in Neural Information Processing Systems, 33:14156–14170.
  • Dabney et al., (2020) Dabney, W., Ostrovski, G., and Barreto, A. (2020). Temporally-extended ε-greedy exploration. arXiv preprint arXiv:2006.01782.
  • Djellout et al., (2004) Djellout, H., Guillin, A., and Wu, L. (2004). Transportation cost-information inequalities and applications to random dynamical systems and diffusions. The Annals of Probability, 32(3):2702–2732.
  • Doya, (2000) Doya, K. (2000). Reinforcement learning in continuous time and space. Neural computation, 12(1):219–245.
  • Durugkar et al., (2016) Durugkar, I. P., Rosenbaum, C., Dernbach, S., and Mahadevan, S. (2016). Deep reinforcement learning with macro-actions. arXiv preprint arXiv:1606.04615.
  • Eberhard et al., (2022) Eberhard, O., Hollenstein, J., Pinneri, C., and Martius, G. (2022). Pink noise is all you need: Colored noise exploration in deep reinforcement learning. In The Eleventh International Conference on Learning Representations.
  • Engquist et al., (2007) Engquist, B., Li, X., Ren, W., Vanden-Eijnden, E., et al. (2007). Heterogeneous multiscale methods: a review. Communications in Computational Physics, 2(3):367–450.
  • Freeman et al., (2021) Freeman, C. D., Frey, E., Raichuk, A., Girgin, S., Mordatch, I., and Bachem, O. (2021). Brax - a differentiable physics engine for large scale rigid body simulation.
  • Greydanus et al., (2019) Greydanus, S., Dzamba, M., and Yosinski, J. (2019). Hamiltonian neural networks. Advances in neural information processing systems, 32.
  • Grigsby et al., (2021) Grigsby, J., Yoo, J. Y., and Qi, Y. (2021). Towards automatic actor-critic solutions to continuous control. arXiv preprint arXiv:2106.08918.
  • Haarnoja et al., (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR.
  • Hafner et al., (2019) Hafner, D., Lillicrap, T., Ba, J., and Norouzi, M. (2019). Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603.
  • Harb et al., (2018) Harb, J., Bacon, P.-L., Klissarov, M., and Precup, D. (2018). When waiting is not an option: Learning options with a deliberation cost. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
  • Heemels et al., (2021) Heemels, W., Johansson, K. H., and Tabuada, P. (2021). Event-triggered and self-triggered control. In Encyclopedia of Systems and Control, pages 724–730. Springer.
  • Heemels et al., (2012) Heemels, W. P., Johansson, K. H., and Tabuada, P. (2012). An introduction to event-triggered and self-triggered control. In 2012 ieee 51st ieee conference on decision and control (cdc), pages 3270–3285. IEEE.
  • Holt et al., (2023) Holt, S., Hüyük, A., and van der Schaar, M. (2023). Active observing in continuous-time control. In Thirty-seventh Conference on Neural Information Processing Systems.
  • Huang et al., (2019) Huang, Y., Kavitha, V., and Zhu, Q. (2019). Continuous-time markov decision processes with controlled observations. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 32–39. IEEE.
  • Jones et al., (2009) Jones, D. S., Plank, M., and Sleeman, B. D. (2009). Differential equations and mathematical biology. CRC press.
  • Kaandorp and Koole, (2007) Kaandorp, G. C. and Koole, G. (2007). Optimal outpatient appointment scheduling. Health care management science, 10:217–229.
  • Kabzan et al., (2020) Kabzan, J., Valls, M. I., Reijgwart, V. J., Hendrikx, H. F., Ehmke, C., Prajapat, M., Bühler, A., Gosala, N., Gupta, M., Sivanesan, R., et al. (2020). Amz driverless: The full autonomous racing system. Journal of Field Robotics, 37(7):1267–1294.
  • Kakade et al., (2020) Kakade, S., Krishnamurthy, A., Lowrey, K., Ohnishi, M., and Sun, W. (2020). Information theoretic regret bounds for online nonlinear control. NeurIPS, 33:15312–15325.
  • Kallianpur and Sundar, (2014) Kallianpur, G. and Sundar, P. (2014). Jump Markov Processes. In Stochastic Analysis and Diffusion Processes. Oxford University Press.
  • Karimi, (2023) Karimi, A. (2023). Decision frequency adaptation in reinforcement learning using continuous options with open-loop policies.
  • Kirschner and Krause, (2018) Kirschner, J. and Krause, A. (2018). Information directed sampling and bandits with heteroscedastic noise. In Conference On Learning Theory, pages 358–384. PMLR.
  • Krale et al., (2023) Krale, M., Simão, T. D., and Jansen, N. (2023). Act-then-measure: reinforcement learning for partially observable environments with active measuring. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 33, pages 212–220.
  • Lee et al., (2020) Lee, J., Lee, B.-J., and Kim, K.-E. (2020). Reinforcement learning for control with multiple frequencies. Advances in Neural Information Processing Systems, 33:3254–3264.
  • Lenhart and Workman, (2007) Lenhart, S. and Workman, J. T. (2007). Optimal control applied to biological models. CRC press.
  • Lillicrap et al., (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
  • Lutter et al., (2021) Lutter, M., Mannor, S., Peters, J., Fox, D., and Garg, A. (2021). Value iteration in continuous actions, states and time. arXiv preprint arXiv:2105.04682.
  • Mankowitz et al., (2014) Mankowitz, D. J., Mann, T. A., and Mannor, S. (2014). Time regularized interrupting options. In International Conference on Machine Learning.
  • Mann and Mannor, (2014) Mann, T. and Mannor, S. (2014). Scaling up approximate value iteration with options: Better policies with fewer iterations. In International conference on machine learning, pages 127–135. PMLR.
  • Mnih et al., (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
  • Ni and Jang, (2022) Ni, T. and Jang, E. (2022). Continuous control on time. In ICLR 2022 Workshop on Generalizable Policy Learning in Physical World.
  • Panetta and Fister, (2003) Panetta, J. C. and Fister, K. R. (2003). Optimal control applied to competing chemotherapeutic cell-kill strategies. SIAM Journal on Applied Mathematics, 63(6):1954–1971.
  • Rothfuss et al., (2023) Rothfuss, J., Sukhija, B., Birchler, T., Kassraie, P., and Krause, A. (2023). Hallucinated adversarial control for conservative offline policy evaluation. In Uncertainty in Artificial Intelligence, pages 1774–1784. PMLR.
  • Schulman et al., (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International conference on machine learning, pages 1889–1897. PMLR.
  • Schulman et al., (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
  • Seifner and Sanchez, (2023) Seifner, P. and Sanchez, R. J. (2023). Neural markov jump processes. arXiv preprint arXiv:2305.19744.
  • Sharma et al., (2017) Sharma, S., Srinivas, A., and Ravindran, B. (2017). Learning to repeat: Fine grained action repetition for deep reinforcement learning. arXiv preprint arXiv:1702.06054.
  • Spong et al., (2006) Spong, M. W., Hutchinson, S., Vidyasagar, M., et al. (2006). Robot modeling and control, volume 3. Wiley New York.
  • Srinivas et al., (2017) Srinivas, A., Sharma, S., and Ravindran, B. (2017). Dynamic action repetition for deep reinforcement learning. In Proc. AAAI.
  • Srinivas et al., (2009) Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. (2009). Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995.
  • Sukhija et al., (2024) Sukhija, B., Treven, L., Sancaktar, C., Blaes, S., Coros, S., and Krause, A. (2024). Optimistic active exploration of dynamical systems. NeurIPS.
  • Sutton et al., (1999) Sutton, R. S., Precup, D., and Singh, S. (1999). Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211.
  • Tap, (2000) Tap, F. (2000). Economics-based optimal control of greenhouse tomato crop production. Wageningen University and Research.
  • Treven et al., (2023) Treven, L., Hübotter, J., Sukhija, B., Dörfler, F., and Krause, A. (2023). Efficient exploration in continuous-time model-based reinforcement learning.
  • Turchetta et al., (2022) Turchetta, M., Corinzia, L., Sussex, S., Burton, A., Herrera, J., Athanasiadis, I., Buhmann, J. M., and Krause, A. (2022). Learning long-term crop management strategies with cyclesgym. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A., editors, Advances in Neural Information Processing Systems, volume 35, pages 11396–11409. Curran Associates, Inc.
  • Vakili et al., (2021) Vakili, S., Khezeli, K., and Picheny, V. (2021). On information gain and regret bounds in gaussian process bandits. In AISTATS.
  • Vezhnevets et al., (2016) Vezhnevets, A., Mnih, V., Osindero, S., Graves, A., Vinyals, O., Agapiou, J., and Kavukcuoglu, K. (2016). Strategic attentive writer for learning macro-actions. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.
  • Williams and Rasmussen, (2006) Williams, C. K. and Rasmussen, C. E. (2006). Gaussian processes for machine learning, volume 2. MIT press Cambridge, MA.
  • Yildiz et al., (2021) Yildiz, C., Heinonen, M., and Lähdesmäki, H. (2021). Continuous-time model-based reinforcement learning. In International Conference on Machine Learning, pages 12009–12018. PMLR.
  • Yu et al., (2021) Yu, H., Xu, W., and Zhang, H. (2021). Taac: Temporally abstract actor-critic for continuous control. Advances in Neural Information Processing Systems, 34:29021–29033.
  • Zheng et al., (2023) Zheng, Q., Henaff, M., Amos, B., and Grover, A. (2023). Semi-supervised offline reinforcement learning with action-free trajectories. In International conference on machine learning, pages 42339–42362. PMLR.

Appendix A Extended Theory

In this section, we prove Theorem 2 for OTaCoS. We separate the section into two parts: the proof for the transition cost setting (Section A.1) and the proof for the bounded number of switches setting (Section A.2).

We start with the definitions of model complexity and sub-Gaussian random vector that we will use extensively in this section.

Definition 2 (Model Complexity).

We define the model complexity as in Curi et al. (2020):

$$\mathcal{I}_{N}:=\max_{\mathcal{D}_{1},\dots,\mathcal{D}_{N}}\sum_{n=1}^{N}\sum_{(\bm{x},\bm{u},t)\in\mathcal{D}_{n}}\left\lVert\bm{\sigma}_{n}(\bm{x},\bm{u},t)\right\rVert_{2}^{2}.\tag{9}$$
Definition 3.

A random variable $x\in\mathbb{R}$ is said to be sub-Gaussian with variance proxy $\sigma^{2}$ if $\mathbb{E}[x]=0$ and we have:

$$\mathbb{E}[e^{tx}]\leq e^{\frac{\sigma^{2}t^{2}}{2}},\quad\forall t\in\mathbb{R}.$$

A random vector $\bm{x}\in\mathbb{R}^{d}$ is said to be sub-Gaussian with variance proxy $\sigma^{2}$ if for any $\bm{e}\in\mathbb{R}^{d}$ with $\lVert\bm{e}\rVert_{2}=1$, the random variable $\bm{x}^{\top}\bm{e}$ is sub-Gaussian with variance proxy $\sigma^{2}$. We write $\bm{x}\sim\text{subG}(\sigma^{2})$.
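As a quick illustration of Definition 3, the sketch below checks the moment-generating-function inequality for a Rademacher variable (uniform on $\{-1,+1\}$), which has zero mean and is sub-Gaussian with variance proxy $\sigma^{2}=1$. The Rademacher example and the function names are chosen here for illustration and do not appear in the paper; the check is numerical on a finite grid of $t$, not a proof.

```python
import math


def rademacher_mgf(t: float) -> float:
    """E[exp(t x)] for x uniform on {-1, +1}; this equals cosh(t)."""
    return 0.5 * (math.exp(t) + math.exp(-t))


def subgaussian_bound(t: float, sigma2: float = 1.0) -> float:
    """Right-hand side exp(sigma^2 t^2 / 2) of the sub-Gaussian condition."""
    return math.exp(0.5 * sigma2 * t * t)


# Check the inequality on a grid of t values in [-5, 5].
ts = [k / 10.0 for k in range(-50, 51)]
holds = all(rademacher_mgf(t) <= subgaussian_bound(t) for t in ts)
print("MGF bound holds on the grid:", holds)
```

The two sides coincide at $t=0$ and the bound is strict elsewhere, which is why a variance proxy of 1 suffices for this variable.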

In the following, we distinguish between the state of the augmented MDP $\bm{s}$ and the true state of the dynamical system $\bm{x}$. The augmented state at time step $k$ includes the true state of the system $\bm{x}_{k}$, the integrated reward $b_{k}$ between $k-1$ and $k$, and the time left to go $t_{k}$, i.e., $\bm{s}_{k}=[\bm{x}_{k}^{\top},b_{k},t_{k}]^{\top}$.
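The augmented state can be pictured with a small sketch. The code below is a hypothetical illustration, not the authors' implementation: the names `AugmentedState` and `augmented_step` are invented here, the noise term is dropped (a deterministic ODE instead of an SDE), and a plain Euler scheme with step `dt` stands in for the true flow. It holds a control `u` for a chosen duration `tau`, integrates the running reward into `b`, and decreases the time left to go.

```python
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class AugmentedState:
    x: Vector  # true system state x_k
    b: float   # reward integrated over the previous interval, b_k
    t: float   # time left to go, t_k


def augmented_step(
    s: AugmentedState,
    u: Vector,
    tau: float,
    drift: Callable[[Vector, Vector], Vector],
    reward: Callable[[Vector, Vector], float],
    dt: float = 1e-3,
) -> AugmentedState:
    """Hold control u for duration tau and return the next augmented state."""
    tau = min(tau, s.t)                   # never run past the horizon
    n_sub = max(1, int(round(tau / dt)))  # number of Euler substeps
    h = tau / n_sub
    x, b = list(s.x), 0.0
    for _ in range(n_sub):
        b += h * reward(x, u)             # accumulate the running reward
        dx = drift(x, u)
        x = [xi + h * di for xi, di in zip(x, dx)]
    return AugmentedState(x=x, b=b, t=s.t - tau)


# Toy example (assumed): scalar system dx/dt = -x + u, running reward -x^2.
s0 = AugmentedState(x=[1.0], b=0.0, t=1.0)
s1 = augmented_step(
    s0, u=[0.0], tau=0.4,
    drift=lambda x, u: [-x[0] + u[0]],
    reward=lambda x, u: -x[0] ** 2,
)
print(s1)
```

Because the duration is part of the action, the time left to go must be part of the state for the process to remain Markovian over a finite horizon.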

A.1 Transition Cost setting

We prove our regret bound for the transition cost case in the following. We start with the difference lemma, which adapts Sukhija et al. (2024, Lemma 2) to our setting.

Lemma 3 (Difference lemma).

Define $V_{\bm{\pi},\bm{\Phi}}(\bm{x},\tau)$ as

$$\mathbb{E}_{\bm{\pi},\bm{\Phi}}\left[\sum_{k=0}^{K(\tau)-1}r(\bm{s}_{k},\bm{\pi}(\bm{s}_{k}))\,\Big|\,\bm{x}_{0}=\bm{x}\right],\quad\text{where }\sum_{k=0}^{K(\tau)-1}\pi_{\mathcal{T}}(\bm{x}_{k},t_{k})=\tau,$$

that is, the total reward starting with time to go $\tau$ and state $\bm{x}$ for the policy $\bm{\pi}$ and dynamics $\bm{\Phi}$. Here, the expectation w.r.t. $\bm{\pi},\bm{\Phi}$ is taken over the trajectory induced by the policy $\bm{\pi}$ on the dynamics $\bm{\Phi}$. Then, for all $\bm{\pi}$, $\bm{\Phi}^{\prime}$, $\bm{\Phi}^{*}$, $\bm{x}_{0}$, and $T>0$, we have

$$V_{\bm{\pi},\bm{\Phi}^{\prime}}(\bm{x}_{0},T)-V_{\bm{\pi},\bm{\Phi}^{*}}(\bm{x}_{0},T)=\mathbb{E}_{\bm{\pi},\bm{\Phi}^{*}}\left[\sum_{k\geq 0}\left(V_{\bm{\pi},\bm{\Phi}^{\prime}}(\widehat{\bm{x}}_{k+1},t_{k+1})-V_{\bm{\pi},\bm{\Phi}^{\prime}}(\bm{x}_{k+1},t_{k+1})\right)\right],\tag{10}$$

where $\widehat{\bm{x}}_{k+1}$ is the state of $\widehat{\bm{s}}_{k+1}=\bm{\Psi}_{\bm{\Phi}^{\prime}}(\bm{s}_{k},\bm{\pi}(\bm{s}_{k}),\bm{w}_{k})$ and $\bm{x}_{k+1}$ is the state of $\bm{s}_{k+1}=\bm{\Psi}_{\bm{\Phi}^{*}}(\bm{s}_{k},\bm{\pi}(\bm{s}_{k}),\bm{w}_{k})$.
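Before the proof, the identity in Eq. (10) can be sanity-checked numerically on a toy problem. The sketch below is an illustration under simplifying assumptions that are not part of the lemma: the dynamics are deterministic and scalar (no noise $\bm{w}_{k}$), every action lasts one unit of time so that the time to go reduces to a count of remaining steps, and the policy, reward, and the two linear models `true_step` (playing the role of $\bm{\Phi}^{*}$) and `model_step` (playing the role of $\bm{\Phi}^{\prime}$) are arbitrary choices made here.

```python
from typing import Callable

Step = Callable[[float, float], float]


def policy(x: float) -> float:
    return -0.5 * x


def reward(x: float, u: float) -> float:
    return -(x * x + 0.1 * u * u)


def true_step(x: float, u: float) -> float:   # stands in for Phi^*
    return 0.9 * x + 0.2 * u


def model_step(x: float, u: float) -> float:  # stands in for Phi'
    return 0.8 * x + 0.25 * u


def value(step: Step, x: float, steps_left: int) -> float:
    """Total reward of the policy under `step` with `steps_left` steps to go."""
    total = 0.0
    for _ in range(steps_left):
        u = policy(x)
        total += reward(x, u)
        x = step(x, u)
    return total


K, x0 = 10, 1.0

# Left-hand side of Eq. (10): value gap between the model and the true system.
lhs = value(model_step, x0, K) - value(true_step, x0, K)

# Right-hand side: follow the TRUE trajectory and sum the one-step
# discrepancies, each evaluated with the model's value function.
rhs, x = 0.0, x0
for k in range(K):
    u = policy(x)
    x_hat_next = model_step(x, u)  # one-step prediction of the model
    x_next = true_step(x, u)       # actual next state
    rhs += value(model_step, x_hat_next, K - k - 1) - value(model_step, x_next, K - k - 1)
    x = x_next

print(abs(lhs - rhs) < 1e-9)
```

The two sides agree up to floating-point error because the sum telescopes along the true trajectory, which is the same mechanism the proof uses.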

Proof.

$$\begin{aligned}
V_{\bm{\pi},\bm{\Phi}^{*}}(\bm{x}_{0},T)&=\mathbb{E}_{\bm{\pi},\bm{\Phi}^{*}}\left[\sum_{k\geq 0}r(\bm{s}_{k},\bm{\pi}(\bm{s}_{k}))\right]\\
&=\mathbb{E}_{\bm{\pi},\bm{\Phi}^{*}}\left[r(\bm{s}_{0},\bm{\pi}(\bm{s}_{0}))+\sum_{k\geq 1}r(\bm{s}_{k},\bm{\pi}(\bm{s}_{k}))\right]\\
&=\mathbb{E}_{\bm{\pi},\bm{\Phi}^{*}}\left[r(\bm{s}_{0},\bm{\pi}(\bm{s}_{0}))+V_{\bm{\pi},\bm{\Phi}^{*}}(\bm{x}_{1},t_{1})\right]\\
&=\mathbb{E}_{\bm{\pi},\bm{\Phi}^{*}}\left[r(\bm{s}_{0},\bm{\pi}(\bm{s}_{0}))+V_{\bm{\pi},\bm{\Phi}^{\prime}}(\widehat{\bm{x}}_{1},t_{1})-V_{\bm{\pi},\bm{\Phi}^{\prime}}(\bm{x}_{0},T)\right]\\
&\quad+\mathbb{E}_{\bm{\pi},\bm{\Phi}^{*}}\left[V_{\bm{\pi},\bm{\Phi}^{\prime}}(\bm{x}_{0},T)-V_{\bm{\pi},\bm{\Phi}^{\prime}}(\widehat{\bm{x}}_{1},t_{1})+V_{\bm{\pi},\bm{\Phi}^{*}}(\bm{x}_{1},t_{1})\right]\\
&=V_{\bm{\pi},\bm{\Phi}^{\prime}}(\bm{x}_{0},T)+\mathbb{E}_{\bm{\pi},\bm{\Phi}^{*}}\left[V_{\bm{\pi},\bm{\Phi}^{\prime}}(\bm{x}_{1},t_{1})-V_{\bm{\pi},\bm{\Phi}^{\prime}}(\widehat{\bm{x}}_{1},t_{1})\right]\\
&\quad+\mathbb{E}_{\bm{\pi},\bm{\Phi}^{*}}\left[V_{\bm{\pi},\bm{\Phi}^{*}}(\bm{x}_{1},t_{1})-V_{\bm{\pi},\bm{\Phi}^{\prime}}(\bm{x}_{1},t_{1})\right]
\end{aligned}$$

Hence we have:

\begin{align*}
V_{\bm{\pi},\bm{\Phi}^*}(\bm{x}_0, T) - V_{\bm{\pi},\bm{\Phi}'}(\bm{x}_0, T)
&= \mathbb{E}_{\bm{\pi},\bm{\Phi}^*}\left[ V_{\bm{\pi},\bm{\Phi}'}(\bm{x}_1, t_1) - V_{\bm{\pi},\bm{\Phi}'}(\widehat{\bm{x}}_1, t_1) \right] \\
&\quad + \mathbb{E}_{\bm{\pi},\bm{\Phi}^*}\left[ V_{\bm{\pi},\bm{\Phi}^*}(\bm{x}_1, t_1) - V_{\bm{\pi},\bm{\Phi}'}(\bm{x}_1, t_1) \right].
\end{align*}

By repeating the step inductively the result follows. ∎

In the following, we leverage the result above to bound the regret of our optimistic planner w.r.t. the difference in value functions.

Lemma 4 (Per episode regret bound).

Let 4 hold. Then, with probability at least $1-\delta$, we have for all $n \geq 0$:

\begin{equation*}
V_{\bm{\pi}^*,\bm{\Phi}^*}(\bm{x}_0, T) - V_{\bm{\pi}_n,\bm{\Phi}^*}(\bm{x}_0, T) \leq \mathbb{E}_{\bm{\pi}_n,\bm{\Phi}^*}\left[ \sum_{k \geq 0} \left( V_{\bm{\pi}_n,\bm{\Phi}_n}(\widehat{\bm{x}}_{n,k+1}, t_{n,k+1}) - V_{\bm{\pi}_n,\bm{\Phi}_n}(\bm{x}_{n,k+1}, t_{n,k+1}) \right) \right]. \tag{11}
\end{equation*}
Proof.

Since we choose the policy optimistically, we get

\begin{equation*}
V_{\bm{\pi}^*,\bm{\Phi}^*}(\bm{x}_0, T) - V_{\bm{\pi}_n,\bm{\Phi}^*}(\bm{x}_0, T) \leq V_{\bm{\pi}_n,\bm{\Phi}_n}(\bm{x}_0, T) - V_{\bm{\pi}_n,\bm{\Phi}^*}(\bm{x}_0, T).
\end{equation*}

Applying Lemma 3 the result follows. ∎
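The optimism step in this proof can be spelled out as a short chain. This is a sketch under two assumptions that are defined outside this section: the true system lies in the model set, $\bm{\Phi}^* \in \mathcal{M}_{n-1}$ (which calibration is meant to provide with probability at least $1-\delta$), and the pair $(\bm{\pi}_n, \bm{\Phi}_n)$ jointly maximizes the value over policies and over models in $\mathcal{M}_{n-1}$.

```latex
\begin{align*}
V_{\bm{\pi}^*,\bm{\Phi}^*}(\bm{x}_0, T)
  &\le \max_{\bm{\pi},\; \bm{\Phi} \in \mathcal{M}_{n-1}} V_{\bm{\pi},\bm{\Phi}}(\bm{x}_0, T)
     && \text{since } \bm{\Phi}^* \in \mathcal{M}_{n-1} \\
  &= V_{\bm{\pi}_n,\bm{\Phi}_n}(\bm{x}_0, T)
     && \text{by the optimistic choice of } (\bm{\pi}_n, \bm{\Phi}_n).
\end{align*}
```

Subtracting $V_{\bm{\pi}_n,\bm{\Phi}^*}(\bm{x}_0, T)$ from both sides gives the inequality used in the proof. Its right-hand side compares the same policy $\bm{\pi}_n$ under two models, which is the quantity that Lemma 3 expands step by step.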

Now we derive an upper and lower bound on our value function.

Lemma 5 (Objective upper bound).

Let $\bm{\pi}$ be any policy from the class $\Pi_{TC}$ and consider any $T > 0$. Then we have:

\begin{equation*}
-\frac{C}{t_{\min}} T \leq V_{\bm{\pi},\bm{\Psi}^*}(\bm{x}_0, T) \leq BT.
\end{equation*}
Proof.

Since the running reward is bounded, $0 \leq b^*(\bm{x}, \bm{u}) \leq B$, the number of steps $K$ in an episode is bounded by $0 \leq K \leq \frac{T}{t_{\min}}$, and the switch cost is bounded, $0 \leq c(\bm{x}, \bm{u}) \leq C$, we have:

\begin{equation*}
-\frac{C}{t_{\min}} T \leq V_{\bm{\pi},\bm{\Psi}^*}(\bm{x}_0, T) \leq BT.
\end{equation*}
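The two bounds can be traced term by term. The sketch below assumes that the objective is the expected accumulated running reward minus the switch costs paid at each of the $K$ interactions; this form is not restated in this section, so it should be read as an assumption rather than as the paper's definition.

```latex
% Assumed form of the objective: running reward minus switch costs.
\begin{align*}
V_{\bm{\pi},\bm{\Psi}^*}(\bm{x}_0, T)
  &= \mathbb{E}\left[ \int_0^T b^*(\bm{x}_t, \bm{u}_t)\, dt - \sum_{k=0}^{K-1} c(\bm{x}_k, \bm{u}_k) \right], \\
V_{\bm{\pi},\bm{\Psi}^*}(\bm{x}_0, T)
  &\le \mathbb{E}\left[ \int_0^T B \, dt - 0 \right] = BT, \\
V_{\bm{\pi},\bm{\Psi}^*}(\bm{x}_0, T)
  &\ge \mathbb{E}\left[ 0 - C K \right] \ge -\frac{C}{t_{\min}} T .
\end{align*}
```

The upper bound drops the nonnegative switch costs and uses $b^* \leq B$; the lower bound drops the nonnegative running reward and uses $c \leq C$ together with $K \leq T / t_{\min}$.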

A key lemma we use to bound the difference in value functions is the following result from Kakade et al. (2020).

Lemma 6 (Absolute expectation difference under two Gaussians; Lemma C.2 of Kakade et al. (2020)).

Let $\bm{z}_1 \sim \mathcal{N}(\bm{\mu}_1, \sigma^2 \bm{I})$ and $\bm{z}_2 \sim \mathcal{N}(\bm{\mu}_2, \sigma^2 \bm{I})$. Then, for any (appropriately measurable) positive function $g$, it holds that:

\begin{equation*}
\mathbb{E}[g(\bm{z}_1)] - \mathbb{E}[g(\bm{z}_2)] \leq \min\left\{ \frac{\left\lVert \bm{\mu}_1 - \bm{\mu}_2 \right\rVert}{\sigma}, 1 \right\} \sqrt{\mathbb{E}[g^2(\bm{z}_1)]}.
\end{equation*}
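As a numerical sanity check of Lemma 6, the following sketch estimates both sides of the bound by Monte Carlo. The test function `bump`, the means, and the sample size are illustrative assumptions and do not come from the paper. The ratio is taken in the dimensionless form $\lVert \bm{\mu}_1 - \bm{\mu}_2 \rVert / \sigma$, which is also an assumption of this sketch.

```python
import numpy as np


def gaussian_gap_check(mu1, mu2, sigma, g, num_samples=200_000, seed=0):
    """Monte Carlo estimate of both sides of the Gaussian expectation-difference bound.

    Returns (lhs, rhs) with
      lhs ~ E[g(z1)] - E[g(z2)],
      rhs ~ min(||mu1 - mu2|| / sigma, 1) * sqrt(E[g(z1)^2]).
    """
    rng = np.random.default_rng(seed)
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    # Shared noise for both Gaussians (common random numbers) lowers the
    # variance of the estimated gap.
    noise = sigma * rng.standard_normal((num_samples, mu1.size))
    g1 = g(mu1 + noise)
    g2 = g(mu2 + noise)
    lhs = g1.mean() - g2.mean()
    rhs = min(np.linalg.norm(mu1 - mu2) / sigma, 1.0) * np.sqrt((g1 ** 2).mean())
    return lhs, rhs


def bump(z):
    # Hypothetical positive, bounded test function (not from the paper).
    return np.exp(-0.5 * np.sum(z ** 2, axis=-1))


lhs, rhs = gaussian_gap_check(mu1=[0.0, 0.0], mu2=[0.5, 0.0], sigma=1.0, g=bump)
print(f"E[g(z1)] - E[g(z2)] ~ {lhs:.4f}, bound ~ {rhs:.4f}")
```

For this particular choice both sides have closed forms, roughly 0.030 for the gap and 0.289 for the bound, so the estimate should sit well inside the bound. In the regret analysis the role of $g$ is played by a shifted value function, and the two means are the next-state predictions of the true and the optimistic system.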

Furthermore, due to 4, we can also bound the distance between the next-state predictions of the true system $\bm{\Phi}^*$ and the optimistic system $\bm{\Phi}_n$.

Lemma 7.

Let 4 hold. Then, for all $n \geq 0$, we have:

\begin{equation*}
\left\lVert \bm{x}_{n,k+1} - \widehat{\bm{x}}_{n,k+1} \right\rVert \leq 2\sqrt{d_{\bm{x}}}\, \beta_{n-1} \left\lVert \bm{\sigma}_{n-1}(\bm{x}_{n,k}, \bm{\pi}_n(\bm{x}_{n,k}, t_{n,k})) \right\rVert.
\end{equation*}
Proof.
\begin{align*}
\left\lVert \bm{x}_{n,k+1} - \widehat{\bm{x}}_{n,k+1} \right\rVert
&= \left\lVert \bm{\Phi}^*(\bm{x}_{n,k}, \bm{\pi}_n(\bm{x}_{n,k}, t_{n,k})) + \bm{w}_{n,k} - \left( \bm{\Phi}_n(\bm{x}_{n,k}, \bm{\pi}_n(\bm{x}_{n,k}, t_{n,k})) + \bm{w}_{n,k} \right) \right\rVert \\
&= \left\lVert \bm{\Phi}^*(\bm{x}_{n,k}, \bm{\pi}_n(\bm{x}_{n,k}, t_{n,k})) - \bm{\Phi}_n(\bm{x}_{n,k}, \bm{\pi}_n(\bm{x}_{n,k}, t_{n,k})) \right\rVert \\
&\leq 2\sqrt{d_{\bm{x}}}\, \beta_{n-1} \left\lVert \bm{\sigma}_{n-1}(\bm{x}_{n,k}, \bm{\pi}_n(\bm{x}_{n,k}, t_{n,k})) \right\rVert,
\end{align*}

where the last inequality follows from the fact that $\bm{\Phi}^*, \bm{\Phi}_n \in \mathcal{M}_{n-1}$.
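The last inequality can be illustrated numerically. The sketch below assumes that $\mathcal{M}_{n-1}$ contains exactly those models whose prediction lies within $\beta_{n-1}$ epistemic standard deviations of a posterior mean in every coordinate. This is a common construction for calibrated models, but the precise definition is given outside this section, and all numbers are hypothetical.

```python
import numpy as np


def max_disagreement(mu, sigma, beta, num_pairs=10_000, seed=0):
    """Largest Euclidean distance between sampled pairs of predictions in the confidence box."""
    rng = np.random.default_rng(seed)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # Each prediction lies in the box [mu - beta * sigma, mu + beta * sigma],
    # coordinate-wise (assumed form of the model set).
    offsets_a = rng.uniform(-1.0, 1.0, size=(num_pairs, mu.size))
    offsets_b = rng.uniform(-1.0, 1.0, size=(num_pairs, mu.size))
    pred_a = mu + beta * sigma * offsets_a
    pred_b = mu + beta * sigma * offsets_b
    return np.linalg.norm(pred_a - pred_b, axis=-1).max()


mu = np.array([0.3, -1.2, 0.7])       # hypothetical posterior mean at one (x, u)
sigma = np.array([0.05, 0.20, 0.10])  # hypothetical epistemic std at the same point
beta = 2.0                            # hypothetical calibration constant

worst = max_disagreement(mu, sigma, beta)
bound = 2.0 * np.sqrt(mu.size) * beta * np.linalg.norm(sigma)
print(f"max sampled distance = {worst:.4f}, bound = {bound:.4f}")
```

Under this assumed box construction, two members of the set differ by at most $2\beta_{n-1}\sigma_{n-1,j}$ in coordinate $j$, so the sampled distances stay below the bound of Lemma 7.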

Next, we relate the regret at each episode to the model epistemic uncertainty using Lemma 3 and Lemma 7.

Corollary 8.

Let 12 and 4 hold. Then, with probability at least $1-\delta$, we have for all $n \geq 0$:

\begin{equation*}
V_{\bm{\pi}^*,\bm{\Phi}^*}(\bm{x}_0, T) - V_{\bm{\pi}_n,\bm{\Phi}^*}(\bm{x}_0, T) \leq \frac{2\sqrt{d_{\bm{x}}}\, \beta_{n-1} T}{\sigma} \left( B + \frac{C}{t_{\min}} \right) \mathbb{E}\left[ \sum_{k \geq 0} \left\lVert \bm{\sigma}_{n-1}(\bm{x}_{n,k}, \bm{\pi}_n(\bm{x}_{n,k}, t_{n,k})) \right\rVert \right]. \tag{12}
\end{equation*}
Proof.

From Lemma 4 we have:

\begin{equation*}
V_{\bm{\pi}^*,\bm{\Phi}^*}(\bm{x}_0, T) - V_{\bm{\pi}_n,\bm{\Phi}^*}(\bm{x}_0, T) \leq \mathbb{E}\left[ \sum_{k \geq 0} \left( V_{\bm{\pi}_n,\bm{\Phi}_n}(\widehat{\bm{x}}_{n,k+1}, t_{n,k+1}) - V_{\bm{\pi}_n,\bm{\Phi}_n}(\bm{x}_{n,k+1}, t_{n,k+1}) \right) \right].
\end{equation*}

Lemma 6 applies only to positive functions $g$. We therefore shift the value function and apply the lemma to $g(\cdot) = V_{\bm{\pi}_n,\bm{\Phi}_n}(\cdot, t_{n,k+1}) + \frac{C}{t_{\min}} T$, which is positive due to Lemma 5. Moreover, for all $\bm{x} \in \mathcal{X}$:

\begin{equation*}
g(\bm{x}) = V_{\bm{\pi}_n,\bm{\Phi}_n}(\bm{x}, t_{n,k+1}) + \frac{C}{t_{\min}} T \leq B\, t_{n,k+1} + \frac{C}{t_{\min}} T \leq T \left( B + \frac{C}{t_{\min}} \right).
\end{equation*}

Applying Lemma 6 we obtain:

\begin{equation*}
\mathbb{E}\left[ V_{\bm{\pi}_n,\bm{\Phi}_n}(\widehat{\bm{x}}_{n,k+1}, t_{n,k+1}) - V_{\bm{\pi}_n,\bm{\Phi}_n}(\bm{x}_{n,k+1}, t_{n,k+1}) \right] \leq \frac{T}{\sigma} \left( B + \frac{C}{t_{\min}} \right) \mathbb{E}\left[ \left\lVert \bm{x}_{n,k+1} - \widehat{\bm{x}}_{n,k+1} \right\rVert \right].
\end{equation*}

Finally, applying Lemma 7 we arrive at:

\begin{equation*}
V_{\bm{\pi}^*,\bm{\Phi}^*}(\bm{x}_0, T) - V_{\bm{\pi}_n,\bm{\Phi}^*}(\bm{x}_0, T) \leq \frac{2\sqrt{d_{\bm{x}}}\, \beta_{n-1} T}{\sigma} \left( B + \frac{C}{t_{\min}} \right) \mathbb{E}\left[ \sum_{k \geq 0} \left\lVert \bm{\sigma}_{n-1}(\bm{x}_{n,k}, \bm{\pi}_n(\bm{x}_{n,k}, t_{n,k})) \right\rVert \right].
\end{equation*}

Now we can prove our regret bound for the transition cost case.

Theorem 9.

Let 12 and 4 hold. Then, with probability at least $1-\delta$, we have for all $n \geq 0$:

\begin{align*}
R_N &= \sum_{n=1}^{N} \left( V_{\bm{\pi}^*,\bm{\Phi}^*}(\bm{x}_0, T) - V_{\bm{\pi}_n,\bm{\Phi}^*}(\bm{x}_0, T) \right) \\
&\leq \frac{2\sqrt{d_{\bm{x}}}\, \beta_{N-1} T^{3/2}}{\sigma \sqrt{t_{\min}}} \left( B + \frac{C}{t_{\min}} \right) \sqrt{N \mathcal{I}_N}.
\end{align*}
Proof.

We compute:

\begin{align*}
R_N &= \sum_{n=1}^{N} \left( V_{\bm{\pi}^*,\bm{\Phi}^*}(\bm{x}_0, T) - V_{\bm{\pi}_n,\bm{\Phi}^*}(\bm{x}_0, T) \right) \\
&\leq \frac{2\sqrt{d_{\bm{x}}}\, T}{\sigma} \left( B + \frac{C}{t_{\min}} \right) \sum_{n=1}^{N} \beta_{n-1} \mathbb{E}\left[ \sum_{k \geq 0} \left\lVert \bm{\sigma}_{n-1}(\bm{x}_{n,k}, \bm{\pi}_n(\bm{x}_{n,k}, t_{n,k})) \right\rVert \right] \\
&\leq \frac{2\sqrt{d_{\bm{x}}}\, \beta_{N-1} T}{\sigma} \left( B + \frac{C}{t_{\min}} \right) \mathbb{E}\left[ \sum_{n=1}^{N} \sum_{k \geq 0} \left\lVert \bm{\sigma}_{n-1}(\bm{x}_{n,k}, \bm{\pi}_n(\bm{x}_{n,k}, t_{n,k})) \right\rVert \right] \\
&\leq \frac{2\sqrt{d_{\bm{x}}}\, \beta_{N-1} T}{\sigma} \left( B + \frac{C}{t_{\min}} \right) \sqrt{\frac{TN}{t_{\min}}}\, \mathbb{E}\left[ \sqrt{ \sum_{n=1}^{N} \sum_{k \geq 0} \left\lVert \bm{\sigma}_{n-1}(\bm{x}_{n,k}, \bm{\pi}_n(\bm{x}_{n,k}, t_{n,k})) \right\rVert^2 } \right] \\
&\leq \frac{2\sqrt{d_{\bm{x}}}\, \beta_{N-1} T^{3/2}}{\sigma \sqrt{t_{\min}}} \left( B + \frac{C}{t_{\min}} \right) \sqrt{N \mathcal{I}_N}.
\end{align*}

Here the first inequality follows from Corollary 8, the second from the monotonicity of the sequence $(\beta_{n})_{n\geq 0}$, the third from the Cauchy–Schwarz inequality, and the last one by maximizing the term inside the expectation. ∎

Our regret $R_{N}$ is sublinear if $\beta_{N-1}\sqrt{N{\mathcal{I}}_{N}}$ is sublinear. For general well-calibrated models this is difficult to verify. However, for Gaussian process dynamics, ${\mathcal{I}}_{N}$ equals, up to constant factors, the maximum information gain $\gamma_{N}$ (Srinivas et al., 2009; cf. Curi et al. (2020, Lemma 17)). The maximum information gain is sublinear for a rich class of kernels (Vakili et al., 2021), which yields sublinear regret for OTaCoS (see Sukhija et al. (2024, Theorem 2) for more detail).
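
To make the sublinearity condition concrete, the following minimal sketch evaluates $\beta_{N-1}\sqrt{N\gamma_{N}}/N$ for growing $N$. It rests on assumptions that are not taken from the analysis above: a polylogarithmic information gain $\gamma_{N}=\log(N)^{d+1}$ (the type of rate known for squared-exponential kernels, with constants dropped) and $\beta_{N}$ growing like $\sqrt{\gamma_{N}}$. The function name `regret_rate` and its parameters are hypothetical.

```python
import math


def regret_rate(N, d=2, beta_const=1.0):
    """Evaluate beta_{N-1} * sqrt(N * gamma_N) under assumed growth rates."""
    # Assumption: polylogarithmic information gain, constants dropped.
    gamma_N = math.log(N) ** (d + 1)
    # Assumption: beta grows like sqrt(gamma_N).
    beta = beta_const * math.sqrt(gamma_N)
    return beta * math.sqrt(N * gamma_N)


# Ratio of the bound to N; a sublinear bound makes this ratio shrink.
ratios = [regret_rate(N) / N for N in (10**3, 10**4, 10**5, 10**6, 10**7)]
for N, ratio in zip((10**3, 10**4, 10**5, 10**6, 10**7), ratios):
    print(f"N = {N:>8d}   bound / N = {ratio:.3f}")
```

Under these assumed rates the ratio decays like $\log(N)^{d+1}/\sqrt{N}$, so the average regret per episode vanishes as $N$ grows.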

A.2 Bounded number of transitions

We overload the notation in this section and add the number of switches as an argument of the value function, so that $V_{{\bm{\pi}}_{n},{\bm{\Phi}}^{*}}({\bm{x}}_{0},T,0)=V_{{\bm{\pi}}_{n},{\bm{\Phi}}^{*}}({\bm{x}}_{0},T)$.

Lemma 10 (Per episode regret bound).

We have:

\begin{align*}
V_{{\bm{\pi}}_{n},{\bm{\Phi}}^{*}}({\bm{x}}_{0},T,0)-V_{{\bm{\pi}}^{*},{\bm{\Phi}}^{*}}({\bm{x}}_{0},T,0)
\leq\mathbb{E}\left[\sum_{k=0}^{K-1}V_{{\bm{\pi}}_{n},{\bm{\Phi}}_{n}}({\bm{x}}_{n,k+1},t_{n,k+1},k+1)-V_{{\bm{\pi}}_{n},{\bm{\Phi}}_{n}}(\widehat{{\bm{x}}}_{n,k+1},t_{n,k+1},k+1)\right],
\end{align*}

where $\widehat{{\bm{x}}}_{n,k+1}$ is the state of the one-step hallucinated component $\widehat{{\bm{s}}}_{n,k+1}={\bm{\Psi}}_{{\bm{\Phi}}_{n}}({\bm{s}}_{n,k},{\bm{\pi}}_{n}({\bm{s}}_{n,k}),{\bm{w}}_{n,k})$ and ${\bm{x}}_{n,k+1}$ is the state of ${\bm{s}}_{n,k+1}={\bm{\Psi}}_{{\bm{\Phi}}^{*}}({\bm{s}}_{n,k},{\bm{\pi}}_{n}({\bm{s}}_{n,k}),{\bm{w}}_{n,k})$.

Proof.
\begin{align*}
V_{{\bm{\pi}}_{n},{\bm{\Phi}}^{*}}({\bm{x}}_{0},T,0)
&=\mathbb{E}\left[\sum_{k\geq 0}r({\bm{s}}_{n,k},{\bm{\pi}}_{n}({\bm{s}}_{n,k}))\right]=\mathbb{E}\left[r({\bm{s}}_{n,0},{\bm{\pi}}_{n}({\bm{s}}_{n,0}))+\sum_{k\geq 1}r({\bm{s}}_{n,k},{\bm{\pi}}_{n}({\bm{s}}_{n,k}))\right] \\
&=\mathbb{E}\left[r({\bm{s}}_{n,0},{\bm{\pi}}_{n}({\bm{s}}_{n,0}))+V_{{\bm{\pi}}_{n},{\bm{\Phi}}^{*}}({\bm{x}}_{n,1},t_{n,1},1)\right] \\
&=\mathbb{E}\left[r({\bm{s}}_{n,0},{\bm{\pi}}_{n}({\bm{s}}_{n,0}))+V_{{\bm{\pi}}_{n},{\bm{\Phi}}_{n}}({\bm{x}}_{n,1},t_{n,1},1)-V_{{\bm{\pi}}_{n},{\bm{\Phi}}_{n}}({\bm{x}}_{0},T,0)\right] \\
&\quad+\mathbb{E}\left[V_{{\bm{\pi}}_{n},{\bm{\Phi}}_{n}}({\bm{x}}_{0},T,0)-V_{{\bm{\pi}}_{n},{\bm{\Phi}}_{n}}({\bm{x}}_{n,1},t_{n,1},1)+V_{{\bm{\pi}}_{n},{\bm{\Phi}}^{*}}({\bm{x}}_{n,1},t_{n,1},1)\right] \\
&=V_{{\bm{\pi}}_{n},{\bm{\Phi}}_{n}}({\bm{x}}_{0},T,0)+\mathbb{E}\left[V_{{\bm{\pi}}_{n},{\bm{\Phi}}_{n}}({\bm{x}}_{n,1},t_{n,1},1)-V_{{\bm{\pi}}_{n},{\bm{\Phi}}_{n}}(\widehat{{\bm{x}}}_{n,1},t_{n,1},1)\right] \\
&\quad+\mathbb{E}\left[V_{{\bm{\pi}}_{n},{\bm{\Phi}}^{*}}({\bm{x}}_{n,1},t_{n,1},1)-V_{{\bm{\pi}}_{n},{\bm{\Phi}}_{n}}({\bm{x}}_{n,1},t_{n,1},1)\right]
\end{align*}

The last equality uses the Bellman equation under the model ${\bm{\Phi}}_{n}$, i.e., $V_{{\bm{\pi}}_{n},{\bm{\Phi}}_{n}}({\bm{x}}_{0},T,0)=\mathbb{E}\left[r({\bm{s}}_{n,0},{\bm{\pi}}_{n}({\bm{s}}_{n,0}))+V_{{\bm{\pi}}_{n},{\bm{\Phi}}_{n}}(\widehat{{\bm{x}}}_{n,1},t_{n,1},1)\right]$, which also fixes the order of ${\bm{x}}_{n,1}$ and $\widehat{{\bm{x}}}_{n,1}$ in agreement with the statement of the lemma. Hence we have:

\begin{align*}
&V_{{\bm{\pi}}_{n},{\bm{\Phi}}^{*}}({\bm{x}}_{0},T,0)-V_{{\bm{\pi}}_{n},{\bm{\Phi}}_{n}}({\bm{x}}_{0},T,0) \\
&\quad=\mathbb{E}\left[V_{{\bm{\pi}}_{n},{\bm{\Phi}}_{n}}({\bm{x}}_{n,1},t_{n,1},1)-V_{{\bm{\pi}}_{n},{\bm{\Phi}}_{n}}(\widehat{{\bm{x}}}_{n,1},t_{n,1},1)\right]+\mathbb{E}\left[V_{{\bm{\pi}}_{n},{\bm{\Phi}}^{*}}({\bm{x}}_{n,1},t_{n,1},1)-V_{{\bm{\pi}}_{n},{\bm{\Phi}}_{n}}({\bm{x}}_{n,1},t_{n,1},1)\right]
\end{align*}

Repeating this step inductively and using $V_{{\bm{\pi}}_{n},{\bm{\Phi}}^{*}}({\bm{x}}_{n,K},t_{n,K},K)=0$ proves the lemma. ∎
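
The telescoping argument can be checked numerically on a toy problem. The sketch below is only an illustration under simplifying assumptions: deterministic scalar dynamics (so expectations drop out), a fixed number of steps `K` instead of policy-chosen durations, and hypothetical functions `f_true`, `f_model`, `policy`, and `reward` that do not come from the paper. It compares the value gap between the true system and the model with the sum of one-step differences of the model value function, in the orientation used in the statement of Lemma 10.

```python
K = 10  # number of transitions; the value after K transitions is zero


def f_true(x, u):
    """Hypothetical true dynamics (stand-in for Phi^*)."""
    return 0.9 * x + u


def f_model(x, u):
    """Hypothetical learned model (stand-in for Phi_n)."""
    return 0.8 * x + u


def policy(x):
    return -0.5 * x


def reward(x, u):
    return -(x ** 2) - 0.1 * u ** 2


def value(x, k, step):
    """Return collected from state x at transition k under dynamics `step`."""
    total = 0.0
    for _ in range(k, K):
        u = policy(x)
        total += reward(x, u)
        x = step(x, u)
    return total


x0 = 1.0
# Left-hand side: value under the true dynamics minus value under the model.
lhs = value(x0, 0, f_true) - value(x0, 0, f_model)

# Right-hand side: sum of one-step differences along the true trajectory.
rhs, x = 0.0, x0
for k in range(K):
    u = policy(x)
    x_next = f_true(x, u)   # real next state x_{n,k+1}
    x_hat = f_model(x, u)   # one-step hallucinated state \hat{x}_{n,k+1}
    rhs += value(x_next, k + 1, f_model) - value(x_hat, k + 1, f_model)
    x = x_next

print(lhs, rhs)
```

Both quantities agree up to floating-point error, which mirrors the inductive step of the proof: each one-step difference is evaluated with the model value function only.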

A.2.1 Subgaussianity of the noise

In principle, we could assume that the noise ${\bm{w}}_{k}$ is Gaussian and obtain the regret bound with the same analysis. However, stochastic flows are in many cases not exactly Gaussian but only sub-Gaussian. For such noise we cannot apply Lemma 6 and need to resort to a different analysis. First, we show that under mild assumptions on the SDE dynamics functions ${\bm{f}}^{*}$ and ${\bm{g}}^{*}$ the resulting noise ${\bm{w}}_{k}$ is sub-Gaussian.

To derive this result we follow the work of Djellout et al. (2004). We present the results in a rather informal way; for more rigorous statements we refer the reader to Djellout et al. (2004).

Definition 4 (Wasserstein distance).

Let $({\mathcal{E}},d_{{\mathcal{E}}})$ be a metric space and let $\mu,\nu$ be two probability measures on ${\mathcal{E}}$. We define:

\begin{align*}
W_{p}(\mu,\nu)=\inf_{\gamma\in\Gamma(\mu,\nu)}\mathbb{E}_{(x,y)\sim\gamma}\left[d_{{\mathcal{E}}}(x,y)^{p}\right]^{\frac{1}{p}},
\end{align*}

where $\Gamma(\mu,\nu)$ denotes the set of couplings of $\mu$ and $\nu$, i.e., probability measures on ${\mathcal{E}}\times{\mathcal{E}}$ with marginals $\mu$ and $\nu$.

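
For intuition, the Wasserstein distance is easy to compute in one special case that is not needed in the analysis: two empirical measures on the real line with equally many atoms. There the optimal coupling pairs the sorted samples, so the infimum reduces to an average over order statistics. The helper name `empirical_wasserstein` is hypothetical, and the sketch assumes ${\mathcal{E}}=\mathbb{R}$ with $d_{{\mathcal{E}}}(x,y)=|x-y|$.

```python
import random


def empirical_wasserstein(x, y, p=1):
    """W_p between two 1-D empirical measures with equally many atoms."""
    if len(x) != len(y):
        raise ValueError("this sketch assumes equal sample sizes")
    # In one dimension the optimal coupling matches sorted samples.
    xs, ys = sorted(x), sorted(y)
    cost = sum(abs(a - b) ** p for a, b in zip(xs, ys)) / len(xs)
    return cost ** (1.0 / p)


rng = random.Random(0)
samples = [rng.gauss(0.0, 1.0) for _ in range(1000)]
shift = 0.7
shifted = [s + shift for s in samples]

# Translating a measure by `shift` moves every atom by the same amount,
# so both distances equal the shift up to floating-point error.
w1 = empirical_wasserstein(samples, shifted, p=1)
w2 = empirical_wasserstein(samples, shifted, p=2)
print(w1, w2)
```
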
Definition 5 (Kullback–Leibler divergence).

Let $({\mathcal{E}},d_{{\mathcal{E}}})$ be a metric space and let $\mu,\nu$ be two probability measures on ${\mathcal{E}}$. We define:

\begin{align*}
H(\nu\,\|\,\mu)=\begin{cases}\mathbb{E}_{x\sim\nu}\left[\log\left(\frac{d\nu(x)}{d\mu(x)}\right)\right],&\text{if }\nu\ll\mu\\ +\infty,&\text{else}\end{cases}
\end{align*}

Definition 6 ($L^{p}$-transportation cost information inequality).

Let $({\mathcal{E}},d_{{\mathcal{E}}})$ be a metric space and let $\mu$ be a probability measure on ${\mathcal{E}}$. We say that $\mu$ satisfies the $L^{p}$-transportation cost information inequality, and for short write $\mu\in T_{p}(C)$, if there exists a constant $C$ such that for any probability measure $\nu$ on ${\mathcal{E}}$ we have:

\begin{align*}
W_{p}(\mu,\nu)\leq\sqrt{2CH(\nu\,\|\,\mu)}.
\end{align*}
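
As a sanity check of Definition 6, consider the following worked example, which is an illustration and not part of the original analysis. Take ${\mathcal{E}}=\mathbb{R}$ with $d_{{\mathcal{E}}}(x,y)=|x-y|$, $\mu=\mathcal{N}(0,s^{2})$ and $\nu=\mathcal{N}(m,s^{2})$; the symbols $m$ and $s$ are introduced only for this example. The relative entropy between two Gaussians with equal variance and the transport cost of a pure translation are both available in closed form:

```latex
\begin{align*}
H(\nu\,\|\,\mu) &= \frac{m^{2}}{2s^{2}}, \\
W_{p}(\mu,\nu) &= |m| \quad \text{for } p\in\{1,2\}, \\
\sqrt{2CH(\nu\,\|\,\mu)} &= \sqrt{2s^{2}\cdot\frac{m^{2}}{2s^{2}}} = |m| \quad \text{for } C=s^{2}.
\end{align*}
```

Translated Gaussians therefore attain the inequality with equality for $C=s^{2}$. That the same constant works for every $\nu$ is Talagrand's transportation inequality for Gaussian measures.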

We now state an important theorem of Bobkov and Götze (1999) that we will use later.

Theorem 11 (From Bobkov and Götze (1999)).

Let $({\mathcal{E}},d_{{\mathcal{E}}})$ be a metric space and let $\mu$ be a probability measure on ${\mathcal{E}}$. We have that $\mu\in T_{1}(C)$ if and only if for any $\mu$-integrable and $L_{F}$-Lipschitz function $F:({\mathcal{E}},d_{{\mathcal{E}}})\to\mathbb{R}$ and for any $\lambda\in\mathbb{R}$ we have:

\begin{align*}
\mathbb{E}_{x\sim\mu}\left[e^{\lambda\left(F(x)-\mathbb{E}_{x\sim\mu}[F(x)]\right)}\right]\leq e^{\frac{\lambda^{2}}{2}CL_{F}^{2}}
\end{align*}

Next, we provide a condition under which ${\bm{\Xi}}({\bm{x}},{\bm{u}},t)$ is a sub-Gaussian random variable for any $t\in{\mathcal{T}}$.

Corollary 12 (Adjusted Corollary 4.1 of Djellout et al. (2004)).

Assume

\begin{align*}
\sup_{\begin{subarray}{c}{\bm{x}}\in\mathbb{R}^{d_{{\bm{x}}}}\\ {\bm{u}}\in\mathbb{R}^{d_{{\bm{u}}}}\end{subarray}}\left\lVert{\bm{g}}^{*}({\bm{x}},{\bm{u}})\right\rVert_{F}\leq A,\quad\left\lVert{\bm{f}}^{*}({\bm{x}},{\bm{u}})-{\bm{f}}^{*}(\widehat{{\bm{x}}},\widehat{{\bm{u}}})\right\rVert\leq L_{{\bm{f}}^{*}}\left\lVert({\bm{x}},{\bm{u}})-(\widehat{{\bm{x}}},\widehat{{\bm{u}}})\right\rVert,
\end{align*}

and denote the law of $({\bm{\Xi}}({\bm{x}},{\bm{u}},t))_{t\in{\mathcal{T}}}$ on the space $C({\mathcal{T}},\mathbb{R}^{d_{{\bm{x}}}})$ (the space of continuous functions from ${\mathcal{T}}$ to $\mathbb{R}^{d_{{\bm{x}}}}$) by $\mathbb{P}_{{\bm{x}}}$. Then, there exists a constant $C=C(A,L_{{\bm{f}}^{*}},T)$ such that $\mathbb{P}_{{\bm{x}}}\in T_{1}(C)$ on the space $C({\mathcal{T}},\mathbb{R}^{d_{{\bm{x}}}})$ equipped with the metric:

\begin{align*}
d(\gamma_{1},\gamma_{2})=\sup_{t\in[0,T]}\left\lVert\gamma_{1}(t)-\gamma_{2}(t)\right\rVert
\end{align*}

Let ${\bm{e}}$ be any unit vector in $\mathbb{R}^{d_{{\bm{x}}}}$ and define:

\begin{align*}
F_{{\bm{e}},t}&:C({\mathcal{T}},\mathbb{R}^{d_{{\bm{x}}}})\to\mathbb{R} \\
F_{{\bm{e}},t}&:\gamma\mapsto\gamma(t)^{\top}{\bm{e}}
\end{align*}

We have:

\begin{align*}
\left\lvert F_{{\bm{e}},t}(\gamma_{1})-F_{{\bm{e}},t}(\gamma_{2})\right\rvert
&=\left\lvert(\gamma_{1}(t)-\gamma_{2}(t))^{\top}{\bm{e}}\right\rvert \\
&\leq\left\lVert\gamma_{1}(t)-\gamma_{2}(t)\right\rVert\left\lVert{\bm{e}}\right\rVert=\left\lVert\gamma_{1}(t)-\gamma_{2}(t)\right\rVert \\
&\leq\sup_{t\in{\mathcal{T}}}\left\lVert\gamma_{1}(t)-\gamma_{2}(t)\right\rVert=d(\gamma_{1},\gamma_{2})
\end{align*}

Therefore, for any ${\bm{e}}$ and $t$ the function $F_{{\bm{e}},t}$ is $1$-Lipschitz. Since we have

\begin{align*}
\mathbb{E}\left[\left\lvert F_{{\bm{e}},t}(\gamma)\right\rvert\right]=\int_{C({\mathcal{T}},\mathbb{R}^{d_{{\bm{x}}}})}\left\lvert\gamma(t)^{\top}{\bm{e}}\right\rvert\,d\mathbb{P}_{{\bm{x}}}(\gamma)=\mathbb{E}\left[\left\lvert{\bm{\Xi}}({\bm{x}},{\bm{u}},t)^{\top}{\bm{e}}\right\rvert\right]<\infty
\end{align*}

the function $F_{{\bm{e}},t}$ is also $\mathbb{P}_{{\bm{x}}}$-integrable. Combining the latter observation with Theorem 11, we obtain that for any unit vector ${\bm{e}}\in\mathbb{R}^{d_{{\bm{x}}}}$, any $\lambda\in\mathbb{R}$, and any $t\in{\mathcal{T}}$ we have:

\begin{align*}
\mathbb{E}_{{\bm{\Xi}}({\bm{x}},{\bm{u}},t)}\left[e^{\lambda\left({\bm{\Xi}}({\bm{x}},{\bm{u}},t)^{\top}{\bm{e}}-\mathbb{E}\left[{\bm{\Xi}}({\bm{x}},{\bm{u}},t)^{\top}{\bm{e}}\right]\right)}\right]=\mathbb{E}_{\gamma\sim\mathbb{P}_{{\bm{x}}}}\left[e^{\lambda\left(F_{{\bm{e}},t}(\gamma)-\mathbb{E}_{\gamma\sim\mathbb{P}_{{\bm{x}}}}[F_{{\bm{e}},t}(\gamma)]\right)}\right]\leq e^{\frac{\lambda^{2}}{2}C}
\end{align*}

Hence, under the assumptions of Theorem 2 for the bounded number of switches setting, we have that for any $t\in{\mathcal{T}}$ the random variable ${\bm{\Xi}}({\bm{x}},{\bm{u}},t)-\mathbb{E}\left[{\bm{\Xi}}({\bm{x}},{\bm{u}},t)\right]\sim\text{subG}\left(C\right)$. The variance proxy $C$ depends on $A,L_{{\bm{f}}^{*}},T$.
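As a small illustration of the sub-Gaussian condition used above (not part of the original argument), the following sketch checks the moment generating function bound $\mathbb{E}[e^{\lambda X}]\leq e^{\frac{\lambda^{2}}{2}C}$ for a Rademacher variable, a standard example with variance proxy $C=1$. The function names and the grid of $\lambda$ values are illustrative assumptions.

```python
import math


def centered_mgf_rademacher(lam: float) -> float:
    """E[exp(lam * X)] for X uniform on {-1, +1}; X has mean zero, so this is cosh(lam)."""
    return 0.5 * (math.exp(lam) + math.exp(-lam))


def subgaussian_bound(lam: float, C: float) -> float:
    """Right-hand side exp(lam^2 * C / 2) of the sub-Gaussian condition."""
    return math.exp(0.5 * lam * lam * C)


# Illustrative grid of lambda values in [-5, 5] (an assumption, not from the paper).
lams = [k / 10.0 for k in range(-50, 51)]

# With variance proxy C = 1 the bound holds on the whole grid.
holds_with_C1 = all(
    centered_mgf_rademacher(lam) <= subgaussian_bound(lam, 1.0) + 1e-12
    for lam in lams
)

# With the smaller proxy C = 0.5 the bound fails for some lambda.
holds_with_C_half = all(
    centered_mgf_rademacher(lam) <= subgaussian_bound(lam, 0.5) + 1e-12
    for lam in lams
)
```

The check only probes a finite grid, so it illustrates the definition rather than proving it.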

A.2.2 Lipschitzness of the expected flow ${\bm{\Phi}}^{*}$

To apply the analysis when the noise ${\bm{w}}_{k}$ is an arbitrary sub-Gaussian random variable, we also need to show that the dynamics function ${\bm{\Phi}}^{*}$ is Lipschitz. We start with some general results.

Lemma 13.

Let ${\bm{f}}:\mathbb{R}^{n}\to\mathbb{R}^{m}$, $A\subset[n]$ and denote $B=A^{C}$. If for all ${\bm{x}},\widehat{{\bm{x}}}\in\mathbb{R}^{n}$ we have:

  • $\left\lVert{\bm{f}}({\bm{x}}_{A},{\bm{x}}_{B})-{\bm{f}}(\widehat{{\bm{x}}}_{A},{\bm{x}}_{B})\right\rVert_{2}\leq L_{A}\left\lVert{\bm{x}}_{A}-\widehat{{\bm{x}}}_{A}\right\rVert_{2}$,

  • $\left\lVert{\bm{f}}({\bm{x}}_{A},{\bm{x}}_{B})-{\bm{f}}({\bm{x}}_{A},\widehat{{\bm{x}}}_{B})\right\rVert_{2}\leq L_{B}\left\lVert{\bm{x}}_{B}-\widehat{{\bm{x}}}_{B}\right\rVert_{2}$,

then ${\bm{f}}$ is $2(L_{A}+L_{B})$-Lipschitz.

Proof.

We have:

\begin{align*}
\left\lVert{\bm{f}}({\bm{x}})-{\bm{f}}(\widehat{{\bm{x}}})\right\rVert_{2}
&=\left\lVert{\bm{f}}({\bm{x}}_{A},{\bm{x}}_{B})-{\bm{f}}(\widehat{{\bm{x}}}_{A},\widehat{{\bm{x}}}_{B})\right\rVert_{2}\\
&=\left\lVert{\bm{f}}({\bm{x}}_{A},{\bm{x}}_{B})-{\bm{f}}(\widehat{{\bm{x}}}_{A},{\bm{x}}_{B})+{\bm{f}}(\widehat{{\bm{x}}}_{A},{\bm{x}}_{B})-{\bm{f}}(\widehat{{\bm{x}}}_{A},\widehat{{\bm{x}}}_{B})\right\rVert_{2}\\
&\leq L_{A}\left\lVert{\bm{x}}_{A}-\widehat{{\bm{x}}}_{A}\right\rVert_{2}+L_{B}\left\lVert{\bm{x}}_{B}-\widehat{{\bm{x}}}_{B}\right\rVert_{2}\\
&\leq(L_{A}+L_{B})\left(\left\lVert{\bm{x}}_{A}-\widehat{{\bm{x}}}_{A}\right\rVert_{2}+\left\lVert{\bm{x}}_{B}-\widehat{{\bm{x}}}_{B}\right\rVert_{2}\right)\\
&\leq 2(L_{A}+L_{B})\left\lVert\begin{pmatrix}{\bm{x}}_{A}-\widehat{{\bm{x}}}_{A}\\ {\bm{x}}_{B}-\widehat{{\bm{x}}}_{B}\end{pmatrix}\right\rVert_{2}\\
&=2(L_{A}+L_{B})\left\lVert{\bm{x}}-\widehat{{\bm{x}}}\right\rVert_{2}.
\end{align*}
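The bound of Lemma 13 can be sanity-checked numerically. The sketch below is an illustration under stated assumptions, not part of the paper: the test function `f`, the constants `L_A` and `L_B`, and the sampling range are all hypothetical choices. It samples random pairs of points and compares the largest observed ratio $\lVert{\bm{f}}({\bm{x}})-{\bm{f}}(\widehat{{\bm{x}}})\rVert/\lVert{\bm{x}}-\widehat{{\bm{x}}}\rVert_{2}$ with the constant $2(L_{A}+L_{B})$.

```python
import math
import random

# Hypothetical partial Lipschitz constants for the two coordinate blocks.
L_A, L_B = 1.5, 0.7


def f(x_a: float, x_b: float) -> float:
    """Toy function: L_A-Lipschitz in x_a for fixed x_b, L_B-Lipschitz in x_b for fixed x_a."""
    return L_A * math.sin(x_a) + L_B * math.cos(x_b)


def max_lipschitz_ratio(num_pairs: int = 10_000, seed: int = 0) -> float:
    """Largest observed |f(x) - f(x_hat)| / ||x - x_hat||_2 over random pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(num_pairs):
        x_a, x_b = rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0)
        y_a, y_b = rng.uniform(-3.0, 3.0), rng.uniform(-3.0, 3.0)
        dist = math.hypot(x_a - y_a, x_b - y_b)
        if dist > 0.0:  # skip the degenerate case of identical points
            worst = max(worst, abs(f(x_a, x_b) - f(y_a, y_b)) / dist)
    return worst


empirical = max_lipschitz_ratio()
lemma_constant = 2.0 * (L_A + L_B)  # constant guaranteed by Lemma 13
```

Random sampling only gives a lower estimate of the true Lipschitz constant, so the comparison is a consistency check; the lemma's constant is conservative and the observed ratio is typically well below it.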

Lemma 14 (Lipschitzness of ${\bm{\Phi}}_{{\bm{f}}^{*}}$).

There exists a positive constant $L_{{\bm{\Phi}}_{{\bm{f}}}}$ such that the flow ${\bm{\Phi}}_{{\bm{f}}^{*}}$ is $L_{{\bm{\Phi}}_{{\bm{f}}}}$-Lipschitz.

Proof.

We will first prove coordinate-wise Lipschitzness. We observe:

  1. Lipschitzness in time (without loss of generality $\widehat{t}\leq t$):

    \begin{align*}
    \left\lVert{\bm{\Phi}}_{{\bm{f}}^{*}}({\bm{x}},{\bm{u}},t)-{\bm{\Phi}}_{{\bm{f}}^{*}}({\bm{x}},{\bm{u}},\widehat{t})\right\rVert
    &=\left\lVert\int_{0}^{t}\mathbb{E}[{\bm{f}}^{*}({\bm{x}}_{s},{\bm{u}})]ds-\int_{0}^{\widehat{t}}\mathbb{E}[{\bm{f}}^{*}({\bm{x}}_{s},{\bm{u}})]ds\right\rVert\\
    &\leq\int_{\widehat{t}}^{t}\mathbb{E}\left[\left\lVert{\bm{f}}^{*}({\bm{x}}_{s},{\bm{u}})\right\rVert\right]ds\leq F\left\lvert t-\widehat{t}\right\rvert
    \end{align*}
  2. Lipschitzness in state ${\bm{x}}$: To prove this, consider $\delta{\bm{x}}_{t}={\bm{\Xi}}({\bm{x}},{\bm{u}},t)-{\bm{\Xi}}(\widehat{\bm{x}},{\bm{u}},t)$. Then we have

    \begin{align*}
    d\delta{\bm{x}}_{t}
    &=({\bm{f}}^{*}({\bm{x}}_{t},{\bm{u}})-{\bm{f}}^{*}(\widehat{\bm{x}}_{t},{\bm{u}}))dt+({\bm{g}}^{*}({\bm{x}}_{t},{\bm{u}})-{\bm{g}}^{*}(\widehat{\bm{x}}_{t},{\bm{u}}))d{\bm{B}}_{t}\\
    &=\delta{\bm{f}}^{*}_{t}dt+\delta{\bm{g}}^{*}_{t}d{\bm{B}}_{t}.
    \end{align*}

    Note that $\left\lVert\delta{\bm{f}}^{*}_{t}\right\rVert\leq L_{{\bm{f}}^{*}}\left\lVert\delta{\bm{x}}_{t}\right\rVert$ and $\left\lVert\delta{\bm{g}}^{*}_{t}\right\rVert\leq L_{{\bm{g}}^{*}}\left\lVert\delta{\bm{x}}_{t}\right\rVert$ since both functions are Lipschitz. Define ${\bm{y}}_{t}=\delta{\bm{x}}_{t}^{\top}\delta{\bm{x}}_{t}$ and use Itô's lemma to get

    \[
    d{\bm{y}}_{t}=2\delta{\bm{x}}_{t}^{\top}(\delta{\bm{f}}^{*}_{t}dt+\delta{\bm{g}}^{*}_{t}d{\bm{B}}_{t})+\text{tr}(\delta{\bm{g}}^{*}_{t}(\delta{\bm{g}}^{*}_{t})^{\top})dt
    \]

    Taking expectations, the stochastic integral term vanishes, and we obtain

    \begin{align*}
    \mathbb{E}[{\bm{y}}_{t}]
    &=\left\lVert\delta{\bm{x}}_{0}\right\rVert^{2}+\int_{0}^{t}2\mathbb{E}[\delta{\bm{x}}_{s}^{\top}\delta{\bm{f}}^{*}_{s}]+\mathbb{E}[\text{tr}(\delta{\bm{g}}^{*}_{s}(\delta{\bm{g}}^{*}_{s})^{\top})]ds\\
    &\leq\left\lVert\delta{\bm{x}}_{0}\right\rVert^{2}+\int_{0}^{t}2\mathbb{E}\left[\left\lVert\delta{\bm{x}}_{s}\right\rVert\left\lVert\delta{\bm{f}}^{*}_{s}\right\rVert\right]+\mathbb{E}[\left\lVert\delta{\bm{g}}^{*}_{s}\right\rVert^{2}]ds\\
    &\leq\left\lVert\delta{\bm{x}}_{0}\right\rVert^{2}+\int_{0}^{t}(2L_{{\bm{f}}^{*}}+L^{2}_{{\bm{g}}^{*}})\mathbb{E}[\left\lVert\delta{\bm{x}}_{s}\right\rVert^{2}]ds
    \end{align*}

    Note that ${\bm{y}}_{t}=\left\lVert\delta{\bm{x}}_{t}\right\rVert^{2}$, so we can apply Grönwall's inequality to get

    \[
    \mathbb{E}\left[\left\lVert\delta{\bm{x}}_{t}\right\rVert^{2}\right]\leq\left\lVert\delta{\bm{x}}_{0}\right\rVert^{2}e^{(2L_{{\bm{f}}^{*}}+L^{2}_{{\bm{g}}^{*}})t}.
    \]

    Moreover, by Jensen's inequality,

    \[
    \left\lVert\mathbb{E}[\delta{\bm{x}}_{t}]\right\rVert\leq\sqrt{\mathbb{E}\left[\left\lVert\delta{\bm{x}}_{t}\right\rVert^{2}\right]}\leq\left\lVert\delta{\bm{x}}_{0}\right\rVert e^{\frac{2L_{{\bm{f}}^{*}}+L^{2}_{{\bm{g}}^{*}}}{2}t}\leq\left\lVert\delta{\bm{x}}_{0}\right\rVert e^{\frac{2L_{{\bm{f}}^{*}}+L^{2}_{{\bm{g}}^{*}}}{2}T}.
    \]

    Hence, since ${\bm{\Phi}}_{{\bm{f}}^{*}}$ is the expected flow and $\delta{\bm{x}}_{0}={\bm{x}}-\widehat{\bm{x}}$, we have:

    \[
    \left\lVert{\bm{\Phi}}_{{\bm{f}}^{*}}({\bm{x}},{\bm{u}},t)-{\bm{\Phi}}_{{\bm{f}}^{*}}(\widehat{{\bm{x}}},{\bm{u}},t)\right\rVert\leq\left\lVert{\bm{x}}-\widehat{\bm{x}}\right\rVert e^{\frac{2L_{{\bm{f}}^{*}}+L^{2}_{{\bm{g}}^{*}}}{2}T}.
    \]
  3. Lipschitzness in action ${\bm{u}}$: We denote $\delta{\bm{x}}_{t}={\bm{\Xi}}({\bm{x}},{\bm{u}},t)-{\bm{\Xi}}({\bm{x}},\widehat{\bm{u}},t)$ and $\delta{\bm{u}}={\bm{u}}-\widehat{\bm{u}}$. Following the same steps as in the proof of Lipschitzness in state, we arrive at:

    \[
    d{\bm{y}}_{t}=2\delta{\bm{x}}_{t}^{\top}(\delta{\bm{f}}^{*}_{t}dt+\delta{\bm{g}}^{*}_{t}d{\bm{B}}_{t})+\text{tr}(\delta{\bm{g}}^{*}_{t}(\delta{\bm{g}}^{*}_{t})^{\top})dt
    \]

    Since both trajectories start from the same state, $\delta{\bm{x}}_{0}=0$, and taking expectations and integrating yields:

    \begin{align*}
    \mathbb{E}[{\bm{y}}_{t}]
    &=\int_{0}^{t}2\mathbb{E}[\delta{\bm{x}}_{s}^{\top}\delta{\bm{f}}^{*}_{s}]+\mathbb{E}[\text{tr}(\delta{\bm{g}}^{*}_{s}(\delta{\bm{g}}^{*}_{s})^{\top})]ds\\
    &\leq\int_{0}^{t}2\mathbb{E}\left[\left\lVert\delta{\bm{x}}_{s}\right\rVert\left\lVert\delta{\bm{f}}^{*}_{s}\right\rVert\right]+\mathbb{E}[\left\lVert\delta{\bm{g}}^{*}_{s}\right\rVert^{2}]ds\\
    &\leq\int_{0}^{t}2\mathbb{E}\left[L_{{\bm{f}}^{*}}\left\lVert\delta{\bm{x}}_{s}\right\rVert(\left\lVert\delta{\bm{x}}_{s}\right\rVert+\left\lVert\delta{\bm{u}}\right\rVert)\right]+\mathbb{E}\left[2L_{{\bm{g}}^{*}}^{2}\left(\left\lVert\delta{\bm{x}}_{s}\right\rVert^{2}+\left\lVert\delta{\bm{u}}\right\rVert^{2}\right)\right]ds\\
    &\leq\int_{0}^{t}(3L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2})\mathbb{E}\left[{\bm{y}}_{s}\right]+(L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2})\left\lVert\delta{\bm{u}}\right\rVert^{2}ds,
    \end{align*}

    where we used $(a+b)^{2}\leq 2a^{2}+2b^{2}$ and $ab\leq\frac{1}{2}(a^{2}+b^{2})$. Applying Grönwall's inequality results in:

    \begin{align*}
    \mathbb{E}\left[\left\lVert\delta{\bm{x}}_{t}\right\rVert^{2}\right]
    &\leq\left\lVert\delta{\bm{u}}\right\rVert^{2}(L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2})e^{(3L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2})t}\\
    &\leq\left\lVert\delta{\bm{u}}\right\rVert^{2}(L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2})e^{(3L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2})T}
    \end{align*}

    Applying Lemma 13 to items 2 and 3, we have that ${\bm{\Phi}}_{{\bm{f}}^{*}}(\cdot,\cdot,t)$ is $2\left(e^{\frac{2L_{{\bm{f}}^{*}}+L^{2}_{{\bm{g}}^{*}}}{2}T}+\sqrt{L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2}}\,e^{\frac{3L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2}}{2}T}\right)$-Lipschitz. Applying Lemma 13 to item 1 and ${\bm{\Phi}}_{{\bm{f}}^{*}}(\cdot,\cdot,t)$, and bounding $2\leq 4$, we finally obtain that ${\bm{\Phi}}_{{\bm{f}}^{*}}$ is $4\left(e^{\frac{2L_{{\bm{f}}^{*}}+L^{2}_{{\bm{g}}^{*}}}{2}T}+\sqrt{L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2}}\,e^{\frac{3L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2}}{2}T}+F\right)$-Lipschitz.

Corollary 15 (Lipschitzness of $\Phi_{b^{*}}$).

The cost flow $\Phi_{b^{*}}$ is $\mathcal{O}\left(e^{C_{1}(L_{{\bm{f}}^{*}}+L_{{\bm{g}}^{*}}^{2})T}\right)$-Lipschitz, where $C_{1}$ is a constant.

Proof.

As in the proof of Lemma 14, we first show coordinate-wise Lipschitzness.

  1.

    We first show Lipschitzness in time. Without loss of generality, let $\widehat{t}\leq t$:

    \begin{align*}
    \left\lvert\Phi_{b^{*}}({\bm{x}},{\bm{u}},t)-\Phi_{b^{*}}({\bm{x}},{\bm{u}},\widehat{t})\right\rvert &= \left\lvert\mathbb{E}\left[\int_{\widehat{t}}^{t}b^{*}({\bm{x}}_{s},{\bm{u}})\,ds\right]\right\rvert \\
    &\leq \mathbb{E}\left[\int_{\widehat{t}}^{t}\left\lvert b^{*}({\bm{x}}_{s},{\bm{u}})\right\rvert ds\right] \\
    &\leq \mathbb{E}\left[B(t-\widehat{t})\right]=B(t-\widehat{t}).
    \end{align*}
  2.

    To obtain Lipschitzness in state, observe:

    \begin{align*}
    \left\lvert\Phi_{b^{*}}({\bm{x}},{\bm{u}},t)-\Phi_{b^{*}}(\widehat{\bm{x}},{\bm{u}},t)\right\rvert &= \left\lvert\mathbb{E}\left[\int_{0}^{t}b^{*}({\bm{\Xi}}({\bm{x}},{\bm{u}},s),{\bm{u}})-b^{*}({\bm{\Xi}}(\widehat{\bm{x}},{\bm{u}},s),{\bm{u}})\,ds\right]\right\rvert \\
    &\leq \mathbb{E}\left[\int_{0}^{t}\left\lvert b^{*}({\bm{\Xi}}({\bm{x}},{\bm{u}},s),{\bm{u}})-b^{*}({\bm{\Xi}}(\widehat{\bm{x}},{\bm{u}},s),{\bm{u}})\right\rvert ds\right] \\
    &\leq L_{b^{*}}\mathbb{E}\left[\int_{0}^{t}\left\lVert{\bm{\Xi}}({\bm{x}},{\bm{u}},s)-{\bm{\Xi}}(\widehat{\bm{x}},{\bm{u}},s)\right\rVert ds\right] \\
    &\leq L_{b^{*}}\int_{0}^{t}\sqrt{\mathbb{E}\left[\left\lVert{\bm{\Xi}}({\bm{x}},{\bm{u}},s)-{\bm{\Xi}}(\widehat{\bm{x}},{\bm{u}},s)\right\rVert^{2}\right]}\,ds \\
    &\leq L_{b^{*}}\left\lVert{\bm{x}}-\widehat{\bm{x}}\right\rVert\int_{0}^{t}e^{\frac{2L_{{\bm{f}}^{*}}+L^{2}_{{\bm{g}}^{*}}}{2}s}\,ds \\
    &= \frac{2L_{b^{*}}}{2L_{{\bm{f}}^{*}}+L^{2}_{{\bm{g}}^{*}}}\left(e^{\frac{2L_{{\bm{f}}^{*}}+L^{2}_{{\bm{g}}^{*}}}{2}t}-1\right)\left\lVert{\bm{x}}-\widehat{\bm{x}}\right\rVert
    \end{align*}
  3.

    Finally, for Lipschitzness in action, observe:

    \begin{align*}
    \left\lvert\Phi_{b^{*}}({\bm{x}},{\bm{u}},t)-\Phi_{b^{*}}({\bm{x}},\widehat{\bm{u}},t)\right\rvert &= \left\lvert\mathbb{E}\left[\int_{0}^{t}b^{*}({\bm{\Xi}}({\bm{x}},{\bm{u}},s),{\bm{u}})-b^{*}({\bm{\Xi}}({\bm{x}},\widehat{\bm{u}},s),\widehat{\bm{u}})\,ds\right]\right\rvert \\
    &\leq \mathbb{E}\left[\int_{0}^{t}\left\lvert b^{*}({\bm{\Xi}}({\bm{x}},{\bm{u}},s),{\bm{u}})-b^{*}({\bm{\Xi}}({\bm{x}},\widehat{\bm{u}},s),\widehat{\bm{u}})\right\rvert ds\right] \\
    &\leq L_{b^{*}}\mathbb{E}\left[\int_{0}^{t}\left\lVert{\bm{\Xi}}({\bm{x}},{\bm{u}},s)-{\bm{\Xi}}({\bm{x}},\widehat{\bm{u}},s)\right\rVert+\left\lVert{\bm{u}}-\widehat{\bm{u}}\right\rVert ds\right] \\
    &\leq L_{b^{*}}t\left\lVert{\bm{u}}-\widehat{\bm{u}}\right\rVert+L_{b^{*}}\int_{0}^{t}\sqrt{\mathbb{E}\left[\left\lVert{\bm{\Xi}}({\bm{x}},{\bm{u}},s)-{\bm{\Xi}}({\bm{x}},\widehat{\bm{u}},s)\right\rVert^{2}\right]}\,ds \\
    &\leq L_{b^{*}}t\left\lVert{\bm{u}}-\widehat{\bm{u}}\right\rVert+L_{b^{*}}\left\lVert{\bm{u}}-\widehat{\bm{u}}\right\rVert\sqrt{L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2}}\int_{0}^{t}e^{\frac{3L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2}}{2}s}\,ds \\
    &= L_{b^{*}}\left(t+\frac{2\sqrt{L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2}}}{3L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2}}\left(e^{\frac{3L_{{\bm{f}}^{*}}+2L^{2}_{{\bm{g}}^{*}}}{2}t}-1\right)\right)\left\lVert{\bm{u}}-\widehat{\bm{u}}\right\rVert \\
    &\leq L_{b^{*}}\left(T+\frac{2\sqrt{L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2}}}{3L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2}}\left(e^{\frac{3L_{{\bm{f}}^{*}}+2L^{2}_{{\bm{g}}^{*}}}{2}T}-1\right)\right)\left\lVert{\bm{u}}-\widehat{\bm{u}}\right\rVert
    \end{align*}

Applying Lemma 13, the result follows. ∎
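
Remark (a sketch of where a constant such as $C_{1}$ can come from; the source does not state its value). The symbol $a$ below is illustrative shorthand and not part of the original notation. Assuming $L_{{\bm{f}}^{*}},L_{{\bm{g}}^{*}}\geq 0$ are not both zero and $0\leq t\leq T$, the state and action constants derived in the proof can be simplified with an elementary integral bound and a comparison of the exponents:

\begin{align*}
\frac{e^{aT}-1}{a} &= \int_{0}^{T}e^{as}\,ds\leq Te^{aT}\quad\text{for any } a>0, \\
\frac{2L_{{\bm{f}}^{*}}+L_{{\bm{g}}^{*}}^{2}}{2} &\leq L_{{\bm{f}}^{*}}+L_{{\bm{g}}^{*}}^{2},\qquad \frac{3L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2}}{2}\leq\tfrac{3}{2}\left(L_{{\bm{f}}^{*}}+L_{{\bm{g}}^{*}}^{2}\right), \\
\frac{2L_{b^{*}}}{2L_{{\bm{f}}^{*}}+L_{{\bm{g}}^{*}}^{2}}\left(e^{\frac{2L_{{\bm{f}}^{*}}+L_{{\bm{g}}^{*}}^{2}}{2}T}-1\right) &\leq L_{b^{*}}Te^{(L_{{\bm{f}}^{*}}+L_{{\bm{g}}^{*}}^{2})T}, \\
L_{b^{*}}\left(T+\frac{2\sqrt{L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2}}}{3L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2}}\left(e^{\frac{3L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2}}{2}T}-1\right)\right) &\leq L_{b^{*}}T\left(1+\sqrt{L_{{\bm{f}}^{*}}+2L_{{\bm{g}}^{*}}^{2}}\,e^{\frac{3}{2}(L_{{\bm{f}}^{*}}+L_{{\bm{g}}^{*}}^{2})T}\right).
\end{align*}

Both exponents are therefore at most $\tfrac{3}{2}(L_{{\bm{f}}^{*}}+L_{{\bm{g}}^{*}}^{2})T$. The remaining prefactors ($T$, $L_{b^{*}}$, $B$, and the square root) grow at most polynomially, which is presumably what the $\mathcal{O}$-notation in the corollary absorbs.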

Corollary 16 (Lipschitzness of ${\bm{\Phi}}^{*}$).

The unknown function ${\bm{\Phi}}^{*}$ is $L_{{\bm{\Phi}}}$-Lipschitz with $L_{{\bm{\Phi}}}=L_{{\bm{\Phi}}_{{\bm{f}}}}+L_{\Phi_{b}}=\mathcal{O}\left(e^{D(L_{{\bm{f}}^{*}}+L_{{\bm{g}}^{*}}^{2})T}\right)$, where $D$ is a constant.
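
Remark (a sketch; the source states this corollary without proof). Assume that ${\bm{\Phi}}^{*}$ stacks the state flow and the cost flow, ${\bm{\Phi}}^{*}=({\bm{\Phi}}_{{\bm{f}}^{*}},\Phi_{b^{*}})$, and that norms are Euclidean. Writing ${\bm{z}}=({\bm{x}},{\bm{u}},t)$ as illustrative shorthand (not part of the original notation), the additive form of the constant follows from $\sqrt{a^{2}+b^{2}}\leq a+b$ for $a,b\geq 0$:

\begin{align*}
\left\lVert{\bm{\Phi}}^{*}({\bm{z}})-{\bm{\Phi}}^{*}(\widehat{\bm{z}})\right\rVert &= \sqrt{\left\lVert{\bm{\Phi}}_{{\bm{f}}^{*}}({\bm{z}})-{\bm{\Phi}}_{{\bm{f}}^{*}}(\widehat{\bm{z}})\right\rVert^{2}+\left\lvert\Phi_{b^{*}}({\bm{z}})-\Phi_{b^{*}}(\widehat{\bm{z}})\right\rvert^{2}} \\
&\leq \left\lVert{\bm{\Phi}}_{{\bm{f}}^{*}}({\bm{z}})-{\bm{\Phi}}_{{\bm{f}}^{*}}(\widehat{\bm{z}})\right\rVert+\left\lvert\Phi_{b^{*}}({\bm{z}})-\Phi_{b^{*}}(\widehat{\bm{z}})\right\rvert \\
&\leq \left(L_{{\bm{\Phi}}_{{\bm{f}}}}+L_{\Phi_{b}}\right)\left\lVert{\bm{z}}-\widehat{\bm{z}}\right\rVert.
\end{align*}

If both constants are of order $e^{\mathrm{const}\cdot(L_{{\bm{f}}^{*}}+L_{{\bm{g}}^{*}}^{2})T}$, as Lemma 14 and Corollary 15 suggest, their sum is of the same order, and $D$ can be taken as the larger of the two exponent constants.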

A.2.3 Regret bound

Lemma 17 (Per episode regret bound (general sub-Gaussian noise)).

Consider the setting with a bounded number of switches $K$, and let Assumptions 1, 3, and 4 hold. Then, with probability at least $1-\delta$:

\begin{align*}
&V_{{\bm{\pi}}_{n},{\bm{\Phi}}^{*}}({\bm{x}}_{0},T,K)-V_{{\bm{\pi}}^{*},{\bm{\Phi}}^{*}}({\bm{x}}_{0},T,K) \\
&\quad\leq \mathcal{O}\left(L_{{\bm{\sigma}}}^{K-1}\beta_{n-1}^{K}e^{C(L_{{\bm{f}}^{*}}+L_{{\bm{g}}^{*}}^{2})(1+L_{{\bm{\pi}}})TK}\,\mathbb{E}\left[\sum_{k=0}^{K}\left\lVert{\bm{\sigma}}_{n-1}({\bm{x}}_{n,k},{\bm{\pi}}_{n}({\bm{x}}_{n,k},t_{n,k},k))\right\rVert_{2}\right]\right)
\end{align*}
Proof.

Applying Lemma 5 of Curi et al. (2020), the result follows. ∎

Theorem 18.

Consider the setting with a bounded number of switches $K$, and let Assumptions 1, 3, and 4 hold. Then, with probability at least $1-\delta$:

\begin{align*}
R_{N} &= \sum_{n=1}^{N}V_{{\bm{\pi}}_{n},{\bm{\Phi}}^{*}}({\bm{x}}_{0},T,K)-V_{{\bm{\pi}}^{*},{\bm{\Phi}}^{*}}({\bm{x}}_{0},T,K) \\
&\leq \mathcal{O}\left(L_{{\bm{\sigma}}}^{K-1}\beta_{N-1}^{K}\sqrt{K}e^{C(L_{{\bm{f}}^{*}}+L_{{\bm{g}}^{*}}^{2})(1+L_{{\bm{\pi}}})TK}\sqrt{N{\mathcal{I}}_{N}}\right)
\end{align*}
Proof.

We apply Lemma 17 and Cauchy-Schwarz:

\begin{align*}
R_{N} &= \sum_{n=1}^{N}V_{{\bm{\pi}}_{n},{\bm{\Phi}}^{*}}({\bm{x}}_{0},T,K)-V_{{\bm{\pi}}^{*},{\bm{\Phi}}^{*}}({\bm{x}}_{0},T,K) \\
&\leq \sum_{n=1}^{N}\mathcal{O}\left(L_{{\bm{\sigma}}}^{K-1}\beta_{n-1}^{K}e^{C(L_{{\bm{f}}^{*}}+L_{{\bm{g}}^{*}}^{2})(1+L_{{\bm{\pi}}})TK}\,\mathbb{E}\left[\sum_{k=0}^{K}\left\lVert{\bm{\sigma}}_{n-1}({\bm{x}}_{n,k},{\bm{\pi}}_{n}({\bm{x}}_{n,k},t_{n,k},k))\right\rVert_{2}\right]\right) \\
&\leq \mathcal{O}\left(L_{{\bm{\sigma}}}^{K-1}\beta_{N-1}^{K}e^{C(L_{{\bm{f}}^{*}}+L_{{\bm{g}}^{*}}^{2})(1+L_{{\bm{\pi}}})TK}\right)\mathbb{E}\left[\sum_{n=1}^{N}\sum_{k=0}^{K}\left\lVert{\bm{\sigma}}_{n-1}({\bm{x}}_{n,k},{\bm{\pi}}_{n}({\bm{x}}_{n,k},t_{n,k},k))\right\rVert_{2}\right] \\
&\leq \mathcal{O}\left(L_{{\bm{\sigma}}}^{K-1}\beta_{N-1}^{K}e^{C(L_{{\bm{f}}^{*}}+L_{{\bm{g}}^{*}}^{2})(1+L_{{\bm{\pi}}})TK}\right)\sqrt{K}\sqrt{N{\mathcal{I}}_{N}}
\end{align*}

Here we first applied Lemma 17. Then we used the monotonicity of the sequence $(\beta_{n})_{n\geq 0}$. In the last step, we first took the maximum over the collected data, then applied the Cauchy-Schwarz inequality, and finally used the definition of model complexity. ∎
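
Remark (a sketch of the last step, under stated assumptions). The shorthand ${\bm{z}}_{n,k}=({\bm{x}}_{n,k},{\bm{\pi}}_{n}({\bm{x}}_{n,k},t_{n,k},k))$ is introduced here only for illustration. Assume that the model complexity ${\mathcal{I}}_{N}$ upper-bounds the sum of squared uncertainties $\sum_{n=1}^{N}\sum_{k=0}^{K}\lVert{\bm{\sigma}}_{n-1}({\bm{z}}_{n,k})\rVert_{2}^{2}$ for any collected data; its exact definition is given elsewhere in the paper and is not reproduced here. For $K\geq 1$, the Cauchy-Schwarz step can then be spelled out as:

\begin{align*}
\sum_{n=1}^{N}\sum_{k=0}^{K}\left\lVert{\bm{\sigma}}_{n-1}({\bm{z}}_{n,k})\right\rVert_{2} &\leq \sqrt{N(K+1)}\sqrt{\sum_{n=1}^{N}\sum_{k=0}^{K}\left\lVert{\bm{\sigma}}_{n-1}({\bm{z}}_{n,k})\right\rVert_{2}^{2}} \\
&\leq \sqrt{N(K+1)\,{\mathcal{I}}_{N}} \\
&\leq \sqrt{2K}\sqrt{N{\mathcal{I}}_{N}}.
\end{align*}

Under this assumption the bound holds for every realization of the data, so it also holds in expectation, which yields the $\sqrt{K}\sqrt{N{\mathcal{I}}_{N}}$ factor up to a constant.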