When to Sense and Control?
A Time-adaptive Approach for Continuous-Time RL
Abstract
Reinforcement learning (RL) excels in optimizing policies for discrete-time Markov decision processes (MDPs). However, various systems are inherently continuous in time, making discrete-time MDPs an inexact modeling choice. In many applications, such as greenhouse control or medical treatments, each interaction (measurement or switching of action) involves manual intervention and thus is inherently costly. Therefore, we generally prefer a time-adaptive approach with fewer interactions with the system. In this work, we formalize an RL framework, Time-adaptive Control & Sensing (TaCoS), that tackles this challenge by optimizing over policies that, besides the control, also predict the duration of its application. Our formulation results in an extended MDP that any standard RL algorithm can solve. We demonstrate that state-of-the-art RL algorithms trained on TaCoS drastically reduce the number of interactions compared to their discrete-time counterparts while retaining the same or improved performance, and exhibit robustness to the discretization frequency. Finally, we propose OTaCoS, an efficient model-based algorithm for our setting. We show that OTaCoS enjoys sublinear regret for systems with sufficiently smooth dynamics and empirically results in further sample-efficiency gains.
1 Introduction
Nearly all state-of-the-art RL algorithms (Schulman et al., 2017; Haarnoja et al., 2018; Lillicrap et al., 2015; Schulman et al., 2015) were developed for discrete-time MDPs. Nevertheless, continuous-time systems are ubiquitous in nature, with examples ranging from robotics and biology to medicine, the environment, and sustainability (cf. Spong et al., 2006; Jones et al., 2009; Lenhart and Workman, 2007; Panetta and Fister, 2003; Turchetta et al., 2022). Such systems can be naturally modeled with stochastic differential equations (SDEs), but computational approaches necessitate discretization. Furthermore, in many applications, obtaining measurements and switching actions is expensive. For instance, consider a greenhouse of fruits or medical treatment recommendations. In both cases, each measurement (crop inspection, medical exam) or switching of actions (climate control, treatment adjustment) typically involves costly human intervention. Hence, minimizing such interactions with the underlying system is desirable. This underlying challenge is rarely addressed in the RL literature.
In practice, a time-equidistant discretization frequency is set, often manually, adjusted to the underlying system’s characteristic time scale. This is challenging, however, especially for unknown or uncertain systems, and for systems with multiple dominant time scales (Engquist et al., 2007). Therefore, for many real-world applications, having a single global frequency of control is inadequate and wasteful. For example, in medicine, patient monitoring often requires higher-frequency interaction during the onset of illness and lower-frequency interaction as the patient recovers (Kaandorp and Koole, 2007).
In this work, we address this limitation of standard RL methods and propose a novel RL framework, Time-adaptive Control & Sensing (TaCoS). TaCoS reduces a general continuous-time RL problem with underlying SDE dynamics to an equivalent discrete-time MDP that can be solved with any RL algorithm, including standard policy gradient methods like PPO and SAC (Schulman et al., 2017; Haarnoja et al., 2018). We summarize our contributions below.
Contributions
1. We reformulate the problem of time-adaptive continuous-time RL as an equivalent discrete-time MDP that can be solved with standard RL algorithms.
2. Using our formulation, we extend standard policy gradient techniques (Haarnoja et al., 2018; Schulman et al., 2017) to the time-adaptive setting. Our empirical results on standard RL benchmarks (Freeman et al., 2021) show that TaCoS outperforms its discrete-time counterpart in terms of policy performance, computational cost, and sample efficiency.
3. To further improve sample efficiency, we propose a model-based RL algorithm, OTaCoS. OTaCoS uses well-calibrated probabilistic models to capture epistemic uncertainty and, similar to Curi et al. (2020) and Treven et al. (2023), leverages the principle of optimism in the face of uncertainty to guide exploration during learning. We theoretically prove that OTaCoS suffers no regret and empirically demonstrate its sample efficiency.
2 Problem statement
We consider a general nonlinear continuous-time dynamical system with continuous state space $\mathcal{X}$ and action space $\mathcal{U}$. The underlying dynamics are governed by a (controllable) SDE:
$$dx_t = f^*(x_t, u_t)\,dt + g^*(x_t, u_t)\,dB_t. \tag{1}$$
Here $x_t \in \mathcal{X}$ is the state at time $t$, $u_t \in \mathcal{U}$ the control input, $f^*$ and $g^*$ are unknown drift and diffusion functions, and $B_t$ is a standard Brownian motion. Our goal is to find a control policy $\pi$ which maximizes an unknown reward $b^*$ over a fixed horizon $T$, i.e.,
$$\max_{\pi \in \Pi} \; \mathbb{E}\left[\int_0^T b^*(x_t, \pi(x_t))\,dt\right],$$
where the expectation is taken w.r.t. the policy and stochastic dynamics and $\Pi$ is the class of policies over which we search (we assume that $\Pi$ is a set of Lipschitz-continuous policies).
In practice, we can only measure the system state and execute control policies at discrete points in time. In this work, we focus on problems where state measurement and control are synchronized in time. We refer to these synchronized time points as interactions in the remainder of this paper. Synchronizing state measurement and control contrasts with standard time-adaptive approaches such as event-triggered control (Heemels et al., 2021), where the state is measured at an arbitrarily high frequency and control inputs are changed only as often as needed to ensure stability. It also contrasts with the complementary setting, where control inputs change at an arbitrarily high frequency but measurements are collected adaptively in time (Treven et al., 2023). An adaptive control approach such as that of Heemels et al. (2021) is very important for many real-world applications, but an adaptive measurement strategy is similarly crucial for efficient learning in RL (Treven et al., 2023). Our approach treats both of these requirements jointly.
We consider two different scenarios for continuous-time control: (i) penalizing each interaction with a cost, and (ii) bounding the number of interactions, i.e., imposing a hard constraint on the number of control/measurement steps.
Interaction cost
We consider the setting where every interaction has an inherent cost. We use a simple cost structure here, but TaCoS also works for more general cost functions that depend on the duration for which the action is applied or on the previous action, and thus captures many practical real-world settings. We define this task more formally below.
(2) | ||||
Here is the minimal duration for which we have to apply the control, the maximum duration, and is a policy that predicts the duration of applying the action.
Bounded number of interactions
In this setting, the number of interactions with the system is limited by a known amount . Intuitively, this represents a scenario where we have a finite budget for the inputs that we can apply and have to decide on the best strategy to space these inputs over the full horizon. A formal definition of this task is given below
(3) | ||||
In the absence of the interaction costs or the bound on the number of interactions, the policy would intuitively interact with the system as frequently as possible, i.e., once every minimal action duration. The additional costs/constraints ensure that we do not converge to this trivial (but unrealistic) solution.
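To make the two settings concrete, the following minimal sketch checks a candidate sequence of action durations against the constraints described above and evaluates the penalized objective. The names `t_min`, `t_max`, `horizon`, `cost`, and `max_interactions` are illustrative assumptions, as are the requirement that the durations exactly tile the horizon and all numbers in the example.

```python
from typing import Sequence


def is_valid_schedule(durations: Sequence[float], horizon: float,
                      t_min: float, t_max: float, tol: float = 1e-9) -> bool:
    """Check that each hold duration lies in [t_min, t_max] and that they tile the horizon."""
    in_bounds = all(t_min - tol <= d <= t_max + tol for d in durations)
    tiles_horizon = abs(sum(durations) - horizon) <= tol
    return in_bounds and tiles_horizon


def penalized_return(integrated_rewards: Sequence[float], cost: float) -> float:
    """Interaction-cost setting: total reward minus a constant cost per interaction."""
    return sum(integrated_rewards) - cost * len(integrated_rewards)


def within_budget(durations: Sequence[float], max_interactions: int) -> bool:
    """Bounded setting: the number of interactions must not exceed the budget."""
    return len(durations) <= max_interactions


# Two schedules over a 10 s horizon (hypothetical numbers).
equidistant = [0.5] * 20                     # interact every 0.5 s
adaptive = [0.5, 0.5, 1.0, 2.0, 3.0, 3.0]    # frequent early, sparse later

print(is_valid_schedule(equidistant, 10.0, 0.5, 5.0), within_budget(equidistant, 8))
print(is_valid_schedule(adaptive, 10.0, 0.5, 5.0), within_budget(adaptive, 8))
```

Both schedules respect the duration bounds, but only the adaptive one fits a budget of eight interactions, which is the kind of trade-off the two formulations are meant to capture.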
3 TaCoS: Time-adaptive Control & Sensing
In the following, we reformulate the continuous-time problem as an equivalent discrete-time MDP. We first denote the state and running reward flows of Equation 1. The state flow by applying action for time reads:
We assume that every time we interact with the system, we also obtain the integrated reward and define the reward flow as
(4) |
Due to the stochasticity of , the state flow and the reward flow are stochastic. For ease of notation, we denote
and the concatenated state and reward flow function, and noise as:
(5) |
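The state and reward flows can be approximated numerically. The sketch below holds a single action constant for a given duration and integrates a scalar SDE with the Euler-Maruyama scheme, accumulating the running reward with a left-endpoint Riemann sum. The function name `simulate_flow`, the scalar state, and the step size `dt` are assumptions made for illustration; the drift, diffusion, and reward callables stand in for the unknown functions of the system.

```python
import math
import random
from typing import Callable, Optional, Tuple


def simulate_flow(x: float, u: float, tau: float,
                  drift: Callable[[float, float], float],
                  diffusion: Callable[[float, float], float],
                  reward: Callable[[float, float], float],
                  dt: float = 1e-3,
                  rng: Optional[random.Random] = None) -> Tuple[float, float]:
    """Apply the constant action u for duration tau.

    Returns the state reached after tau and the reward integrated over the hold.
    """
    rng = rng or random.Random(0)
    n_steps = max(1, round(tau / dt))
    h = tau / n_steps                      # actual step so that n_steps * h == tau
    integrated_reward = 0.0
    for _ in range(n_steps):
        integrated_reward += reward(x, u) * h
        dB = rng.gauss(0.0, math.sqrt(h))  # Brownian increment over a step of length h
        x = x + drift(x, u) * h + diffusion(x, u) * dB
    return x, integrated_reward


# Noisy linear system held at u = 0 for one second (hypothetical example).
x_next, b = simulate_flow(
    x=1.0, u=0.0, tau=1.0,
    drift=lambda x, u: -x + u,
    diffusion=lambda x, u: 0.1,
    reward=lambda x, u: -x * x,
)
print(x_next, b)
```

Because of the Brownian increments, both returned quantities are random, which matches the observation that the state flow and the reward flow are stochastic.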
In this work, we search for policies that return the next control we apply and also the time for how long to apply the control.
3.1 Reformulation of Interaction Cost setting to Discrete-time MDPs
We convert the problem with interaction costs to a standard MDP which any RL algorithm for continuous state-action spaces can solve. To this end, we restrict ourselves to a policy class:
For simplicity, we denote by the component of the policy that predicts the duration of applying the action. The policies we consider map state and time-to-go to control and the time for how long we apply the action . We define the augmented state , where is the state, integrated reward and time-to-go. With the introduced notation we arrive at a discrete-time MDP problem formulation
(6) | ||||
where we have
3.2 Reformulation of Bounded Number of Interactions to Discrete-time MDPs
The second setting is similar to the one studied by Ni and Jang (2022). In this case, we consider the following class of policies:
For an augmented state , our policies map states , time-to-go , number of past interactions to a controller and the time duration for applying the action. Here the optimal control problem reads
(7) | ||||
where,
In the following, we provide a simple proposition which shows that our reformulated problem is equivalent to its continuous-time counterpart from Section 2.
Proposition 1.
The problems in Equations 2 and 3 are equivalent to those in Equations 6 and 7, respectively.
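The reformulated MDPs can be sketched as a single transition function on the augmented state. This is an illustrative sketch, not the reference implementation: the class `AugmentedState`, the function `tacos_step`, the clipping of the duration to the admissible range and to the time-to-go, and the rule that the last allowed interaction holds its action until the end of the horizon are all assumptions. The `flow` argument is any function that returns the next state and the integrated reward.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

Flow = Callable[[float, float, float], Tuple[float, float]]  # (x, u, tau) -> (x_next, reward)


@dataclass(frozen=True)
class AugmentedState:
    x: float                                 # system state
    integrated_reward: float                 # reward collected during the last hold
    time_to_go: float                        # remaining time in the episode
    interactions_left: Optional[int] = None  # only used in the bounded setting


def tacos_step(s: AugmentedState, u: float, tau: float, flow: Flow,
               t_min: float, t_max: float,
               cost: float = 0.0) -> Tuple[AugmentedState, float, bool]:
    """One MDP transition: apply u for tau, pay the interaction cost, update time-to-go."""
    if s.interactions_left is not None and s.interactions_left <= 0:
        raise ValueError("interaction budget exhausted")
    # Keep the duration admissible and never run past the horizon.
    tau = min(max(tau, t_min), t_max)
    tau = min(tau, s.time_to_go)
    # Assumption: the last allowed interaction holds its action until the horizon ends.
    if s.interactions_left == 1:
        tau = s.time_to_go
    x_next, b = flow(s.x, u, tau)
    left = None if s.interactions_left is None else s.interactions_left - 1
    next_state = AugmentedState(x_next, b, s.time_to_go - tau, left)
    done = next_state.time_to_go <= 1e-9
    return next_state, b - cost, done


def toy_flow(x: float, u: float, tau: float) -> Tuple[float, float]:
    """Deterministic toy flow (hypothetical): integrator dynamics with reward rate -|x|."""
    return x + u * tau, -abs(x) * tau


s0 = AugmentedState(x=1.0, integrated_reward=0.0, time_to_go=1.0)
s1, r1, done1 = tacos_step(s0, u=-1.0, tau=0.25, flow=toy_flow, t_min=0.1, t_max=0.5, cost=0.1)
print(s1, r1, done1)
```

Setting `cost` to a positive value with `interactions_left=None` corresponds to the interaction-cost setting, while `cost=0.0` with a finite `interactions_left` corresponds to the bounded setting. In either case a standard RL algorithm only sees an ordinary discrete-time transition.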
Figure 1 depicts the influence of interaction cost and on the controller’s performance for the pendulum environment.
4 TaCoS with Model-free RL Algorithms
We now illustrate the performance of TaCoS on several well-studied robotic RL tasks. We consider the RC car (Kabzan et al., 2020), Greenhouse (Tap, 2000), Pendulum, Reacher, Halfcheetah and Humanoid environments from Brax (Freeman et al., 2021). Thus our experiments range from environments necessitating time-adaptive control, like the Greenhouse, to a realistic and highly dynamic race car simulation and a very high-dimensional RL task like the Humanoid. (We provide our implementation at https://github.com/lasgroup/TaCoS.)
We investigate both the bounded number of interactions and interaction cost settings in our experiments. In particular, we study how the bound on the number of interactions affects the performance of TaCoS and compare it to the standard equidistant baseline. We further study the interplay between the stochasticity of the environments (the magnitude of the diffusion term) and interaction costs, and the influence of the minimal action duration on TaCoS. For all experiments in this section, we combine SAC with TaCoS (SAC-TaCoS).
How does the bound on the number of interactions affect TaCoS?
We analyze the bounded number of interactions setting (cf. Section 3.2) of TaCoS, studying the relationship between the number of interactions and the achieved episode reward. We compare our algorithm with the standard equidistant time discretization approach which splits the whole horizon into discrete time steps at which an interaction takes place. We evaluate the two methods in the greenhouse and pendulum environments. For the pendulum, we consider the swing-up and swing-down tasks. The results are reported in Figure 2. The time-adaptive approach performs significantly better than the standard equidistant time discretization. This is particularly the case for the greenhouse and pendulum swing-down tasks. Both tasks involve driving the system to a stable equilibrium and thus, while high-frequency interaction might be necessary at the initial stages, a fairly low interaction frequency can be maintained when the system has reached the equilibrium state. This demonstrates the practical benefits of time-adaptive control.
How does the interaction cost magnitude influence TaCoS?
We investigate the setting from Section 3.1 with interaction costs. In our experiments, we always pick a constant cost per interaction. We study the influence of this cost on the episode reward and on the number of interactions that the policy has with the system within an episode. We again evaluate this on the greenhouse and pendulum environments. For the pendulum, we consider the swing-up task. The results are presented in the first row of Figure 3. Noticeably, increasing the cost reduces the number of interactions. The decrease is drastic for the greenhouse environment since it can be controlled with considerably fewer interactions without any effect on the performance. Generally, we observe that decreasing the number of interactions, that is, increasing the cost, also results in a slight decline in episode reward.
How does environment stochasticity influence the number of interactions?
We analyze the influence of the environment’s stochasticity, i.e., the magnitude of the diffusion term, on the episode reward and number of interactions of TaCoS. Intuitively, the more stochastic the environment, the more interactions we would require to stabilize the system. We again evaluate our method on the greenhouse and pendulum swing-up tasks. The results are reported in the second row of Figure 3. The results verify our intuition that more stochasticity in the environment generally leads to more interactions. However, we observe that the policy is still able to achieve high rewards for a wide range of diffusion magnitudes. This showcases the robustness and adaptability of TaCoS to stochastic environments.
How does the minimal action duration influence TaCoS?
As highlighted in Section 1, picking the right discretization for interactions is a challenging task. We show that TaCoS can naturally alleviate this issue and adaptively pick the frequency of interaction while also being more computationally and data-efficient. Moreover, we show that TaCoS is robust to the choice of the minimal duration for which an action has to be applied; the inverse of this duration is the highest frequency at which we can control the system. In this experiment, we consider SAC-TaCoS and compare it to the standard SAC algorithm. TaCoS adaptively picks the number of interactions and therefore, during an episode of fixed physical duration, it effectively collects less data than the standard discrete-time RL algorithm. (A standard RL algorithm would collect one data point per minimal-duration step, i.e., the episode duration divided by the minimal duration.) This makes comparison to the discrete-time setting challenging, since environment interactions and physical time on the environment are not linearly related for TaCoS, as opposed to the standard discrete-time setting. Nevertheless, to be fair to the discrete-time method, we give SAC more physical time on the system for all environments, effectively resulting in the collection of more data for learning. Since the standard SAC algorithm updates the policy in proportion to the amount of data, we consider a version of SAC, SAC-MC (SAC more compute), which leverages the additional data it collects to perform more gradient updates. This version essentially performs more policy updates than SAC-TaCoS and thus is computationally more expensive. Furthermore, to demonstrate the generality of our framework, we also combine TaCoS with PPO (PPO-TaCoS).
We report the performance after convergence across different minimal durations in the first row of Figure 4. From our experiment, we conclude that SAC-TaCoS and PPO-TaCoS are robust to the choice of the minimal duration and perform equally well when it is decreased, i.e., when the frequency is increased. This is in contrast to the standard RL methods, which have a significant drop in performance at high frequencies. This observation is also made in prior work (Hafner et al., 2019). Crucially, this highlights the sensitivity of the standard RL methods to the frequency of interaction. In the second row of Figure 4 we show the learning curves of the methods for one specific frequency. From the curves, we conclude that SAC-TaCoS achieves higher rewards with significantly less physical time on the environment. We believe this is because our method explores more efficiently (akin to Dabney et al., 2020; Eberhard et al., 2022), and also learns a much stronger, continuous-time representation of the underlying MDP.
Interestingly, at the default frequency used in the benchmarks, all methods perform similarly. However, slightly decreasing the frequency already leads to a drastic drop in performance for all methods. Intuitively, decreasing the frequency prevents us from performing the necessary fine-grained control and obtaining the highest performance.
While we have access to the optimal frequency for these benchmarks, for a general and unknown system it is very difficult to estimate this frequency. Furthermore, as we observe in our experiments, picking a very high frequency is also not an option when using standard RL algorithms. We believe this is where TaCoS excels as it adaptively picks the frequency of interaction, thereby relieving the problem designer of this decision.
5 Efficient Exploration for TaCoS via Model-Based RL
In this section, we propose a novel model-based RL algorithm for TaCoS called Optimistic TaCoS (OTaCoS). We analyze the episodic setting, where we interact with the system over a sequence of episodes. In each episode, we execute the current policy, collect the state measurements and integrated rewards, and add the resulting input-output pairs to a dataset. From the dataset we build a model for the unknown flow such that it is well-calibrated in the sense of the following definition.
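Definition 1 (Well-calibrated statistical model of $\Phi^*$, Rothfuss et al. (2023)).
Let $\mathcal{Z}$ denote the input space of the unknown flow $\Phi^*$ (state, action, and duration). We assume $\Phi^* \in \bigcap_{n \ge 0} \mathcal{M}_n(\delta)$ with probability at least $1 - \delta$, where the statistical model $\mathcal{M}_n(\delta)$ is defined as
$$\mathcal{M}_n(\delta) = \left\{ \Phi \;\middle|\; \forall z \in \mathcal{Z},\ \forall j:\ |\mu_{n,j}(z) - \Phi_j(z)| \le \beta_n(\delta)\,\sigma_{n,j}(z) \right\}.$$
Here, $\mu_{n,j}$ and $\sigma_{n,j}$ denote the $j$-th element in the vector-valued mean and standard deviation functions $\mu_n$ and $\sigma_n$ respectively, and $\beta_n(\delta)$ is a scalar function that depends on the confidence level $\delta$ and which is monotonically increasing in $n$.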
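The calibration condition can be checked pointwise. The sketch below assumes the usual elementwise form of the condition, namely that each output of the true function lies within a scaled standard deviation of the predicted mean. The function names and the scaling factor `beta` are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def in_confidence_set(true_values: Sequence[float], means: Sequence[float],
                      stds: Sequence[float], beta: float) -> bool:
    """True if every output dimension satisfies |mean_j - f_j| <= beta * std_j."""
    return all(abs(m - f) <= beta * s for f, m, s in zip(true_values, means, stds))


def confidence_bounds(means: Sequence[float], stds: Sequence[float],
                      beta: float) -> Tuple[List[float], List[float]]:
    """Elementwise lower and upper bounds of the confidence set at one input."""
    lower = [m - beta * s for m, s in zip(means, stds)]
    upper = [m + beta * s for m, s in zip(means, stds)]
    return lower, upper


# Hypothetical prediction for a two-dimensional output (next state, integrated reward).
means, stds = [1.1, 1.9], [0.1, 0.1]
print(in_confidence_set([1.0, 2.0], means, stds, beta=2.0))
print(confidence_bounds(means, stds, beta=2.0))
```

A larger `beta` widens the set, which makes calibration easier to satisfy but leaves more plausible functions for the planner to consider.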
Similar to model-based RL algorithms for the discrete-time setting (Kakade et al., 2020; Curi et al., 2020; Sukhija et al., 2024), we follow the principle of optimism in the face of uncertainty and select the policy for both settings of TaCoS (cf. Sections 3.1 and 3.2) by solving:
(8) |
where is the appropriate policy class from Section 3. Running OTaCoS for episodes, we measure the performance via the regret:
Here is the optimal policy from the class of policies we optimize over. Any kind of regret bound requires certain assumptions on the regularity of the underlying dynamics (1).
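The optimistic selection rule can be illustrated with a heavily simplified sketch: a finite set of candidate policies and a finite set of plausible flows stand in for the policy class and the calibrated model set, and the policy with the best return under its most favourable plausible flow is chosen. All names and the toy flows are hypothetical, and the exhaustive search is only a stand-in for the joint optimization over policies and plausible dynamics.

```python
from typing import Callable, Sequence, Tuple

Policy = Callable[[float, float], Tuple[float, float]]       # (x, time_to_go) -> (u, tau)
Flow = Callable[[float, float, float], Tuple[float, float]]  # (x, u, tau) -> (x_next, reward)


def rollout_return(policy: Policy, flow: Flow, x0: float,
                   horizon: float, cost: float) -> float:
    """Return of one episode under a given flow, with a constant cost per interaction."""
    x, t_left, total = x0, horizon, 0.0
    while t_left > 1e-9:
        u, tau = policy(x, t_left)
        if tau <= 0.0:
            raise ValueError("durations must be positive")
        tau = min(tau, t_left)
        x, b = flow(x, u, tau)
        total += b - cost
        t_left -= tau
    return total


def select_optimistic(policies: Sequence[Policy], plausible_flows: Sequence[Flow],
                      x0: float, horizon: float, cost: float) -> Tuple[int, float]:
    """Pick the policy whose best-case return over the plausible flows is largest."""
    best_idx, best_val = -1, float("-inf")
    for i, pi in enumerate(policies):
        val = max(rollout_return(pi, f, x0, horizon, cost) for f in plausible_flows)
        if val > best_val:
            best_idx, best_val = i, val
    return best_idx, best_val


# Two plausible flows that disagree about the sign of the reward (hypothetical).
flows = [lambda x, u, tau: (x, u * tau), lambda x, u, tau: (x, -u * tau)]
policies = [lambda x, t: (1.0, 0.5), lambda x, t: (-2.0, 1.0)]
print(select_optimistic(policies, flows, x0=0.0, horizon=2.0, cost=0.1))
```

In this toy example the second policy is selected because one plausible flow makes it look very good, which is exactly the behaviour that drives exploration: executing it reveals which flow is correct.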
Assumption 1 (Dynamics model).
Given any norm , we assume that the drift , and diffusion are and -Lipschitz continuous, respectively, with respect to the induced metric. We further assume .
Assumption 1 ensures the existence of a solution to the SDE (1) under the policy. To provide bounds on the performance of OTaCoS for the settings of Sections 3.1 and 3.2, we also need some assumptions on the noise and reward model.
Assumption 2 (Reward and noise model for Section 3.1 Setting).
Given any norm , we assume that running reward is -Lipschitz continuous, with respect to the induced metric. We further assume boundedness of the reward , and interaction cost . The dynamics noise is independent and follows: .
Assumption 3 (Reward and noise model for Section 3.2 Setting).
Given any norm, we assume that the running reward is Lipschitz continuous with respect to the induced metric. The measurement noise is independent and sub-Gaussian.
Finally, we assume that we learn a well-calibrated model of the unknown flow .
Assumption 4 (Well calibration assumption).
Our learned model is an all-time-calibrated statistical model of , i.e., there exists an increasing sequence of such that our model satisfies the well-calibration condition, cf., Definition 1.
Analogous assumptions are made for model-based RL algorithms in the discrete-time setting (Curi et al., 2020; Sukhija et al., 2024). This calibration assumption is satisfied if the unknown flow can be represented with Gaussian Process (GP) (Williams and Rasmussen, 2006; Kirschner and Krause, 2018) models.
Theorem 2.
Consider the setting from Section 3.1 and let Assumptions 1, 2, and 4 hold. Then we have with probability at least :
Now consider the setting with a bounded number of switches , and let Assumptions 1, 3, and 4 hold. Then, we get with probability at least
where is a constant. Here, with we denote the model-complexity after observing points (Curi et al.,, 2020), which quantifies the difficulty of learning . For GPs, it behaves similar to the maximum information gain (Srinivas et al.,, 2009), i.e., implying sublinear regret for several common kernels (Vakili et al.,, 2021).
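The notion of sublinear regret can be made tangible with a small computation. The sketch below accumulates the gap between the optimal return and the return achieved in each episode, and reports the average regret per episode, which tends to zero for a no-regret algorithm. The function names and the example returns are hypothetical and are not taken from the experiments.

```python
from typing import List, Sequence


def cumulative_regret(optimal_return: float, episode_returns: Sequence[float]) -> List[float]:
    """Running sum of the gaps between the optimal return and each episode's return."""
    total, out = 0.0, []
    for j in episode_returns:
        total += optimal_return - j
        out.append(total)
    return out


def average_regret(optimal_return: float, episode_returns: Sequence[float]) -> List[float]:
    """Cumulative regret divided by the number of episodes so far."""
    cum = cumulative_regret(optimal_return, episode_returns)
    return [r / (n + 1) for n, r in enumerate(cum)]


# Hypothetical returns of a learner that approaches an optimal return of 10.
returns = [4.0, 7.0, 9.0, 10.0]
print(cumulative_regret(10.0, returns))
print(average_regret(10.0, returns))
```

If the cumulative regret grows more slowly than the number of episodes, the average regret shrinks, so the executed policies approach the performance of the optimal policy.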
As a proof of concept, we evaluate OTaCoS on the pendulum and RC car environments for the interaction cost setting. (The code is available at https://github.com/lasgroup/model-based-rl.) As baselines, we adapt common model-based RL methods such as PETS (Chua et al., 2018) and planning with the mean to TaCoS. We call them PETS-TaCoS and Mean-TaCoS, respectively. The result is reported in Figure 5. From the figure, we conclude that OTaCoS is more sample efficient than other model-based baselines and SAC-TaCoS (SAC-TaCoS requires circa episodes for the pendulum and for the RC car).
6 Related Work
Similar to this work, Holt et al. (2023), Ni and Jang (2022), and Karimi (2023) consider continuous-time deterministic dynamical systems where the measurements or control input changes can only happen at discrete time steps. Moreover, Holt et al. (2023) propose a problem similar to ours from Section 3.1, where they specify a cost on the number of interactions. However, their solution is based on a heuristic, where a measurement is taken when the variance of the potential reward surpasses a prespecified threshold. In contrast, we tackle the problem directly and propose a general framework for time-adaptive control that does not rely on any heuristics. Karimi (2023) adapts SAC (Haarnoja et al., 2018) to include a regularization term, which effectively adds a cost for every discrete interaction. Ni and Jang (2022) induce a soft constraint on the duration of each action in the environment. However, all the aforementioned works propose heuristic techniques to minimize interactions, whereas we formalize the problem systematically for the more general case of SDEs and show that it has an underlying MDP structure that any RL algorithm can leverage. In addition, we propose a no-regret model-based RL algorithm for this setting and analyze its sample complexity.
Temporal abstractions are also considered in the framework of options (Sutton et al., 1999; Mankowitz et al., 2014; Mann and Mannor, 2014; Harb et al., 2018). However, a key difference to TaCoS is that in the options framework, the agent measures the state even between the controller switches.
Learning to repeat actions
Several works observe that repeating actions in discrete-time MDP problems such as Atari (Mnih et al., 2013; Braylan et al., 2015) or Cartpole (Hafner et al., 2019) significantly increases the speed of learning. However, the action repeat is fixed throughout the entire rollout and treated as a hyperparameter. Durugkar et al. (2016); Vezhnevets et al. (2016); Srinivas et al. (2017); Sharma et al. (2017); Lee et al. (2020); Grigsby et al. (2021); Chen et al. (2021); Yu et al. (2021); Biedenkapp et al. (2021); Krale et al. (2023) automate the selection of the action repeat and show superior performance over the fixed-repeat setting. Dabney et al. (2020) empirically show that repeating actions helps with exploration, with an effect similar to the one that colored noise exploration has over standard white noise exploration (Eberhard et al., 2022).
Continuous-time RL
Following the seminal work of Doya (2000) and the advances in Neural ODEs of Chen et al. (2018), continuous-time RL has regained interest (Cranmer et al., 2020; Greydanus et al., 2019; Yildiz et al., 2021; Lutter et al., 2021). Moreover, modeling in continuous time is found to be particularly useful when learning from different data sources where each source is collected at a different frequency (Burns et al., 2023; Zheng et al., 2023). An important line of work exists for modeling continuous dynamics for the case when states and actions are discrete, called Markov Jump Processes (Kallianpur and Sundar, 2014; Berger, 1993; Huang et al., 2019; Seifner and Sanchez, 2023). Another line of work that is close to ours is event- and self-triggered control (Astrom and Bernhardsson, 2002; Anta and Tabuada, 2010; Heemels et al., 2012, 2021), which models continuous-time control systems by changing the input only when stability is at risk, ensuring efficient and timely interventions. Treven et al. (2023) propose a no-regret continuous-time model-based RL algorithm, which, akin to OTaCoS, performs optimistic exploration. They study the problem where controls can be executed continuously in time and propose adaptive measurement selection strategies. Similarly, we propose a novel model-based RL algorithm, OTaCoS, based on the principle of optimism in the face of uncertainty. We show that OTaCoS has no regret for sufficiently smooth dynamics and has considerable sample-efficiency gains over its model-free counterpart.
7 Conclusion and discussion
We study the problem of time-adaptive RL for continuous-time systems with continuous state and action spaces. We investigate two practical settings: one where each interaction has an inherent cost, and one where we have a hard constraint on the number of interactions. We propose a novel RL framework, TaCoS, and show that both of these settings result in extended MDPs which can be solved with standard RL algorithms. In our experiments, we show that combining standard RL algorithms with TaCoS results in a significant reduction in the number of interactions without any effect on the performance for the interaction cost setting. Furthermore, for the second setting, TaCoS achieves considerably better control performance despite having a small budget for the number of interactions. Moreover, we show that TaCoS improves robustness to a large range of interaction frequencies, and generally improves the sample complexity of learning. Finally, we propose OTaCoS, a no-regret model-based RL algorithm for TaCoS, and show that it has further sample-efficiency gains.
Acknowledgments and Disclosure of Funding
This project has received funding from the Swiss National Science Foundation under NCCR Automation, grant agreement 51NF40 180545, and the Microsoft Swiss Joint Research Center.
References
- Anta and Tabuada, (2010) Anta, A. and Tabuada, P. (2010). To sample or not to sample: Self-triggered control for nonlinear systems. IEEE Transactions on automatic control, 55(9):2030–2042.
- Astrom and Bernhardsson, (2002) Astrom, K. J. and Bernhardsson, B. M. (2002). Comparison of riemann and lebesgue sampling for first order stochastic systems. In Proceedings of the 41st IEEE Conference on Decision and Control, 2002., volume 2, pages 2011–2016. IEEE.
- Berger, (1993) Berger, M. A. (1993). Markov Jump Processes, pages 121–138. Springer New York, New York, NY.
- Biedenkapp et al., (2021) Biedenkapp, A., Rajan, R., Hutter, F., and Lindauer, M. (2021). Temporl: Learning when to act. In International Conference on Machine Learning, pages 914–924. PMLR.
- Bobkov and Götze, (1999) Bobkov, S. G. and Götze, F. (1999). Exponential integrability and transportation cost related to logarithmic sobolev inequalities. Journal of Functional Analysis, 163(1):1–28.
- Braylan et al., (2015) Braylan, A., Hollenbeck, M., Meyerson, E., and Miikkulainen, R. (2015). Frame skip is a powerful parameter for learning to play atari. In Workshops at the twenty-ninth AAAI conference on artificial intelligence.
- Burns et al., (2023) Burns, K., Yu, T., Finn, C., and Hausman, K. (2023). Offline reinforcement learning at multiple frequencies. In Conference on Robot Learning, pages 2041–2051. PMLR.
- Chen et al., (2021) Chen, C., Tang, H., Hao, J., Liu, W., and Meng, Z. (2021). Addressing action oscillations through learning policy inertia. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7020–7027.
- Chen et al., (2018) Chen, R. T., Rubanova, Y., Bettencourt, J., and Duvenaud, D. K. (2018). Neural ordinary differential equations. Advances in neural information processing systems, 31.
- Chua et al., (2018) Chua, K., Calandra, R., McAllister, R., and Levine, S. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In NeurIPS.
- Cranmer et al., (2020) Cranmer, M., Greydanus, S., Hoyer, S., Battaglia, P., Spergel, D., and Ho, S. (2020). Lagrangian neural networks. arXiv preprint arXiv:2003.04630.
- Curi et al., (2020) Curi, S., Berkenkamp, F., and Krause, A. (2020). Efficient model-based reinforcement learning through optimistic policy search and planning. Advances in Neural Information Processing Systems, 33:14156–14170.
- Dabney et al., (2020) Dabney, W., Ostrovski, G., and Barreto, A. (2020). Temporally-extended epsilon-greedy exploration. arXiv preprint arXiv:2006.01782.
- Djellout et al., (2004) Djellout, H., Guillin, A., and Wu, L. (2004). Transportation cost-information inequalities and applications to random dynamical systems and diffusions. The Annals of Probability, 32(3):2702–2732.
- Doya, (2000) Doya, K. (2000). Reinforcement learning in continuous time and space. Neural computation, 12(1):219–245.
- Durugkar et al., (2016) Durugkar, I. P., Rosenbaum, C., Dernbach, S., and Mahadevan, S. (2016). Deep reinforcement learning with macro-actions. arXiv preprint arXiv:1606.04615.
- Eberhard et al., (2022) Eberhard, O., Hollenstein, J., Pinneri, C., and Martius, G. (2022). Pink noise is all you need: Colored noise exploration in deep reinforcement learning. In The Eleventh International Conference on Learning Representations.
- Engquist et al., (2007) Engquist, B., Li, X., Ren, W., Vanden-Eijnden, E., et al. (2007). Heterogeneous multiscale methods: a review. Communications in Computational Physics, 2(3):367–450.
- Freeman et al., (2021) Freeman, C. D., Frey, E., Raichuk, A., Girgin, S., Mordatch, I., and Bachem, O. (2021). Brax - a differentiable physics engine for large scale rigid body simulation.
- Greydanus et al., (2019) Greydanus, S., Dzamba, M., and Yosinski, J. (2019). Hamiltonian neural networks. Advances in neural information processing systems, 32.
- Grigsby et al., (2021) Grigsby, J., Yoo, J. Y., and Qi, Y. (2021). Towards automatic actor-critic solutions to continuous control. arXiv preprint arXiv:2106.08918.
- Haarnoja et al., (2018) Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861–1870. PMLR.
- Hafner et al., (2019) Hafner, D., Lillicrap, T., Ba, J., and Norouzi, M. (2019). Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603.
- Harb et al., (2018) Harb, J., Bacon, P.-L., Klissarov, M., and Precup, D. (2018). When waiting is not an option: Learning options with a deliberation cost. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
- Heemels et al., (2021) Heemels, W., Johansson, K. H., and Tabuada, P. (2021). Event-triggered and self-triggered control. In Encyclopedia of Systems and Control, pages 724–730. Springer.
- Heemels et al., (2012) Heemels, W. P., Johansson, K. H., and Tabuada, P. (2012). An introduction to event-triggered and self-triggered control. In 2012 ieee 51st ieee conference on decision and control (cdc), pages 3270–3285. IEEE.
- Holt et al., (2023) Holt, S., Hüyük, A., and van der Schaar, M. (2023). Active observing in continuous-time control. In Thirty-seventh Conference on Neural Information Processing Systems.
- Huang et al., (2019) Huang, Y., Kavitha, V., and Zhu, Q. (2019). Continuous-time markov decision processes with controlled observations. In 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pages 32–39. IEEE.
- Jones et al., (2009) Jones, D. S., Plank, M., and Sleeman, B. D. (2009). Differential equations and mathematical biology. CRC press.
- Kaandorp and Koole, (2007) Kaandorp, G. C. and Koole, G. (2007). Optimal outpatient appointment scheduling. Health care management science, 10:217–229.
- Kabzan et al., (2020) Kabzan, J., Valls, M. I., Reijgwart, V. J., Hendrikx, H. F., Ehmke, C., Prajapat, M., Bühler, A., Gosala, N., Gupta, M., Sivanesan, R., et al. (2020). Amz driverless: The full autonomous racing system. Journal of Field Robotics, 37(7):1267–1294.
- Kakade et al., (2020) Kakade, S., Krishnamurthy, A., Lowrey, K., Ohnishi, M., and Sun, W. (2020). Information theoretic regret bounds for online nonlinear control. NeurIPS, 33:15312–15325.
- Kallianpur and Sundar, (2014) Kallianpur, G. and Sundar, P. (2014). Jump Markov Processes. In Stochastic Analysis and Diffusion Processes. Oxford University Press.
- Karimi, (2023) Karimi, A. (2023). Decision frequency adaptation in reinforcement learning using continuous options with open-loop policies.
- Kirschner and Krause, (2018) Kirschner, J. and Krause, A. (2018). Information directed sampling and bandits with heteroscedastic noise. In Conference On Learning Theory, pages 358–384. PMLR.
- Krale et al., (2023) Krale, M., Simão, T. D., and Jansen, N. (2023). Act-then-measure: reinforcement learning for partially observable environments with active measuring. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 33, pages 212–220.
- Lee et al., (2020) Lee, J., Lee, B.-J., and Kim, K.-E. (2020). Reinforcement learning for control with multiple frequencies. Advances in Neural Information Processing Systems, 33:3254–3264.
- Lenhart and Workman, (2007) Lenhart, S. and Workman, J. T. (2007). Optimal control applied to biological models. CRC press.
- Lillicrap et al., (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- Lutter et al., (2021) Lutter, M., Mannor, S., Peters, J., Fox, D., and Garg, A. (2021). Value iteration in continuous actions, states and time. arXiv preprint arXiv:2105.04682.
- Mankowitz et al., (2014) Mankowitz, D. J., Mann, T. A., and Mannor, S. (2014). Time regularized interrupting options. In International Conference on Machine Learning.
- Mann and Mannor, (2014) Mann, T. and Mannor, S. (2014). Scaling up approximate value iteration with options: Better policies with fewer iterations. In International conference on machine learning, pages 127–135. PMLR.
- Mnih et al., (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
- Ni and Jang, (2022) Ni, T. and Jang, E. (2022). Continuous control on time. In ICLR 2022 Workshop on Generalizable Policy Learning in Physical World.
- Panetta and Fister, (2003) Panetta, J. C. and Fister, K. R. (2003). Optimal control applied to competing chemotherapeutic cell-kill strategies. SIAM Journal on Applied Mathematics, 63(6):1954–1971.
- Rothfuss et al., (2023) Rothfuss, J., Sukhija, B., Birchler, T., Kassraie, P., and Krause, A. (2023). Hallucinated adversarial control for conservative offline policy evaluation. In Uncertainty in Artificial Intelligence, pages 1774–1784. PMLR.
- Schulman et al., (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International conference on machine learning, pages 1889–1897. PMLR.
- Schulman et al., (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Seifner and Sanchez, (2023) Seifner, P. and Sanchez, R. J. (2023). Neural markov jump processes. arXiv preprint arXiv:2305.19744.
- Sharma et al., (2017) Sharma, S., Srinivas, A., and Ravindran, B. (2017). Learning to repeat: Fine grained action repetition for deep reinforcement learning. arXiv preprint arXiv:1702.06054.
- Spong et al., (2006) Spong, M. W., Hutchinson, S., Vidyasagar, M., et al. (2006). Robot modeling and control, volume 3. Wiley New York.
- Srinivas et al., (2017) Srinivas, A., Sharma, S., and Ravindran, B. (2017). Dynamic action repetition for deep reinforcement learning. In Proc. AAAI.
- Srinivas et al., (2009) Srinivas, N., Krause, A., Kakade, S. M., and Seeger, M. (2009). Gaussian process optimization in the bandit setting: No regret and experimental design. arXiv preprint arXiv:0912.3995.
- Sukhija et al., (2024) Sukhija, B., Treven, L., Sancaktar, C., Blaes, S., Coros, S., and Krause, A. (2024). Optimistic active exploration of dynamical systems. NeurIPS.
- Sutton et al., (1999) Sutton, R. S., Precup, D., and Singh, S. (1999). Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211.
- Tap, (2000) Tap, F. (2000). Economics-based optimal control of greenhouse tomato crop production. Wageningen University and Research.
- Treven et al., (2023) Treven, L., Hübotter, J., Sukhija, B., Dörfler, F., and Krause, A. (2023). Efficient exploration in continuous-time model-based reinforcement learning.
- Turchetta et al., (2022) Turchetta, M., Corinzia, L., Sussex, S., Burton, A., Herrera, J., Athanasiadis, I., Buhmann, J. M., and Krause, A. (2022). Learning long-term crop management strategies with cyclesgym. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A., editors, Advances in Neural Information Processing Systems, volume 35, pages 11396–11409. Curran Associates, Inc.
- Vakili et al., (2021) Vakili, S., Khezeli, K., and Picheny, V. (2021). On information gain and regret bounds in gaussian process bandits. In AISTATS.
- Vezhnevets et al., (2016) Vezhnevets, A., Mnih, V., Osindero, S., Graves, A., Vinyals, O., Agapiou, J., and Kavukcuoglu, K. (2016). Strategic attentive writer for learning macro-actions. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.
- Williams and Rasmussen, (2006) Williams, C. K. and Rasmussen, C. E. (2006). Gaussian processes for machine learning, volume 2. MIT press Cambridge, MA.
- Yildiz et al., (2021) Yildiz, C., Heinonen, M., and Lähdesmäki, H. (2021). Continuous-time model-based reinforcement learning. In International Conference on Machine Learning, pages 12009–12018. PMLR.
- Yu et al., (2021) Yu, H., Xu, W., and Zhang, H. (2021). Taac: Temporally abstract actor-critic for continuous control. Advances in Neural Information Processing Systems, 34:29021–29033.
- Zheng et al., (2023) Zheng, Q., Henaff, M., Amos, B., and Grover, A. (2023). Semi-supervised offline reinforcement learning with action-free trajectories. In International conference on machine learning, pages 42339–42362. PMLR.
Appendix A Extended Theory
In this section, we prove Theorem 2 for OTaCoS. We separate the section into two parts: the proof for the transition cost setting (Section A.1) and the proof for the bounded number of switches setting (Section A.2).
We start with the definitions of model complexity and of a sub-Gaussian random vector, which we use extensively in this section.
Definition 2 (Model Complexity).
We define the model complexity as in Curi et al., (2020):
(9) |
Definition 3.
A random variable is said to be sub-Gaussian with variance proxy if and we have:
A random vector is said to be sub-Gaussian with variance proxy if for any the random variable is sub-Gaussian. We write .
In the following, we distinguish between the state of the augmented MDP and the true state of the dynamical system . The augmented state at time step includes the true state of the system, , the integrated reward between and , and the time left to go , i.e., .
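As a quick sanity check of the scalar definition, the following sketch tests the moment-generating-function condition E[exp(λX)] ≤ exp(λ²σ²/2) for a Rademacher variable (uniform on {-1, +1}), which is sub-Gaussian with variance proxy 1 because cosh(λ) ≤ exp(λ²/2). The MGF form of the condition is the standard one and is assumed here to match the definition above; the function names and the grid of λ values are purely illustrative.

```python
import math


def rademacher_mgf(lam: float) -> float:
    """Exact MGF of X uniform on {-1, +1}: E[exp(lam * X)] = cosh(lam)."""
    return math.cosh(lam)


def sub_gaussian_bound(lam: float, sigma: float) -> float:
    """Right-hand side of the (assumed) sub-Gaussian MGF condition."""
    return math.exp(0.5 * (lam * sigma) ** 2)


def satisfies_bound(sigma: float, lams) -> bool:
    """Check the MGF condition on a finite grid of lambda values."""
    return all(rademacher_mgf(l) <= sub_gaussian_bound(l, sigma) for l in lams)


# Illustrative grid of lambda values in [-5, 5].
lams = [k / 10.0 for k in range(-50, 51)]

holds_sigma_1 = satisfies_bound(1.0, lams)      # variance proxy 1 suffices
holds_sigma_half = satisfies_bound(0.5, lams)   # a proxy that is too small fails
print(holds_sigma_1, holds_sigma_half)
```

A grid check of this kind can only refute a candidate variance proxy; it does not prove the condition for all λ.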
A.1 Transition cost setting
We prove our regret bound for the transition cost case in the following. We start with the difference lemma which adapts Sukhija et al., (2024, Lemma 2) to our setting.
Lemma 3 (Difference lemma).
Define as
that is the total reward starting with time to go and state for the policy and dynamics . Here the expectation w.r.t. represents the expectation of the underlying trajectory induced by the policy on the dynamics . Then we have for all , , , , ;
(10) |
where is the state of and is the state of .
Proof.
Hence we have:
By repeating the step inductively, the result follows. ∎
In the following, we leverage the result above to bound the regret of our optimistic planner w.r.t. the difference in value functions.
Lemma 4 (Per episode regret bound).
Let Assumption 4 hold. Then we have with probability at least for all .
(11) |
Proof.
Now we derive an upper and lower bound on our value function.
Lemma 5 (Objective upper bound).
Let be any policy from the class and consider any , then we have:
Proof.
Since the running reward is bounded , the number of steps we can take in an episode is bounded by , and the switch cost is bounded , we have:
∎
A key lemma we use to bound the difference in value functions is the following from Kakade et al., (2020).
Lemma 6 (Absolute expectation difference under two Gaussians (Lemma C.2 of Kakade et al., (2020))).
Let and , and for any (appropriately measurable) positive function , it holds that:
Furthermore, due to Assumption 4, we can also bound the distance between the next-state predictions of the true system and the optimistic system .
Lemma 7.
Let Assumption 4 hold. Then we have the following for all .
Proof.
where the last inequality follows from the fact that ∎
Next, we relate the regret at each episode to the model epistemic uncertainty using Lemma 3 and Lemma 7.
Proof.
Now we can prove our regret bound for the transition cost case.
Proof.
We compute:
Here the first inequality follows from Corollary 8, the second from the monotonicity of the sequence , the third from Cauchy–Schwarz, and the last from maximizing the term in expectation. ∎
Our regret is sublinear if is sublinear. For general well-calibrated models this is hard to verify. However, for Gaussian process dynamics, is equal (up to constant factors) to the maximum information gain (Srinivas et al.,, 2009) (cf. Curi et al., (2020, Lemma 17)). The maximum information gain is sublinear for a rich class of kernels (Vakili et al.,, 2021), which yields sublinear regret for OTaCoS (see Sukhija et al., (2024, Theorem 2) for more detail).
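To give a feel for why sublinear information gain is plausible for smooth kernels, the sketch below evaluates the Gaussian process information gain 0.5 · log det(I + K/σ²) for evenly spaced inputs on [0, 1] and reports the gain per sample as the number of inputs grows. Everything here is an illustrative assumption rather than the setup of the paper: the RBF kernel, the lengthscale 0.2, the noise level 0.1, and the grid design. A fixed grid is only one admissible design, so these values are not the maximum information gain itself.

```python
import math


def rbf_kernel(x: float, y: float, lengthscale: float = 0.2) -> float:
    """Squared-exponential kernel (assumed for illustration)."""
    return math.exp(-0.5 * ((x - y) / lengthscale) ** 2)


def log_det_spd(a) -> float:
    """log det of a symmetric positive-definite matrix via Cholesky."""
    n = len(a)
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            chol[i][j] = math.sqrt(s) if i == j else s / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(n))


def information_gain(points, noise_std: float = 0.1) -> float:
    """0.5 * log det(I + K / noise_std^2) for the given inputs."""
    n = len(points)
    m = [
        [
            rbf_kernel(points[i], points[j]) / noise_std**2 + (1.0 if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]
    return 0.5 * log_det_spd(m)


def gain_on_grid(n: int) -> float:
    """Information gain of n evenly spaced inputs on [0, 1]."""
    return information_gain([i / (n - 1) for i in range(n)])


gains = {n: gain_on_grid(n) for n in (10, 20, 40, 80)}
per_sample = {n: g / n for n, g in gains.items()}
for n in gains:
    print(n, round(gains[n], 3), round(per_sample[n], 3))
```

In this toy setting the total gain keeps growing while the gain per sample shrinks, which is the qualitative behaviour that sublinear growth describes.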
A.2 Bounded number of switches
We overload the notation in this section and add the number of switches to the value function, such that we have
Lemma 10 (Per episode regret bound).
We have:
where is the state of one step hallucinated component and is the state of .
Proof.
Hence we have:
Repeating the step inductively and using , the lemma follows. ∎
A.2.1 Subgaussianity of the noise
In principle, we could assume that the noise is Gaussian and obtain the regret bound with the same analysis. However, stochastic flows are in many cases not exactly Gaussian but only sub-Gaussian. For such noise we cannot apply Lemma 6 and need to resort to a different analysis. First, we show that under mild assumptions on the SDE dynamics functions and , the resulting noise is sub-Gaussian.
To derive this result we follow the work of Djellout et al., (2004). We present the results rather informally; for more rigorous statements we refer the reader to Djellout et al., (2004).
Definition 4 (Wasserstein distance).
Let be a metric space and let be two probability measures on . We define:
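For intuition, in one dimension the Wasserstein distance between two empirical measures with the same number of atoms has a simple closed form: sort both samples and average the p-th powers of the gaps between matched order statistics. The sketch below implements this special case; the function name and the sample values are illustrative assumptions and cover only the real line with the absolute-value metric, not a general metric space.

```python
def wasserstein_1d(xs, ys, p: float = 1.0) -> float:
    """p-Wasserstein distance between two equally sized empirical samples on R.

    In one dimension the optimal coupling matches sorted samples, so the
    distance is (mean_i |x_(i) - y_(i)|^p)^(1/p).
    """
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equally sized samples")
    xs_sorted, ys_sorted = sorted(xs), sorted(ys)
    cost = sum(abs(x - y) ** p for x, y in zip(xs_sorted, ys_sorted)) / len(xs)
    return cost ** (1.0 / p)


# Shifting every atom by 0.5 moves the measure by exactly 0.5 in W_p.
base = [0.0, 1.0, 2.0, 3.0]
shifted = [x + 0.5 for x in base]

w1 = wasserstein_1d(base, shifted, p=1.0)
w2 = wasserstein_1d(base, shifted, p=2.0)
w1_permuted = wasserstein_1d([3.0, 0.0, 2.0, 1.0], shifted, p=1.0)
print(w1, w2, w1_permuted)
```

The result does not depend on the order in which samples are listed, since the distance is a property of the measures and not of the sample ordering.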
Definition 5 (Kullback–Leibler divergence).
Let be a metric space and let be two probability measures on . We define:
Definition 6 (-transportation cost information inequality).
Let be a metric space and let be a probability measure on . We say that satisfies the -transportation cost information inequality, and for short write , if there exists a constant such that for any measure on we have:
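A simple case where the inequality can be checked by hand is a mean shift of a one-dimensional Gaussian. Taking the reference measure N(0, σ²) and the alternative N(m, σ²), the 1-Wasserstein distance is |m| and the Kullback–Leibler divergence is m²/(2σ²). The sketch below assumes the normalization W₁ ≤ sqrt(2 C · KL), which is the convention we believe Djellout et al., (2004) use; under that assumption the inequality holds with C = σ² and is tight for mean shifts. All function names are hypothetical.

```python
import math


def kl_gaussian_shift(mean_shift: float, sigma: float) -> float:
    """KL( N(m, sigma^2) || N(0, sigma^2) ) = m^2 / (2 sigma^2)."""
    return mean_shift**2 / (2.0 * sigma**2)


def w1_gaussian_shift(mean_shift: float) -> float:
    """W_1 between N(m, sigma^2) and N(0, sigma^2) equals |m| (pure translation)."""
    return abs(mean_shift)


def t1_rhs(kl: float, c: float) -> float:
    """Right-hand side sqrt(2 * C * KL) under the assumed normalization."""
    return math.sqrt(2.0 * c * kl)


sigma = 0.7
shifts = [-2.0, -0.5, 0.1, 1.0, 3.0]
tol = 1e-12

# With C = sigma^2 the inequality holds (with equality) for every shift.
holds_with_sigma_sq = all(
    w1_gaussian_shift(m) <= t1_rhs(kl_gaussian_shift(m, sigma), sigma**2) + tol
    for m in shifts
)
# With a smaller constant it is violated, so C = sigma^2 cannot be improved here.
holds_with_half = all(
    w1_gaussian_shift(m) <= t1_rhs(kl_gaussian_shift(m, sigma), 0.5 * sigma**2) + tol
    for m in shifts
)
print(holds_with_sigma_sq, holds_with_half)
```

This only exercises one family of alternative measures; the definition requires the inequality for every measure on the space.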
We now state an important theorem of Bobkov and Götze, (1999) that we will use later.
Theorem 11 (From Bobkov and Götze, (1999)).
Let be a metric space and let be a probability measure on . We have that if and only if for any -integrable and -Lipschitz function and for any we have:
Next, we provide a condition under which is a sub-Gaussian random variable for any .
Corollary 12 (Adjusted Corollary 4.1 of Djellout et al., (2004)).
Assume
and denote the law of on the space (space of continuous functions from to ) by . Then, there exists a constant such that on the space equipped with the metric:
Let be any unit vector in and define:
We have:
Therefore for any the function is –Lipschitz. Since we have
the function is also -integrable. Combining the latter observation with Theorem 11 we obtain that for any and any we have:
Hence, under the assumptions of Theorem 2 for the bounded number of switches setting, we have that for any the random variable . The variance proxy depends on .
A.2.2 Lipschitzness of the expected flow
To apply the analysis to general sub-Gaussian noise, we also need to show that the dynamics function is Lipschitz. We start with some general results.
Lemma 13.
Let , and denote . If we have:
-
•
,
-
•
,
then is Lipschitz.
Proof.
We have:
∎
Lemma 14 (Lipschitzness of ).
There exists a positive constant such that the flow is –Lipschitz.
Proof.
We will first prove coordinate-wise Lipschitzness. We observe:
1. Lipschitzness in time:
2. Lipschitzness in state : To prove this, consider the , then we have
Note that and since both functions are Lipschitz. Define and use Itô's lemma to get
Moreover,
Note that , so we can apply Grönwall’s inequality to get
Moreover,
Hence we have:
3. Lipschitzness in action : We denote and . Following the same steps as in the proof of Lipschitzness in state, we arrive at:
Integration yields:
where we used and . Applying Grönwall’s inequality results in:
∎
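The Grönwall step in the proof above has a simple deterministic analogue that can be checked numerically: for an ODE whose drift is L-Lipschitz, two trajectories started at x₀ and y₀ stay within exp(L·t)·|x₀ − y₀| of each other. The sketch below verifies this for a noise-free scalar system integrated with explicit Euler steps. The drift, the constants, and the function names are illustrative assumptions; the proof itself concerns the expected flow of an SDE and uses Itô's lemma, which this sketch does not reproduce.

```python
import math

LIPSCHITZ = 2.0  # assumed Lipschitz constant of the illustrative drift


def drift(x: float) -> float:
    """Illustrative drift with |d/dx drift(x)| <= LIPSCHITZ everywhere."""
    return LIPSCHITZ * math.sin(x)


def euler_flow(x0: float, horizon: float, steps: int = 10_000) -> float:
    """State at time `horizon` of dx/dt = drift(x), using explicit Euler."""
    h = horizon / steps
    x = x0
    for _ in range(steps):
        x = x + h * drift(x)
    return x


x0, y0, horizon = 0.3, 0.8, 1.5

# Each Euler step expands the gap by at most (1 + h L) <= exp(h L),
# so the discrete flow obeys the same bound as the continuous one.
gap = abs(euler_flow(x0, horizon) - euler_flow(y0, horizon))
gronwall_bound = math.exp(LIPSCHITZ * horizon) * abs(x0 - y0)
print(gap, gronwall_bound)
```

The bound is typically loose, as here, because it accounts for the worst-case expansion rate at every point along the trajectory.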
Corollary 15 (Lipschitzness of the ).
The cost flow is –Lipschitz, where is a constant.
Proof.
Corollary 16 (Lipschitzness of ).
The unknown function is –Lipschitz, where is a constant.
A.2.3 Regret bound
Lemma 17 (Per episode regret bound (general sub-Gaussian noise)).
Proof.
Applying Lemma 5 of Curi et al., (2020) the result follows. ∎