Variable Time Step Reinforcement Learning for Robotic Applications

Dong Wang, Giovanni Beltrame. Manuscript received …, 2024; accepted ,,,. This paper was recommended for publication by Editor XXX upon evaluation of the reviewers’ comments. This work was partly presented at the Finding the Frame Workshop in August 2024. D. Wang and G. Beltrame are with the Department of Computer and Software Engineering, Polytechnique Montréal, Montréal, QC, H3T1J4 Canada (e-mail: {dong-1.wang,giovanni.beltrame}@polymtl.ca).
Abstract

Traditional reinforcement learning (RL) generates discrete control policies, assigning one action per cycle. These policies are usually implemented in a fixed-frequency control loop. This rigidity presents challenges because the optimal control frequency is task-dependent; suboptimal frequencies increase computational demands and reduce exploration efficiency. Variable Time Step Reinforcement Learning (VTS-RL) addresses these issues with adaptive control frequencies, executing actions only when necessary, thus reducing computational load and extending the action space to include action durations. In this paper we introduce the Multi-Objective Soft Elastic Actor-Critic (MOSEAC) method to perform VTS-RL, validating it through theoretical analysis and experimentation in simulation and on real robots. Results show faster convergence, better training results, and reduced energy consumption with respect to other variable- or fixed-frequency approaches.

Index Terms:
Variable Time Step Reinforcement Learning (VTS-RL), Deep Learning in Robotics and Automation, Learning and Adaptive Systems, Optimization and Optimal Control

I Introduction

Deep reinforcement learning (DRL) algorithms have achieved significant success in gaming [1, 2] and robotic control [3, 4]. Traditional DRL employs a fixed control loop at set intervals (e.g., every 0.1 seconds). This fixed-rate control can cause stability issues and high computational demands, particularly in dynamic environments where the optimal action frequency varies.

Variable Time Step Reinforcement Learning (VTS-RL) has been recently proposed to address these issues. Based on reactive programming principles [5, 6], VTS-RL executes control actions only when necessary, reducing computational load and expanding the action space to include variable action durations. For example, in robotic manipulation, VTS-RL allows a robot arm to dynamically adjust its control frequency, using lower frequencies during simple, repetitive tasks and higher frequencies during complex maneuvers or when handling delicate objects [7].

Two notable VTS-RL algorithms are Soft Elastic Actor-Critic (SEAC) [8] and Continuous-Time Continuous-Options (CTCO) [7]. CTCO employs continuous-time decision-making with flexible option durations. CTCO requires tuning several hyperparameters, such as its radial basis functions (RBFs) and the time-related hyperparameter $\tau$ for its adaptive discount factor $\gamma$, complicating tuning in some environments.

SEAC incorporates reward terms for task energy (number of actions) and task time, making it effective in time-restricted environments like racing video games [9]. However, it also requires careful hyperparameter tuning (balancing task, energy, and time costs) to maintain performance. The sensitivity of SEAC and CTCO to hyperparameter settings challenges users to fully leverage their potential.

We introduce the Multi-Objective Soft Elastic Actor-Critic (MOSEAC) algorithm [10]. MOSEAC integrates action durations into the action space and adjusts hyperparameters based on observed trends in task rewards during training, reducing the need to set multiple hyperparameters. Additionally, our hyperparameter setting approach can be broadly applied to any continuous action reinforcement learning algorithm. This adaptability facilitates the transition from fixed-time step to variable-time step reinforcement learning, significantly expanding its practical application.

In this paper, we provide an in-depth analysis of MOSEAC’s theoretical performance, covering its framework, implementation, performance guarantees, convergence and complexity analysis. We also deploy the MOSEAC model as the navigation policy in simulation and on a physical AgileX Limo [11]. Our evaluation includes statistical performance analysis, trajectory similarity analysis between simulated and real environments, and a comparison of average computational resource consumption on both CPU and GPU.

MOSEAC allows the optimization of the control frequency for RL policies, as well as a reduction in computational load on the robot’s onboard computer. The resources saved can be redirected to other critical tasks such as environmental sensing [12] and communication [13]. In addition, the MOSEAC principles can be applied to a variety of RL algorithms, making it a useful tool for the deployment of RL policies on physical robotic systems and for reducing the sim-to-real gap.

The paper is organized as follows. Section II reviews the current research status of VTS-RL. Section III details the MOSEAC framework with its pseudocode. Section IV presents the theoretical analysis of MOSEAC. Section V describes the implementation of our validation environment in the real world and the method used to build the simulation environment. Section VI presents the experimental results both in simulation and on the real AgileX Limo. Finally, Section VII concludes the paper.

II Related Work

The importance of action duration in reinforcement learning (RL) has been significantly underestimated, yet it is crucial for applying RL algorithms in real-world scenarios, impacting an agent’s exploration capabilities. For instance, high frequencies may reduce exploration efficiency but are essential in delayed environments [14]. Recent studies by Amin et al. [15], and Park et al. [16] have highlighted this issue, showing that variable action durations can significantly affect learning performance. Building on these insights, Wang & Beltrame [9] and Karimi et al. [7] further explored how different frequencies impact learning, noting that excessively high frequencies can impede convergence. Their findings suggest that dynamic control frequencies, which adjust in real-time based on reactive principles, could enhance performance and adaptability.

Expanding on the idea of adaptive control, Sharma et al. [17] introduced a related concept of repetitive action reinforcement learning, where an agent performs identical actions over successive states, combining them into a larger action. This concept was further explored by Metelli et al. [18] and Lee et al. [19] in gaming contexts. However, despite its potential, this method does not fully address physical properties or computational demands, limiting its practical application.

Additionally, Chen et al. [20] proposed modifying the traditional control rate by integrating actions such as “sleep” to reduce activity periods. However, this approach still required fixed-frequency system checks, ultimately failing to reduce computational load as intended.

In the domain of traffic signal control, research focused on hybrid action reinforcement learning. The study highlighted the importance of synchronously optimizing stage specifications and green interval durations, underscoring the potential of variable action durations to improve decision-making processes in complex, real-time environments [21].

Furthermore, research on variable damping control for wearable robotic limbs has showcased the application of reinforcement learning to adjust control parameters adaptively in response to changing states. This study illustrated the effectiveness of RL in maintaining system stability and performance under varying conditions, reinforcing the value of dynamic control frequencies in practical applications [22].

These studies collectively highlight the ongoing efforts to integrate variable action durations and adaptive control mechanisms into RL. They also underscore the nascent stage of effectively integrating variable control frequencies and repetitive behaviors into practical applications, emphasizing their critical importance for enhancing algorithm performance and applicability in real-world scenarios.

III Algorithm Framework

Soft Elastic Actor-Critic (SEAC) is an extension of the Soft Actor-Critic (SAC) algorithm that addresses the limitations of fixed control rates in reinforcement learning (RL) [8]. Traditional Markov Decision Processes (MDPs) in RL do not account for the duration of actions, assuming that all actions are executed over uniform time steps. This can lead to inefficiencies, as the time between two actions can vary widely, requiring a fixed-frequency control rate in practical deployments. SEAC breaks this assumption by dynamically adjusting the duration of each action based on the state and environmental conditions, following the principles of reactive programming. By incorporating action duration $D$ into the action set, SEAC can decide on both the action and its duration to optimize energy consumption and computational efficiency.

MOSEAC extends SEAC [8]: it combines SEAC’s hyperparameters for balancing task, energy, and time rewards, and provides a method for automatically adjusting its other hyperparameters during training. Similar to SEAC, the MOSEAC reward includes components for:

  • Quality of task execution (the standard RL reward),

  • Time required to complete a task (important for varying action durations), and

  • Energy (the number of time steps, a.k.a. the number of actions taken, which we aim to minimize).

Definition 1

The reward associated with the state space is:

$$R = \alpha_m R_t R_\tau - \alpha_\varepsilon \tag{1}$$

where $R_t$ is the task reward and $R_\tau$ is a time-dependent term.

$\alpha_m$ is a weighting factor used to modulate the magnitude of the reward: its primary function is to prevent the reward from being too small, which could lead to task failure, or too large, which could cause reward explosion [23], ensuring stable learning.

$\alpha_\varepsilon$ is a penalty parameter applied at each time step to impose a cost on the agent’s actions. This parameter gives a fixed cost to the execution of an action, thereby discouraging unnecessary ones. In practice, $\alpha_\varepsilon$ promotes the completion of a task using fewer time steps (remember that time steps have variable duration), i.e., it reduces the energy used by the control loop of the agent.

We determine the optimal policy $\pi^*$, which maximizes the reward $R$.

The reward is designed to minimize both energy cost (number of steps) and the total time to complete the task through $R_\tau$. By scaling the task-specific reward based on action duration with $\alpha_m$, agents are motivated to complete tasks using fewer actions:

Definition 2

The remap relationship between action duration and reward is:

$$R_\tau = D_{min} / D, \quad R_\tau \in [D_{min} / D_{max}, 1] \tag{2}$$

where $D$ is the duration of the current action, $D_{min}$ is the minimum duration of an action (strictly greater than 0), and $D_{max}$ is the maximum duration of an action.
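As a minimal sketch of Definitions 1 and 2 (not the authors' implementation), the per-step reward could be computed as below. The function and argument names are hypothetical, and rejecting durations outside $[D_{min}, D_{max}]$ with an exception is our assumption rather than something the paper specifies.

```python
def time_term(duration: float, d_min: float, d_max: float) -> float:
    """R_tau = D_min / D, which lies in [D_min / D_max, 1] (Definition 2)."""
    # Assumption: out-of-range durations are treated as an error.
    if not 0.0 < d_min <= duration <= d_max:
        raise ValueError("expected 0 < d_min <= duration <= d_max")
    return d_min / duration


def moseac_reward(task_reward: float, duration: float,
                  alpha_m: float, alpha_eps: float,
                  d_min: float, d_max: float) -> float:
    """R = alpha_m * R_t * R_tau - alpha_eps (Definition 1)."""
    r_tau = time_term(duration, d_min, d_max)
    return alpha_m * task_reward * r_tau - alpha_eps


if __name__ == "__main__":
    # Illustrative values only: an action lasting twice the minimum duration.
    print(moseac_reward(task_reward=1.0, duration=0.02,
                        alpha_m=1.0, alpha_eps=0.1,
                        d_min=0.01, d_max=0.1))
```

With a positive task reward, the shortest admissible duration gives $R_\tau = 1$ and leaves the task reward unscaled, while longer durations shrink it toward $D_{min}/D_{max}$; the constant $\alpha_\varepsilon$ is subtracted on every step regardless of duration.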

We automatically set $\alpha_m$ and $\alpha_\varepsilon$ during training. Based on previous results [9], we bind the increase of $\alpha_m$ to a decrease of $\alpha_\varepsilon$ using a sigmoid function to mitigate convergence issues, specifically the problem of sparse rewards caused by a large $\alpha_\varepsilon$ and a small $\alpha_m$. This adjustment ensures that rewards are appropriately balanced, facilitating learning and convergence.

Definition 3

The relationship between $\alpha_\varepsilon$ and $\alpha_m$ is:

$$\alpha_\varepsilon = 0.2 \cdot \left(1 - \frac{1}{1 + e^{-\alpha_m + 1}}\right) \tag{3}$$

Following the experience reported in [8], we establish a mapping relationship between the two parameters: when the initial value of $\alpha_m$ is 1.0, the initial value of $\alpha_\varepsilon$ is 0.1. As $\alpha_m$ increases, $\alpha_\varepsilon$ decreases, but never falls below 0.
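The mapping of Definition 3 is easy to check numerically. The sketch below is illustrative only, and the function name is hypothetical; it evaluates Equation (3) directly.

```python
import math


def alpha_eps_from_alpha_m(alpha_m: float) -> float:
    """alpha_eps = 0.2 * (1 - 1 / (1 + exp(-alpha_m + 1))) (Definition 3)."""
    return 0.2 * (1.0 - 1.0 / (1.0 + math.exp(-alpha_m + 1.0)))


if __name__ == "__main__":
    # alpha_m = 1.0 is the initial value mentioned in the text.
    for alpha_m in (1.0, 2.0, 5.0, 10.0):
        print(alpha_m, alpha_eps_from_alpha_m(alpha_m))
```

At $\alpha_m = 1$ the exponent is zero, so the sigmoid equals 0.5 and $\alpha_\varepsilon = 0.2 \cdot 0.5 = 0.1$, matching the initial values stated above. Larger values of $\alpha_m$ push the sigmoid toward 1 and $\alpha_\varepsilon$ toward 0 from above.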

Overall, we update $\alpha_m$ based on the current trend in average reward during training. To guarantee stability, we ensure the change is monotonic, forcing a uniform sweep of the parameter space, and we impose a maximum value on $\alpha_m$, namely $\alpha_{max}$. To determine the trend in the average reward, we perform a linear regression over the average rewards collected in the current update interval and compute the slope of the resulting line: if it is negative, the reward is declining.

Definition 4

The slope of the average reward ($k_R$) is:

$$k_R(R_a) = \frac{n\sum_{i=1}^{n}\left(i \cdot {R_a}_i\right) - \left(\sum_{i=1}^{n} i\right)\left(\sum_{i=1}^{n} {R_a}_i\right)}{n\sum_{i=1}^{n} i^2 - \left(\sum_{i=1}^{n} i\right)^2} \tag{4}$$

where $n$ is the total number of data points collected across the update interval ($k_{update}$ in Algorithm 1). The update interval is a hyperparameter that determines the frequency of updates of the neural networks used in the actor and critic policies, which occur after every $n$ episodes [24]. $R_a$ represents the list of average rewards $({R_a}_1, {R_a}_2, \ldots, {R_a}_n)$ during training. Here, an average reward ${R_a}_i$ is calculated across one episode.
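Equation (4) is the ordinary least-squares slope of the average rewards regressed against the episode index $1, \ldots, n$. A minimal sketch follows; the function name is hypothetical, and the guard against fewer than two data points is our addition.

```python
from typing import Sequence


def reward_slope(avg_rewards: Sequence[float]) -> float:
    """Least-squares slope k_R of average reward against episode index 1..n."""
    n = len(avg_rewards)
    if n < 2:
        # With a single point the denominator of Equation (4) is zero.
        raise ValueError("need at least two episodes to fit a line")
    idx = range(1, n + 1)
    sum_i = sum(idx)
    sum_r = sum(avg_rewards)
    sum_ir = sum(i * r for i, r in zip(idx, avg_rewards))
    sum_i2 = sum(i * i for i in idx)
    return (n * sum_ir - sum_i * sum_r) / (n * sum_i2 - sum_i ** 2)


if __name__ == "__main__":
    print(reward_slope([1.0, 2.0, 3.0]))  # rising rewards: positive slope
    print(reward_slope([3.0, 2.0, 1.0]))  # falling rewards: negative slope
```

Only the sign of the slope is used by the update rule: a negative value signals that the average reward is declining over the interval.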

Definition 5

We adaptively adjust the reward every $k_{update}$ episodes when $k_R < 0$:

$$\left\{\begin{array}{ll}\alpha_m = \alpha_m + \psi & \text{if } \alpha_m < \alpha_{max} \\ \alpha_m = \alpha_{max} & \text{otherwise}\end{array}\right. \tag{5}$$

where $\psi$ serves as the sole additional MOSEAC hyperparameter necessary for adjusting the reward equation during training. The parameter $\alpha_{max}$ represents the upper limit of $\alpha_m$, ensuring algorithmic convergence and preventing reward explosion. Furthermore, $\alpha_\varepsilon$ is adjusted in accordance with Definition 3.
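A possible reading of Definitions 3 and 5 as a single update step is sketched below. The function name is hypothetical, and clipping with `min` so that $\alpha_m$ never overshoots $\alpha_{max}$ is our assumption about how the bound is enforced.

```python
import math
from typing import Tuple


def update_alphas(alpha_m: float, slope: float,
                  psi: float, alpha_max: float) -> Tuple[float, float]:
    """Raise alpha_m by psi when the reward slope is negative, then remap alpha_eps."""
    if slope < 0.0:
        # Assumption: clip so that alpha_m never exceeds alpha_max.
        alpha_m = min(alpha_m + psi, alpha_max)
    # Definition 3: alpha_eps follows alpha_m through a sigmoid.
    alpha_eps = 0.2 * (1.0 - 1.0 / (1.0 + math.exp(-alpha_m + 1.0)))
    return alpha_m, alpha_eps


if __name__ == "__main__":
    # Illustrative values only; psi and alpha_max are not taken from the paper.
    alpha_m, alpha_eps = 1.0, 0.1
    for slope in (-0.2, 0.1, -0.05):
        alpha_m, alpha_eps = update_alphas(alpha_m, slope, psi=0.5, alpha_max=2.0)
        print(alpha_m, alpha_eps)
```

Because $\alpha_m$ only ever increases, the sweep over the parameter space is monotonic, and each increase lowers the per-step penalty $\alpha_\varepsilon$.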

Algorithm 1 shows the pseudocode of MOSEAC. In short, MOSEAC extends the SAC algorithm by incorporating action duration $D$ into the action policy set. This expansion allows the algorithm to predict the action and its duration simultaneously. The reward is calculated using Equation (1), and its changes are continuously monitored. If the reward trend declines, $\alpha_m$ increases linearly at a rate of $\psi$, without exceeding $\alpha_{max}$. The actor and critic networks are periodically updated as in the SAC algorithm, based on these preprocessed rewards.

Require: a policy $\pi$ with a set of parameters $\theta$, $\theta'$; critic parameters $\phi$, $\phi'$; variable time step environment model $\Omega$; learning rates $\lambda_p$, $\lambda_q$; reward buffer $\beta_r$; replay buffer $\beta$.

Initialization: $i = 0$, $t_i = 0$, $\beta_r = 0$, observe $S_0$
while $t_i \leq t_{max}$ do
    for $i \leq k_{length} \wedge \text{Not Done}$ do
        $A_i, D_i = \pi_\theta(S_i)$    → sample action and its duration
        $S_{i+1}, R_i = \Omega(A_i, D_i)$    → compute reward with Definitions 1 and 2
        $i \leftarrow i + 1$
    end for
    $\beta_r \leftarrow \frac{1}{i} \sum_{0}^{i} R_i$    → collect the average reward for one episode
    $\beta \leftarrow S_{0 \sim i}, A_{0 \sim i}, D_{0 \sim i}, R_{0 \sim i}, S_{1 \sim i+1}$
    $i = 0$
    $t_i \leftarrow t_i + 1$
    if $t_i \geq k_{init}$ and $k_{update}$ divides $t_i$ then
        Sample $S, A, D, R, S'$ from $\beta$
        $\phi \leftarrow \phi - \lambda_q \nabla_\phi \mathcal{L}_Q(\phi, S, A, D, R, S')$    → critic update
        $\theta \leftarrow \theta - \lambda_p \nabla_\theta \mathcal{L}_\pi(\theta, S, A, D, \phi)$    → actor update
        if $k_R(\beta_r) < 0$ then    → see Definition 4 for $k_R$
            $\alpha_m = \alpha_m + \psi$ if $\alpha_m < \alpha_{max}$, or $\alpha_m = \alpha_{max}$ otherwise
            $\alpha_\varepsilon \leftarrow F_{update}(\alpha_m)$    → update $\alpha_m, \alpha_\varepsilon$ following Definition 3
        end if
        $\beta_r = 0$    → re-record average reward values under new hyperparameters
    end if
    Perform soft update of $\phi'$ and $\theta'$
end while

Algorithm 1: Multi-Objective Soft Elastic Actor and Critic (MOSEAC)

Here, $t_{max}$ represents the maximum number of training steps [24]; $k_{length}$ is the maximum number of exploration steps per episode [24]; $k_{init}$ is the number of steps in the initial random exploration phase [24]. The reward $R(S_i, A_i, D_i)$ depends on state ($S_i$), action ($A_i$) and duration $D_i \in [D_{min}, D_{max}]$.
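The text does not specify how the policy output is constrained to $[D_{min}, D_{max}]$. One common option in SAC-style implementations, shown here purely as an assumption, is to squash the raw output to $[-1, 1]$ (for example with tanh) and map it affinely onto the duration range. The function name and the numeric bounds are hypothetical.

```python
import math


def scale_duration(raw: float, d_min: float, d_max: float) -> float:
    """Affinely map raw in [-1, 1] to a duration in [d_min, d_max]."""
    # Clamp defensively in case the caller passes an unsquashed value.
    raw = max(-1.0, min(1.0, raw))
    return d_min + 0.5 * (raw + 1.0) * (d_max - d_min)


if __name__ == "__main__":
    # Assumed bounds for illustration: 10 ms to 100 ms per action.
    for pre_activation in (-3.0, 0.0, 3.0):
        squashed = math.tanh(pre_activation)
        print(pre_activation, scale_duration(squashed, d_min=0.01, d_max=0.1))
```

Under this mapping a squashed output of $-1$ yields $D_{min}$ and $+1$ yields $D_{max}$, so every sampled duration is admissible for the reward of Definition 2.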

Overall, our reward scalarizes a multi-objective optimization problem including rewards for task, time, and energy. Unlike Hierarchical Reinforcement Learning (HRL) [25, 26], which seeks Pareto optimality [27, 28] with layered reward policies, our method simplifies the approach, making our strategy easily adaptable to various algorithms. While our goal is to avoid the complexity of Pareto optimization, dynamically adjusting the reward structure inevitably places us on the Pareto optimization curve. For a theoretical analysis of the Pareto curve and how $\alpha_{max}$ ensures algorithmic stability, please refer to Appendix A.

Apart from a series of hyperparameters inherent to RL that need adjustment, such as learning rate, $t_{max}$, $k_{update}$, etc., $\psi$ is the primary hyperparameter that requires tuning in MOSEAC. Determining the appropriate $\psi$ value remains a critical optimization point. We recommend using the pre-set $\psi$ value provided in our implementation to reduce the need for additional adjustments. If training performance is poor, a large $\psi$ may cause the reward signal’s gradient to change too rapidly, leading to instability, suggesting a reduction in $\psi$. Conversely, if training is slow, a small $\psi$ may result in a weak reward signal, affecting convergence, suggesting an increase in $\psi$.

IV Theoretical Analysis

We have analyzed the theoretical performance of MOSEAC through various aspects, including performance guarantees, convergence and complexity analysis. Table I provides the notation for the following.

IV-A Performance Guarantees

In the standard SAC algorithm [29], the policy $\pi_\theta(a|s)$ selects action $a$, and training updates the policy parameters $\theta$ and the value function parameters $\phi$. The objective function is:

J(πθ)=𝔼(s,a)πθ[Qπ(s,a)+α(πθ(|s))]J(\pi_{\theta})=\mathbb{E}_{(s,a)\sim\pi_{\theta}}\left[Q^{\pi}(s,a)+\alpha% \mathcal{H}(\pi_{\theta}(\cdot|s))\right]italic_J ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ) = blackboard_E start_POSTSUBSCRIPT ( italic_s , italic_a ) ∼ italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT end_POSTSUBSCRIPT [ italic_Q start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_s , italic_a ) + italic_α caligraphic_H ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ | italic_s ) ) ] (6)

where J(πθ)𝐽subscript𝜋𝜃J(\pi_{\theta})italic_J ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ) is the objective function, Qπ(s,a)superscript𝑄𝜋𝑠𝑎Q^{\pi}(s,a)italic_Q start_POSTSUPERSCRIPT italic_π end_POSTSUPERSCRIPT ( italic_s , italic_a ) is the state-action value function, α𝛼\alphaitalic_α is a temperature parameter controlling the entropy term, and (πθ(|s))\mathcal{H}(\pi_{\theta}(\cdot|s))caligraphic_H ( italic_π start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT ( ⋅ | italic_s ) ) is the entropy of the policy.

When the action space is extended to include the action duration $D$ and the reward function is modified to $R = \alpha_m R_t R_\tau - \alpha_\varepsilon$, where $\alpha_m \geq 0$, $0 < R_\tau \leq 1$, and $\alpha_\varepsilon$ is a small positive constant, the new objective function becomes:

$$J(\pi_\theta) = \mathbb{E}_{(s,a,D)\sim\pi_\theta}\left[Q^{\pi}(s,a,D) + \alpha\,\mathcal{H}(\pi_\theta(\cdot|s))\right] \tag{7}$$

where $Q^{\pi}(s,a,D)$ is the extended state-action value function with time dimension $D$, and $R_t$ and $R_\tau$ are components of the reward function (see Definition 1).
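As a minimal sketch of the modified reward, the function below evaluates $R = \alpha_m R_t R_\tau - \alpha_\varepsilon$ for a single decision. It assumes $R_\tau = D_{\min}/D$, the form that appears in the MOSEAC objective; the function and argument names are hypothetical.

```python
def moseac_reward(r_t, duration, d_min, alpha_m, alpha_eps):
    """R = alpha_m * R_t * R_tau - alpha_eps, assuming R_tau = d_min / duration."""
    if d_min <= 0 or duration < d_min:
        raise ValueError("expected 0 < d_min <= duration")
    r_tau = d_min / duration  # lies in (0, 1]; equals 1 at the shortest duration
    return alpha_m * r_t * r_tau - alpha_eps
```

For example, `moseac_reward(2.0, duration=0.2, d_min=0.1, alpha_m=1.5, alpha_eps=0.01)` scales the task reward by $R_\tau = 0.5$ before subtracting the offset.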

Replacing $Q^{\pi}$, the objective function for MOSEAC, incorporating the extended action space and the adaptive reward function, becomes:

$$J(\pi) = \mathbb{E}_{(s_t,a_t,D_t)\sim\rho_\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}\left(\alpha_m(t)\,R_t\,\frac{D_{\min}}{D_t} - \alpha_\varepsilon(t) + \alpha\,\mathcal{H}(\pi(\cdot|s_t))\right)\right] \tag{8}$$

where:

  • $\alpha_m(t)$ is the adaptive parameter that increases monotonically when the average reward decreases, with an upper limit of $\alpha_{max}$.

  • $\alpha_\varepsilon(t)$ is the adaptive parameter that decreases monotonically when the average reward decreases.

  • $\mathcal{H}(\pi(\cdot|s_t)) = -\mathbb{E}_{(a_t,D_t)\sim\pi(\cdot|s_t)}\left[\log\pi(a_t,D_t|s_t)\right]$ is the entropy of the policy.

The SAC policy improvement theorem [29] ensures that

$$Q^{\pi_k}(s,a,D) \geq Q^{\pi_{k-1}}(s,a,D) \tag{9}$$

for any iteration $k \geq 1$.

The soft Bellman equation in MOSEAC is:

$$Q^{\pi}(s,a,D) = \alpha_m(t)\,R_t\,\frac{D_{\min}}{D} - \alpha_\varepsilon(t) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a,D)}\left[V^{\pi}(s')\right] \tag{10}$$

where the value function $V^{\pi}(s)$ is defined as:

$$V^{\pi}(s) = \mathbb{E}_{(a,D)\sim\pi(\cdot|s)}\left[Q^{\pi}(s,a,D) - \alpha\log\pi(a,D|s)\right] \tag{11}$$

To prove convergence, we need to show that the soft Bellman operator is a contraction mapping. Let $\mathcal{T}$ be the soft Bellman operator. For any two Q-functions $Q_1$ and $Q_2$:

$$\begin{aligned}
\|\mathcal{T}Q_1 - \mathcal{T}Q_2\|_\infty &= \sup_{s,a,D}\Big|\,\alpha_m(t)\,R_t\,\frac{D_{\min}}{D} - \alpha_\varepsilon(t) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a,D)}\left[V_1(s')\right] \\
&\qquad - \Big(\alpha_m(t)\,R_t\,\frac{D_{\min}}{D} - \alpha_\varepsilon(t) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a,D)}\left[V_2(s')\right]\Big)\Big| \\
&\leq \sup_{s,a,D}\gamma\,\Big|\mathbb{E}_{s'\sim P(\cdot|s,a,D)}\left[V_1(s') - V_2(s')\right]\Big| \\
&\leq \gamma\,\|V_1 - V_2\|_\infty \\
&\leq \gamma\,\|Q_1 - Q_2\|_\infty
\end{aligned} \tag{12}$$

Since $\gamma < 1$, the soft Bellman operator $\mathcal{T}$ is a contraction mapping. By Banach’s fixed-point theorem, there exists a unique fixed point $Q^{*}$ such that $Q^{*} = \mathcal{T}Q^{*}$ [30].
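The contraction inequality can be checked numerically on a small tabular problem. The sketch below is an illustration under stated assumptions, not part of the proof: the extended action $(a, D)$ is flattened into a single discrete index, the policy is fixed, the scaled reward $\alpha_m R_t D_{\min}/D - \alpha_\varepsilon$ is folded into a fixed table `R`, and all sizes and random values are arbitrary.

```python
import math
import random


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def soft_value(q_row, pi_row, alpha):
    """V(s) = E_{u~pi}[Q(s,u) - alpha * log pi(u|s)] for one state."""
    return sum(p * (q - alpha * math.log(p)) for q, p in zip(q_row, pi_row))


def soft_bellman_backup(Q, R, P, pi, alpha, gamma):
    """Apply the soft Bellman operator to a tabular Q (u indexes pairs (a, D))."""
    n_states = len(Q)
    V = [soft_value(Q[s], pi[s], alpha) for s in range(n_states)]
    return [
        [R[s][u] + gamma * sum(P[s][u][s2] * V[s2] for s2 in range(n_states))
         for u in range(len(Q[s]))]
        for s in range(n_states)
    ]


def sup_norm_diff(Q1, Q2):
    return max(abs(x - y) for r1, r2 in zip(Q1, Q2) for x, y in zip(r1, r2))


rng = random.Random(0)
n_states, n_ext = 4, 6          # arbitrary sizes for the illustration
gamma, alpha = 0.9, 0.2
R = [[rng.uniform(-1.0, 1.0) for _ in range(n_ext)] for _ in range(n_states)]
P = [[normalize([rng.random() + 1e-3 for _ in range(n_states)])
      for _ in range(n_ext)] for _ in range(n_states)]
pi = [normalize([rng.random() + 1e-3 for _ in range(n_ext)])
      for _ in range(n_states)]
Q1 = [[rng.uniform(-5.0, 5.0) for _ in range(n_ext)] for _ in range(n_states)]
Q2 = [[rng.uniform(-5.0, 5.0) for _ in range(n_ext)] for _ in range(n_states)]

lhs = sup_norm_diff(soft_bellman_backup(Q1, R, P, pi, alpha, gamma),
                    soft_bellman_backup(Q2, R, P, pi, alpha, gamma))
rhs = gamma * sup_norm_diff(Q1, Q2)
contraction_holds = lhs <= rhs + 1e-12
```

The reward table and the entropy term cancel in the difference of the two backups, which is why only the factor $\gamma$ survives in the bound.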

Let $\pi^{*}$ be the optimal policy and $\pi_k$ be the policy at iteration $k$. The error bound between the value functions of the optimal and learned policies is

$$\|V^{\pi^{*}} - V^{\pi_k}\| \leq \frac{2\alpha\gamma\log|\mathcal{A}\times[D_{\min},D_{\max}]|}{(1-\gamma)^{2}} \tag{13}$$

With $|\mathcal{A}\times[D_{\min},D_{\max}]|$ the size of the extended action space, the contraction property of the soft Bellman operator ensures that

$$\|V^{\pi^{*}} - V^{\pi_k}\|_\infty \leq \frac{2\alpha\gamma\log|\mathcal{A}\times[D_{\min},D_{\max}]|}{(1-\gamma)^{2}} \tag{14}$$

where $\alpha$ is the coefficient of the entropy term and $\gamma$ is the discount factor.
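To get a feel for how this bound scales, the following sketch evaluates its right-hand side. It assumes that both the action set and the duration interval have been discretized into finite sets, so that the size of the extended action space is the product of the two counts; this discretization and the function name are assumptions made for illustration only.

```python
import math


def value_error_bound(alpha, gamma, n_actions, n_durations):
    """Evaluate 2*alpha*gamma*log(|A x D|) / (1 - gamma)^2 for finite sets."""
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    size = n_actions * n_durations  # assumed size of the extended action space
    return 2.0 * alpha * gamma * math.log(size) / (1.0 - gamma) ** 2
```

The bound grows linearly with the entropy coefficient $\alpha$, logarithmically with the number of durations, and as $1/(1-\gamma)^2$ when the discount factor approaches 1.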

IV-B Convergence Analysis

With our reward function, the policy gradient is:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\Big[\nabla_\theta\log\pi_\theta(a,D|s)\cdot\big(Q^{\pi}(s,a,D)\cdot(\alpha_m\cdot R_\tau) - \alpha_\varepsilon\big)\Big] \tag{15}$$

where $\nabla_\theta J(\pi_\theta)$ is the gradient of the objective function with respect to the policy parameters $\theta$.

The value function update, incorporating the time dimension $D$ and our reward function, is:

$$L(\phi) = \mathbb{E}_{(s,a,D,r,s')}\Big[\Big(Q_\phi(s,a,D) - \big(r + \gamma\,\mathbb{E}_{(a',D')\sim\pi_\theta}\big[V_{\bar{\phi}}(s') - \alpha\log\pi_\theta(a',D'|s')\big]\big)\Big)^{2}\Big] \tag{16}$$

where $L(\phi)$ is the loss function for the value function update, $r$ is the reward, $\gamma$ is the discount factor, and $V_{\bar{\phi}}(s')$ is the target value function.
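A minimal sketch of this loss on a batch of stored transitions is given below. It assumes that the inner expectation over $(a', D')$ is approximated with a single sample per transition, and the dictionary keys are hypothetical names chosen for this example.

```python
def critic_loss(batch, gamma, alpha):
    """Mean squared soft Bellman residual over a batch of transitions.

    Each item is a dict with the (hypothetical) keys:
      q_pred:      Q_phi(s, a, D)
      reward:      r
      v_next:      target value V_phibar(s')
      log_pi_next: log pi_theta(a', D' | s') for one sampled (a', D')
    """
    total = 0.0
    for tr in batch:
        # One-sample estimate of r + gamma * E[V(s') - alpha * log pi(a', D' | s')]
        target = tr["reward"] + gamma * (tr["v_next"] - alpha * tr["log_pi_next"])
        total += (tr["q_pred"] - target) ** 2
    return total / len(batch)
```

In practice the same computation would be vectorized in a deep learning framework so that gradients with respect to $\phi$ can be taken; the plain-Python version only makes the structure of the target explicit.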

The new policy parameter $\theta$ update rule is:

$$\theta_{k+1} = \theta_k + \beta_k\,\mathbb{E}_{s\sim D,\,(a,D)\sim\pi_\theta}\Big[\nabla_\theta\log\pi_\theta(a,D|s)\cdot\big(Q_\phi(s,a,D)\cdot(\alpha_m\cdot R_\tau) - \alpha_\varepsilon - V_{\bar{\phi}}(s) + \alpha\log\pi_\theta(a,D|s)\big)\Big] \tag{17}$$

where $\beta_k$ is the learning rate at step $k$.

To analyze the impact of dynamically adjusting $\alpha_m$ and $\alpha_\varepsilon$, we assume:

  1. Dynamic Adjustment Rules: $\alpha_m$ increases monotonically by a small increment $\psi$ if the reward trend decreases over consecutive episodes, and its upper limit is $\alpha_{max}$, which guarantees algorithmic convergence and prevents reward explosion. $\alpha_\varepsilon$ decreases as defined.

  2. Learning Rate Conditions: $\alpha_k$ and $\beta_k$ must satisfy [31]:

    $$\sum_{k=0}^{\infty}\alpha_k = \infty, \quad \sum_{k=0}^{\infty}\alpha_k^{2} < \infty \tag{18}$$

    $$\sum_{k=0}^{\infty}\beta_k = \infty, \quad \sum_{k=0}^{\infty}\beta_k^{2} < \infty \tag{19}$$

Assuming the critic estimates are unbiased:

$$\mathbb{E}\big[Q_\phi(s,a,D)\cdot(\alpha_m\cdot R_\tau) - \alpha_\varepsilon\big] = Q^{\pi}(s,a,D)\cdot(\alpha_m\cdot R_\tau) - \alpha_\varepsilon \tag{20}$$

Since $R_\tau$ is a positive number within $(0, 1]$, its effect on $Q^{\pi}(s,a,D)$ is linear and does not affect the consistency of the policy gradient.

  1. Positive Scaling: As $0 < R_\tau \leq 1$ and $\alpha_m \geq 0$, $\alpha_m$ only scales the reward without altering its sign. This scaling does not change the direction of the policy gradient but affects its magnitude.

  2. Small Offset: $\alpha_\varepsilon$ is a small constant used to accelerate training. This small offset does not affect the direction of the policy gradient but introduces a minor shift in the value function, which does not alter the overall policy update direction.
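The offset argument can be checked numerically for a simple case. The sketch below assumes a softmax policy over a flattened, discrete set of extended actions $(a, D)$ and computes the exact expected score-function gradient with and without a constant offset; the logits and weights are arbitrary illustrative values, and the check covers only the offset claim, for an offset that is identical across all actions in a state.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score_gradient(logits, weights):
    """E_{u~pi}[grad_logits log pi(u) * weights[u]] for a softmax policy."""
    pi = softmax(logits)
    n = len(logits)
    grad = [0.0] * n
    for u in range(n):
        for j in range(n):
            score = (1.0 if j == u else 0.0) - pi[j]  # d log pi(u) / d logit_j
            grad[j] += pi[u] * score * weights[u]
    return grad


logits = [0.3, -1.2, 0.8, 0.0]
q_scaled = [1.5, -0.4, 2.0, 0.7]   # stands in for Q * (alpha_m * R_tau)
alpha_eps = 0.05

g_plain = expected_score_gradient(logits, q_scaled)
g_offset = expected_score_gradient(logits, [q - alpha_eps for q in q_scaled])
max_gap = max(abs(a - b) for a, b in zip(g_plain, g_offset))
```

The gap is zero up to rounding because the expected score $\mathbb{E}_{u\sim\pi}[\nabla\log\pi(u)]$ vanishes, so any constant subtracted from the weights drops out of the expectation.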

Under these conditions, MOSEAC will converge to a local optimum [24]:

$$\lim_{k\to\infty}\nabla_\theta J(\pi_\theta) = 0 \tag{21}$$

IV-C Complexity Analysis

As MOSEAC closely follows SAC [29], but with an expanded action set that includes durations, its time complexity is similar to that of SAC, with the action set $\mathcal{A}$ replaced by the expanded one $\mathcal{A}\times D$. SAC and MOSEAC have three computational components:

  • Policy Evaluation: consists of computing the value function $V^{\pi}(s)$ and the Q-value function $Q^{\pi}(s,a)$ for each state-action pair, i.e. $O(|\mathcal{S}|\cdot|\mathcal{A}\times D|)$.

  • Policy Improvement: consists of updating the policy parameters $\theta$ based on the policy gradient, involving the Q-value function and the entropy term. The time complexity is $O(|\mathcal{S}|\cdot|\mathcal{A}\times D|)$.

  • Value Function Update: consists of computing the target value using the Bellman backup equation, with a time complexity of $O(|\mathcal{S}|\cdot|\mathcal{A}\times D|)$.

The overall time complexity for one iteration of SAC with the expanded action set, and hence of MOSEAC, is therefore

$$O(|\mathcal{S}|\cdot|\mathcal{A}\times D|) \tag{22}$$

The space complexity of MOSEAC is determined by the storage requirements for the policy, value functions, and other necessary data structures. As for the time complexity, MOSEAC follows SAC with a different action set:

  • Policy Storage: Requires $O(|\theta|)$ space for the policy parameters.

  • Value Function Storage: Requires $O(|\mathcal{S}|\cdot|\mathcal{A}|\cdot|D|)$ space for the value functions.

The overall space complexity is:

$$O(|\theta| + |\mathcal{S}|\cdot|\mathcal{A}|\cdot|D|) \tag{23}$$

The dynamic adjustment of $\alpha_m$ and $\alpha_\varepsilon$ adds a small computational overhead of $O(1)$ per update due to simple arithmetic operations and comparisons.

In summary, MOSEAC has higher computational complexity than SAC, but its dynamic adjustment mechanism and expanded action space provide greater adaptability and potential for improved performance in multi-objective optimization scenarios.
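As a rough illustration of the $|\mathcal{S}|\cdot|\mathcal{A}|\cdot|D|$ term, the sketch below counts the entries of a tabular Q-function with and without a discretized duration set. The sizes are arbitrary and the tabular setting is an assumption made only to make the scaling concrete.

```python
def tabular_q_entries(n_states, n_actions, n_durations=1):
    """Number of Q-table entries, |S| * |A| * |D| (n_durations=1 mimics SAC)."""
    return n_states * n_actions * n_durations


sac_entries = tabular_q_entries(1000, 8)
moseac_entries = tabular_q_entries(1000, 8, n_durations=20)
ratio = moseac_entries / sac_entries  # overhead factor equals |D|
```

Under these assumptions the overhead relative to SAC is exactly the number of candidate durations.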

TABLE I: Complexity Analysis Notation

| Symbol | Description |
|---|---|
| $\mathcal{S}$ | Set of states |
| $\mathcal{A}$ | Set of actions |
| $D$ | Set of action durations |
| $\pi_\theta$ | Policy with parameters $\theta$ |
| $\lvert\theta\rvert$ | Policy parameter count |
| $V^{\pi}(s)$ | Value function |
| $Q^{\pi}(s,a,D)$ | Q-value function with $D$ |
| $\alpha$ | Entropy temperature parameter |
| $\alpha_m$ | Reward scaling parameter |
| $\alpha_\varepsilon$ | Reward offset parameter |
| $R_t$ | Reward at time $t$ |
| $R_\tau$ | Time-based reward |
| $\gamma$ | Discount factor |
| $\phi$ | Value function parameters |
| $\beta_k$ | Learning rate at step $k$ |
| $\mathcal{T}$ | Soft Bellman operator |
| $\mathcal{H}(\pi(\cdot\mid s))$ | Policy entropy |
| $\rho_\pi$ | State-action distribution |
| $O$ | Complexity upper bound |

V Experiments

We conducted a systematic evaluation and validation of MOSEAC’s performance on task completion ability (e.g., navigating to target locations, avoiding yellow lines), resource consumption (measured by the number of steps required to complete the task), and time cost. Additionally, we assessed the trajectory and control output similarity to report the differences between simulation and reality. Finally, we compared the computing resource usage between variable and fixed RL models.

Figure 1 illustrates our workflow. Our simulation-to-reality approach follows a process designed to ensure the reliability of MOSEAC when applied to a real-world Limo. The workflow includes the following steps:

  1. Manual Control and Data Collection: Initially, we manually control the Limo to collect a dataset of its movements.

  2. Supervised Learning: This dataset is then used to train an environment model in a supervised learning manner.

  3. Reinforcement Learning: The trained environment model is applied in the reinforcement learning environment to train the MOSEAC model.

  4. Validation and Fine-Tuning: The MOSEAC model is applied to the real Limo. Fine-tuning is performed based on the recorded movement data to ensure accurate translation from simulation to real-world application.

  5. Application: Finally, the refined MOSEAC model is deployed for practical use and validated through real-world tasks.

Figure 1: The workflow for our MOSEAC implementation on the Agilex Limo. We use a joystick to control the Limo's movement for the initial environmental data collection (top).

V-A Environment Setup

Figure 2 shows Limo’s real-world validation environment. We used an OptiTrack [32] system for real-time positioning.

Figure 2: This photo depicts the real-world environment used to validate the performance of MOSEAC on the Agilex Limo. The cameras on the left, right, and middle stands are three of the four cameras comprising the OptiTrack positioning system.

We use a 3 m × 3 m global coordinate frame whose center is aligned with the center of the 2D map, and we record the positions of the yellow lines within the map. The Limo’s navigable area is confined within the map boundaries, excluding four enclosed regions. Specific values of these positions, the specifications of the Agilex Limo, and other metrics can be found in Appendix B.

V-B Simulation Environment

For our kinematic modeling, we use an Ackerman model:

Definition 6

The Ackerman kinematic model.

Table II gives the symbols and descriptions.

TABLE II: List of Symbols and Descriptions

| Symbol | Description |
|---|---|
| $X$ and $Y$ | Current positions of Limo |
| $V$ | Current velocity of Limo |
| $\theta$ | Current angular velocity of Limo |
| $\Delta t$ | Control duration |
| $V_{target}$ | Control linear velocity |
| $\delta$ | Control angular velocity |
| $\mu_k$ | Coefficient of kinetic friction |
| $P$ | Power factor (which can be negative) of Limo |
| $g = 9.81\,\text{m/s}^{2}$ | Gravitational acceleration |
| $M = 4.2\,\text{kg}$ | Mass of the Limo (measured) |
| $L = 0.204\,\text{m}$ | Distance between the centers of the front and rear wheels (measured) |

The forces and accelerations are calculated as:

$$\begin{aligned}
F_{friction} &= -\mu_k M g \\
a_{friction} &= \frac{F_{friction}}{M} = -\mu_k g \\
a_{power} &= \frac{P}{M} \\
a_{net} &= a_{power} + \frac{V_{target} - V}{\Delta t} + a_{friction}
\end{aligned} \tag{24}$$

The updates for velocity and position are:

$$\begin{aligned}
V_{new} &= V + a_{net} \Delta t \\
X_{new} &= X + V_{new} \cos(\theta_{new}) \Delta t \\
Y_{new} &= Y + V_{new} \sin(\theta_{new}) \Delta t
\end{aligned} \tag{25}$$

This formulation accounts for the dynamics of the Limo vehicle, considering friction, power, and vehicle mass, and ultimately predicts the new position. This kinematic model is referred to as $M_{Ackerman}$.
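The update in Eqs. (24) and (25) can be sketched in a few lines of Python. This is a minimal illustration rather than the released implementation: the function name `ackermann_step` is hypothetical, and the new heading $\theta_{new}$ is assumed to have been computed beforehand and is passed in as an argument.

```python
import math


def ackermann_step(x, y, v, theta_new, dt, v_target, mu_k, p, mass=4.2, g=9.81):
    """One position/velocity update following Eqs. (24)-(25).

    theta_new is assumed to be computed elsewhere and is taken as given here.
    """
    a_friction = -mu_k * g          # F_friction / M = -mu_k * g
    a_power = p / mass              # a_power = P / M
    a_net = a_power + (v_target - v) / dt + a_friction
    v_new = v + a_net * dt
    x_new = x + v_new * math.cos(theta_new) * dt
    y_new = y + v_new * math.sin(theta_new) * dt
    return x_new, y_new, v_new
```

With $\mu_k = 0$ and $P = 0$ the net acceleration reduces to $(V_{target} - V)/\Delta t$, so the new velocity equals the target velocity after one step, which is a convenient sanity check.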

We employ supervised learning [33, 34] to model the motion dynamics from the real-world environment to the simulation environment. This model predicts kinematic and physical information, $\mu_k$ and $P$, across different regions within the environment. We select a Transformer model [35] to predict these data (footnote 1: our code is publicly available on GitHub, including the shape and related hyperparameters of this Transformer model). The input, output, and loss function are defined in Definition 7.

Definition 7

The input $\mathbf{X}$ is:

$$\mathbf{X} = \begin{pmatrix} X, Y, V, \theta, \Delta t, V_{target}, \delta \end{pmatrix}$$

The output $\mathbf{Y}$ is:

$$\mathbf{Y} = \begin{pmatrix} \mu_k, P \end{pmatrix}$$

Using $M_{Ackerman}$ from Definition 6, the loss function of the Transformer model $T$ is computed as:

$$\begin{aligned}
\text{predict} &= M_{Ackerman}(\mathbf{X}, \mathbf{Y}, g) \\
\text{loss} &= \frac{1}{N} \sum_{i=1}^{N} \left( \text{predict}_i - \text{target}_i \right)^2
\end{aligned} \tag{26}$$

The loss function is the Mean Squared Error (MSE) between the predicted positions and the target positions recorded in the dataset, where:

  • $\text{predict}_i$ are the predicted positions for the $i$-th data point

  • $\text{target}_i$ are the actual positions for the $i$-th data point
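As an illustration of the loss in Eq. (26), the following sketch computes the MSE over a batch of predicted and recorded positions. It assumes each position is an $(x, y)$ pair and that the squared error of a sample is its squared Euclidean norm; the function name `position_mse` is hypothetical.

```python
def position_mse(predicted, target):
    """Mean squared error between predicted and recorded (x, y) positions."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for (px, py), (tx, ty) in zip(predicted, target):
        # Squared Euclidean distance for one sample (an assumption of this sketch).
        total += (px - tx) ** 2 + (py - ty) ** 2
    return total / len(predicted)
```

In training, `predicted` would be the output of the kinematic model driven by the Transformer's estimates of $\mu_k$ and $P$, so the gradient flows through the kinematic equations back to the network.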

After learning the kinematic model, we need to address the state definition of MOSEAC. The state should include the environmental information the model needs to understand the positions of obstacles in the environment. Although we have recorded all the positions of these yellow lines, it is not reasonable to input them directly as state values. Doing so would compromise the generality of our approach, and a large amount of invariant position information might make it difficult for the neural network to capture the critical information, or lead to overfitting [36, 37].

We developed a simulated lidar system centered on Limo’s current position. This system generates 20 rays, similar to lidar operation, to detect intersections with enclosed regions in the environment. These intersections provide information about restricted areas, as shown in Figure 3.

Figure 3: The simulated lidar system generates 20 rays from Limo’s position, calculates their intersections with environment enclosed regions, and returns the nearest intersection points for each ray.
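A simulated lidar of this kind can be sketched as ray casting against line segments. The sketch below is illustrative only: it assumes the enclosed regions are given as a list of segments, that the 20 rays are evenly spaced over a full circle in the world frame, and that a ray with no intersection returns a hypothetical `max_range`. It returns distances rather than intersection points.

```python
import math


def ray_segment_hit(origin, angle, seg_start, seg_end):
    """Distance along the ray to the segment, or None if there is no hit."""
    ox, oy = origin
    dx, dy = math.cos(angle), math.sin(angle)
    (x1, y1), (x2, y2) = seg_start, seg_end
    ex, ey = x2 - x1, y2 - y1
    denom = dx * ey - dy * ex
    if abs(denom) < 1e-12:
        return None  # ray and segment are (numerically) parallel
    # Solve origin + t * d = seg_start + u * e for t (ray) and u (segment).
    t = ((x1 - ox) * ey - (y1 - oy) * ex) / denom
    u = ((x1 - ox) * dy - (y1 - oy) * dx) / denom
    if t >= 0.0 and 0.0 <= u <= 1.0:
        return t
    return None


def simulate_lidar(origin, segments, num_rays=20, max_range=10.0):
    """Nearest hit distance for each of num_rays evenly spaced rays."""
    readings = []
    for k in range(num_rays):
        angle = 2.0 * math.pi * k / num_rays
        nearest = max_range
        for seg_start, seg_end in segments:
            hit = ray_segment_hit(origin, angle, seg_start, seg_end)
            if hit is not None and hit < nearest:
                nearest = hit
        readings.append(nearest)
    return readings
```

For a robot at the center of a 2 m square, the rays aligned with the axes return 1 m and all other rays return values between 1 m and $\sqrt{2}$ m.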

Since the enclosed regions data in our simulation are based on real-world measurements, the gap between the virtual and real world is negligible. We designated eight turning points on this map as navigation endpoints, with a specific starting point for Limo.

The state dimensions for our MOSEAC model include: robot current position, goal position (chosen among 8 random locations), linear velocity, steering angle, 20 lidar points, and previous control duration, linear and angular velocity. The action dimensions include control duration, linear velocity, and angular velocity.

Let $R_t$ represent the task reward and $T$ denote the termination flag (where 1 indicates termination and 0 indicates continuation). The function $d(\mathbf{p}_1, \mathbf{p}_2)$ measures the distance between positions $\mathbf{p}_1$ and $\mathbf{p}_2$. The variables $\mathbf{a}_{\text{new}}$ and $\mathbf{t}$ denote the new position of the agent and the target position, respectively. The initial distance from the starting position to the target position is $d_0$, and $\delta$ specifies the threshold distance determining whether Limo is sufficiently close to a point. Additionally, $C_{\text{inner}}$ indicates a collision with enclosed regions, while $C_{\text{outer}}$ indicates a collision with the map boundary.

Implementing a penalty mechanism instead of a termination mechanism in our reward definition offers significant advantages: the penalty mechanism provides incremental feedback, allowing the model to continue learning even after errors, thereby avoiding frequent resets that could disrupt the training process. This approach reduces overall training time and encourages the agent to explore a broader range of strategies, leading to a more comprehensive understanding of the environment’s dynamics. However, we still terminate an experiment if the robot wanders outside the map boundary, to stop meaningless exploration. The reward $R_t$ is:

$$R_t = \begin{cases}
R_{\text{boundary\_penalty}}, & \text{if } C_{\text{inner}}(\mathbf{a}_{\text{new}}) \\
R_{\text{success\_reward}}, & \text{if } d(\mathbf{a}_{\text{new}}, \mathbf{t}) \leq \delta \\
R_{\text{failure\_penalty}}, & \text{if } C_{\text{outer}}(\mathbf{a}_{\text{new}}) \\
R_{\text{distance\_reward}} = d_0 - d(\mathbf{a}_{\text{new}}, \mathbf{t}), & \text{otherwise}
\end{cases} \tag{27}$$

The termination flag $T$ is determined as:

$$T = \begin{cases}
0, & \text{if } C_{\text{inner}}(\mathbf{a}_{\text{new}}) \\
1, & \text{if } d(\mathbf{a}_{\text{new}}, \mathbf{t}) \leq \delta \\
1, & \text{if } C_{\text{outer}}(\mathbf{a}_{\text{new}}) \\
0, & \text{otherwise}
\end{cases} \tag{28}$$
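Eqs. (27) and (28) can be read as a single case analysis that returns the reward and the termination flag together. The sketch below assumes the cases are evaluated in the order listed and uses placeholder values for the penalty and reward constants, which are not specified in this section; the function name is hypothetical.

```python
import math


def reward_and_termination(a_new, target, d0, delta, hit_inner, hit_outer,
                           boundary_penalty=-1.0, success_reward=10.0,
                           failure_penalty=-10.0):
    """Return (R_t, T) following the case order of Eqs. (27)-(28).

    The three constants are illustrative placeholders, not the paper's values.
    """
    dist = math.dist(a_new, target)
    if hit_inner:              # collision with an enclosed region: penalize, continue
        return boundary_penalty, 0
    if dist <= delta:          # target reached: reward, terminate
        return success_reward, 1
    if hit_outer:              # left the map boundary: penalize, terminate
        return failure_penalty, 1
    return d0 - dist, 0        # progress toward the target, continue
```

Only the success and map-boundary cases end the episode, which matches the penalty-instead-of-termination design described above.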

The initial data collected using remote control was skewed for short action durations, causing some discrepancies between the physical robot behaviour and the behaviour in simulation. To address these discrepancies we fine-tuned a Transformer model that approximates Limo’s kinematic model by collecting new real-world movement data.

We begin by freezing all the layers of the Transformer model except the final fully connected layer. This ensures that the initial generalized features are preserved while adapting the model’s outputs to the specific characteristics of the Limo dataset. The training process is:

  1. Freeze Initial Layers: Initially, all layers except the final fully connected layer are frozen. With $\theta_i$ denoting the parameters of the $i$-th layer:

    $$\text{if } i < N_{\text{unfrozen}}, \quad \nabla_{\theta_i} L = 0 \tag{29}$$

    where $L$ is the loss function and $N_{\text{unfrozen}}$ is the number of layers that are not frozen.

  2. Gradual Unfreezing: After training the final layer, we incrementally unfreeze the preceding layers and fine-tune the model further. This is done iteratively: at each stage, one additional layer is unfrozen and the model is re-trained. The number of unfrozen layers increases until all layers are eventually fine-tuned:

    $$\text{if } i \geq N_{\text{unfrozen}}, \quad \nabla_{\theta_i} L \neq 0 \tag{30}$$

  3. Early Stopping: To prevent overfitting, we employ early stopping. If the validation loss does not improve for a specified number of epochs, training is halted:

    $$\text{if } \Delta L_{\text{valid}} > 0 \text{ for } n \text{ epochs}, \quad \text{stop training} \tag{31}$$

    where $\Delta L_{\text{valid}}$ is the change in validation loss and $n$ is the patience parameter.

  4. Evaluation: The model’s performance is evaluated on a test dataset after each training phase. The final model is selected based on the lowest validation loss.
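The control flow of the steps above can be sketched independently of any deep learning framework. In this minimal sketch, layer indices count from the input side (so the last index is the final fully connected layer), and `train_one_epoch` is a hypothetical callback that trains only the listed layers for one epoch and returns the validation loss; both conventions are assumptions of the example.

```python
def gradual_unfreeze_schedule(num_layers):
    """Yield, stage by stage, the indices of the trainable layers (last layer first)."""
    for n_trainable in range(1, num_layers + 1):
        yield list(range(num_layers - n_trainable, num_layers))


class EarlyStopper:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def fine_tune(num_layers, train_one_epoch, patience=3, max_epochs=50):
    """Run every unfreezing stage with early stopping; return the best validation loss."""
    best_loss = float("inf")
    for trainable in gradual_unfreeze_schedule(num_layers):
        stopper = EarlyStopper(patience)
        for _ in range(max_epochs):
            val_loss = train_one_epoch(trainable)
            best_loss = min(best_loss, val_loss)
            if stopper.should_stop(val_loss):
                break
    return best_loss
```

In a PyTorch setting, the callback would typically toggle `requires_grad` on the parameters of the listed layers before running the epoch.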

The total training process can be described by the following optimization problem:

$$\min_{\theta} \sum_{(x, y) \in \mathcal{D}_{\text{train}}} \mathcal{L}(f_{\theta}(x), y) \tag{32}$$

where $\mathcal{D}_{\text{train}}$ is the training dataset, $f_{\theta}$ is the model with parameters $\theta$, and $\mathcal{L}$ is the loss function of Definition 7. The fine-tuning process specifically involves updating the parameters layer by layer, ensuring that the generalized pre-trained knowledge is gradually adapted to the specific nuances of the new Limo dataset.

This fine-tuning strategy allows the use of pre-trained models, adapting them to new tasks with potentially smaller datasets while maintaining robust performance and preventing overfitting.

For more detailed information please refer to Appendix C.

VI Results

We conducted six experiments involving MOSEAC, MOSEAC without the $\alpha_{max}$ limitation, SEAC, CTCO, and SAC (at 20 Hz and 60 Hz, denoted SAC20 and SAC60, respectively) within the simulation environment and applied these trained models to the real Limo. These training tests were performed on a computer equipped with an Intel Core i5-13600K CPU and an Nvidia RTX 4070 GPU running Ubuntu 22.04 LTS. The deployment tests were conducted on an AgileX Limo equipped with a Jetson Nano [38] (with a Cortex-A57 CPU and a 128-core Maxwell GPU) running Ubuntu 18.04 LTS.

We trained our RL model using PyTorch [39] within a Gymnasium environment [40]. Subsequently, we converted the model parameters to ONNX format [39]. Finally, we utilized TensorRT [41] to build an inference engine from the ONNX model in the AgileX Limo local environment and deployed it for RL navigation tasks (footnote 2: our code is publicly available on GitHub, including the hyperparameters of the MOSEAC model). More system information details can be found in Appendix D.

Figure 4: Average returns of 5 reinforcement learning algorithms over 2.5M steps during training.
Figure 5: Average energy costs of 5 reinforcement learning algorithms over 2.5M steps during training.

Figure 4 and Figure 5 illustrate the results of the training process. Several key insights can be drawn from these figures:

Figure 4 indicates that MOSEAC consistently improves over time compared to the other algorithms, suggesting that MOSEAC adapts well to the environment and maintains a stable learning curve. While SEAC also shows stable performance, it converges slightly slower than MOSEAC. Additionally, Figure 4 illustrates that MOSEAC exhibits higher action duration robustness than CTCO. MOSEAC utilizes a fixed discount factor ($\gamma$), ensuring stable long-term planning capabilities without the negative impact of varying $\gamma$. In contrast, CTCO’s performance is susceptible to the choice of action duration range, significantly affecting its $\gamma$. As training progresses, CTCO tends to favor smaller $\gamma$ values, placing greater demands on its $\tau$ parameter that controls the range of $\gamma$. This sensitivity makes CTCO less adaptable to diverse environments, whereas MOSEAC maintains consistent performance. Furthermore, Figure 5 provides a comparative analysis of energy costs among the different algorithms. MOSEAC demonstrates lower energy costs per task, indicating higher energy efficiency during training. This efficiency is critical for practical applications where compute resource consumption is a concern.

Figure 6: Average returns of MOSEAC and MOSEAC (without $\alpha_m$ limitation) over 2.5M steps during training.

We compared the effects of imposing an upper limit on the parameter $\alpha_m$ versus allowing it to increase without restriction in the context of our MOSEAC algorithm. Our findings indicate significant differences in performance stability and energy efficiency between the two approaches.

Introducing an upper limit on $\alpha_m$ (denoted as $\alpha_{max}$) is required to prevent reward explosion. Figure 6 shows that MOSEAC has consistent and stable improvement in average reward. In contrast, MOSEAC without the upper limit initially followed a similar trend but eventually diverged, leading to instability and potential reward explosion. This divergence suggests that without the upper limit, $\alpha_m$ may increase uncontrollably, destabilizing the reward structure.

Figure 7: Average energy cost of MOSEAC and MOSEAC (without $\alpha_m$ limitation) over 2.5M steps during training.

However, as shown in Figure 7, MOSEAC without an upper limit on $\alpha_m$ generally exhibited lower average energy costs compared to MOSEAC with the upper limit. This suggests that while the absence of an upper limit on $\alpha_m$ may lead to reward explosion, it does not significantly impair task performance in practice.

We update $\alpha_m$ based on the declining reward trend. However, in random multi-objective tasks with varying target locations, the average reward can fluctuate significantly, causing $\alpha_m$ to be updated continuously. This can lead to an unbounded increase in reward during training when $\alpha_m$ is unrestricted, amplifying the reward signal. The increased gradient variability results in rapid strategy adjustments, enhancing short-term energy efficiency. Despite the potential for reward explosion, the impact on task performance is minimal, with notable improvements in energy efficiency.

Figure 8: Energy cost (as number of time steps) for 100 random tasks. We use the same seed for MOSEAC and SEAC to ensure that the tasks are the same for the two algorithms.
Figure 9: Time cost for 100 random tasks. We use the same seed for MOSEAC and SEAC to ensure that the tasks are the same for the two algorithms.

We compared MOSEAC’s task performance with the best-performing SEAC after training. Due to the poor performance of CTCO and SAC, we excluded them from the analysis.

Figure 8 and Figure 9 illustrate the energy and time distributions for the MOSEAC and SEAC methods. Given the lack of normality in the data distribution, we use the Wilcoxon signed-rank test to compare the paired samples. For energy consumption (Figure 8), MOSEAC demonstrates significantly lower median and overall energy usage than SEAC ($W = 35.0$, $z = -7.904$, $p < .001$). For time efficiency (Figure 9), MOSEAC also shows a significantly lower median time, indicating quicker task completion than SEAC ($W = 502.0$, $z = -6.956$, $p < .001$). The descriptive statistics and test results are summarized in Appendix E.
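For readers unfamiliar with the test, the following sketch shows how the $W$ statistic and its normal-approximation $z$ score are commonly computed for paired samples. It is a textbook-style illustration, not the analysis script used for the reported values: it drops zero differences, assigns average ranks to ties, and applies no tie or continuity correction to the variance, so it will not necessarily reproduce the numbers above.

```python
import math


def wilcoxon_signed_rank(sample_a, sample_b):
    """Return (W, z) for paired samples using the normal approximation."""
    diffs = [a - b for a, b in zip(sample_a, sample_b) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank the absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4.0
    std = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    return w, (w - mean) / std
```

A small $W$ with a strongly negative $z$ indicates that the differences are almost all of one sign, which is the pattern reported for both the energy and time comparisons.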

The improved performance of MOSEAC over SEAC can be attributed to its reward function. While SEAC’s reward function is linear, combining task reward, energy penalty, and time penalty independently, MOSEAC introduces a multiplicative relationship between task reward and time-related reward. This non-linear interaction enhances the reward signal, particularly when both task performance and time efficiency are high, and naturally balances these factors. By keeping the energy penalty separate, MOSEAC maintains flexibility in tuning without complicating the relationship between time and task rewards. This design allows MOSEAC to more effectively guide the agent’s decisions, resulting in better energy efficiency and task completion speed in practical applications.

We conducted real-world tests on the AgileX Limo using ROS [42] (footnote 3: our code for the ROS workspace is publicly available on GitHub, which also provides support for ROS 2 [43] and Docker [44]). The Limo robot navigated to random and distinct endpoints using the MOSEAC model as its control policy for Ackerman steering (footnote 4: video available here: https://youtu.be/VhTa66WqxoU). We collected these data and calculated trajectory and control output similarities with respect to simulation.

For trajectory similarity, we combine all trajectory data into one trajectory. The ATE (Average Trajectory Error) [45] is $0.036$ meters, indicating minimal deviations between the actual and simulated paths. Additionally, the Dynamic Time Warping (DTW) [46] value of $0.531$ supports the high degree of similarity between the temporal sequences of the trajectories. These metrics suggest that our method enables the Limo to follow the planned paths with high precision, closely mirroring the simulation.
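Both trajectory metrics are straightforward to compute from two lists of $(x, y)$ points. The sketch below is a generic illustration under stated assumptions: the ATE is taken as the mean Euclidean distance between time-aligned points, and the DTW value is the unnormalized cumulative cost of the classic dynamic-programming recurrence. The exact alignment and normalization behind the reported values are not specified in this section, and the function names are hypothetical.

```python
import math


def average_trajectory_error(real, sim):
    """Mean Euclidean distance between time-aligned trajectory points."""
    if len(real) != len(sim) or not real:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, q) for p, q in zip(real, sim)) / len(real)


def dtw_distance(real, sim):
    """Unnormalized dynamic time warping cost between two trajectories."""
    n, m = len(real), len(sim)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(real[i - 1], sim[j - 1])
            # Extend the cheapest of the three admissible warping steps.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]
```

Unlike the ATE, DTW tolerates trajectories of different lengths or speeds, which is useful when the real robot and the simulation are not sampled at identical instants.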

Regarding control output similarity, the Mean Absolute Error (MAE) is $0.002$, and the Mean Squared Error (MSE) is $4.49 \times 10^{-5}$. These low error values indicate that the control outputs in the real world closely match those in the simulation. Our method effectively minimizes the discrepancies in control signals, ensuring that the robot’s actions are consistent across different environments.

Overall, the empirical data supports the theoretical claims regarding MOSEAC’s performance in trajectory fidelity and control output consistency.

We also recorded the computing resource usage, highlighting the efficiency of the MOSEAC algorithm in terms of energy consumption, particularly computational energy. Figure 10 provides a comparison of the average usage per second of CPU and GPU resources between MOSEAC and SAC (Soft Actor-Critic) algorithms running at different frequencies (10 Hz and 60 Hz).

Figure 10: Comparison of Average Compute Resource Usage Across Different Methods

The CPU usage data reveals that MOSEAC significantly reduces computational load compared to SAC at both 10 Hz and 60 Hz. Specifically, MOSEAC utilizes only $11.40\% \pm 0.12$ of the CPU resources, whereas SAC requires $16.80\% \pm 0.14$ at 10 Hz and $31.41\% \pm 1.47$ at 60 Hz. Similarly, the GPU usage data indicates that MOSEAC is more efficient, using only $2.80\% \pm 0.07$ compared to SAC’s $13.79\% \pm 0.05$ at 10 Hz and $27.86\% \pm 0.19$ at 60 Hz. This reduction conserves energy (increasing battery life) and frees processing power for tasks like perception and communication. The descriptive statistics and test results are summarized in Appendix F.

VII Conclusions

In this paper, we presented the Multi-Objective Soft Elastic Actor-Critic (MOSEAC) algorithm, a Variable Time Step Reinforcement Learning (VTS-RL) method designed with adaptive hyperparameters that respond to observed reward trends during training. Our analysis included theoretical performance guarantees, convergence analysis, and practical validation through simulations and real-world navigation tasks using a small rover.

We compared MOSEAC with other VTS-RL algorithms, such as SEAC [8] and CTCO [7]. While SEAC and CTCO improved over traditional fixed-time step methods, they still required extensive hyperparameter tuning and did not achieve the same level of efficiency and robustness as MOSEAC. SEAC’s reward structure, though effective, was less adaptable, and CTCO’s sensitivity to action duration further limited its practical application.

Additionally, the empirical data demonstrated that MOSEAC significantly outperforms traditional SAC algorithms, particularly in terms of computational resource efficiency. MOSEAC’s reduced CPU and GPU usage frees up resources for other critical tasks, such as environment perception and map reconstruction, enhancing the robot’s operational efficiency and extending battery life. This makes MOSEAC highly suitable for long-term and complex missions.

Our findings validate the robustness and applicability of MOSEAC in real-world scenarios, providing strong evidence of its potential to lower hardware requirements and improve data efficiency in reinforcement learning deployments.

Our future work will further refine the algorithm, particularly in the adaptive tuning of hyperparameters, to enhance its performance and applicability. We aim to apply MOSEAC to a broader range of robotic projects, including smart cars and robotic arms, to fully leverage its benefits in diverse practical settings.

VIII Acknowledgement

The authors sincerely thank Yann Bouteiller, Guillaume Ricard, and Wenqiang Du for their invaluable assistance with the sim-to-real method discussions, AgileX Limo usage, and ROS support. Their expertise and dedication were crucial in achieving high-quality sim-to-real deployment, which was essential for the success of this research. We greatly appreciate their significant time and effort invested in this project.

Appendix A Analysis of the Adaptive $\alpha_m$ Adjustment in MOSEAC

In MOSEAC we use an adaptive adjustment scheme for $\alpha_m$, specifically a simple linear increment with an upper limit $\alpha_{\max}$. The inclusion of $\psi$ and the reward adjustment strategy inherently involves Pareto optimization. Below, we provide a mathematical analysis of the stability of our scheme and its relation to the Pareto front.

A-A Adaptive Adjustment Scheme for $\alpha_m$

Let $\alpha_m$ be adjusted linearly over time with an upper limit $\alpha_{\max}$:

$$\alpha_m(t) = \min(\alpha_{m,0} + kt, \alpha_{\max}) \tag{33}$$

where $\alpha_{m,0}$ is the initial value, $k$ is the increment rate, and $t$ represents the time or iteration index.
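Eq. (33) is a clipped linear schedule, sketched below. The function name and the numeric values in the usage note are illustrative only, and $t$ is assumed to count the number of updates applied so far.

```python
def alpha_m_schedule(t, alpha_m0, k, alpha_max):
    """Clipped linear schedule of Eq. (33): min(alpha_m0 + k * t, alpha_max)."""
    return min(alpha_m0 + k * t, alpha_max)
```

For example, with `alpha_m0 = 1.0`, `k = 0.5`, and `alpha_max = 3.0`, the value grows linearly for the first four updates and stays at 3.0 afterwards.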

The stability of this linear adjustment can be analyzed by examining the impact on the reward function $R$ and the policy updates.

A-B Reward Function and $\psi$

The reward function $R$ in the MOSEAC algorithm can be expressed as:

$$R = r + \psi \tag{34}$$

where $r$ is the immediate reward and $\psi$ is an adjustment term that influences the reward based on the adaptive $\alpha_m$.

Given the linear increment of $\alpha_m$, the adjustment term $\psi$ can be represented as:

$$\psi(t) = f(\alpha_m(t)) = f(\alpha_{m,0} + kt) \text{ for } \alpha_m(t) < \alpha_{\max} \tag{35}$$
$$\psi(t) = f(\alpha_{\max}) \text{ for } \alpha_m(t) \geq \alpha_{\max} \tag{36}$$

where $f$ is a function that defines how $\psi$ depends on $\alpha_m$.

A-C Stability Analysis

To analyze the stability of the reward function under this adaptive scheme, we examine the boundedness and convergence of R𝑅Ritalic_R. The stability is ensured if the cumulative reward remains bounded and the policy converges to an optimal policy over time.

  1. Boundedness: The linear increment of $\alpha_m$ with an upper limit should ensure that $R$ does not grow without bound. This requires that $f(\alpha_m)$ grows at a controlled rate.

    $$|R| = |r + f(\alpha_{m,0} + kt)| \leq M \quad \text{for } \alpha_m(t) < \alpha_{\max} \tag{37}$$
    $$|R| = |r + f(\alpha_{\max})| \leq M \quad \text{for } \alpha_m(t) \geq \alpha_{\max} \tag{38}$$

    where $M$ is a constant, indicating that $R$ remains bounded.

  2. Convergence: The policy $\pi$ should converge to an optimal policy $\pi^{*}$. The convergence is influenced by the adjustment term $\psi$ and the learning rate $\alpha_m$.

    $$\lim_{t \to \infty} \pi(t) = \pi^{*} \tag{39}$$

    If $\alpha_m$ is adjusted linearly with an upper limit, the learning rate should decrease over time to ensure convergence.
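One way to make the constant $M$ in Eqs. (37)-(38) explicit is sketched below. It relies on two assumptions that are not stated above: the base reward is bounded, $|r| \leq r_{\max}$, and $f$ is continuous on the closed interval $[\alpha_{m,0}, \alpha_{\max}]$ with $k > 0$, so that $\alpha_m(t)$ never leaves this interval.

```latex
\begin{aligned}
|R| &= |r + f(\alpha_m(t))| \\
    &\leq |r| + |f(\alpha_m(t))| \\
    &\leq r_{\max} + \max_{\alpha \in [\alpha_{m,0},\, \alpha_{\max}]} |f(\alpha)| =: M .
\end{aligned}
```

Under these assumptions the maximum exists because a continuous function on a closed, bounded interval attains its maximum, so a single $M$ covers both cases.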

A-D Multi-Objective Optimization

The multi-objective optimization aspect of the MOSEAC algorithm arises from the trade-off between multiple objectives in the reward function. The inclusion of $\psi$ introduces a multi-objective optimization problem, where the algorithm aims to optimize the cumulative reward while balancing different aspects influenced by $\psi$.

The Pareto front represents the set of optimal policies where no single objective can be improved without degrading another [47]. The linear increment of $\alpha_m$ with an upper limit and the presence of $\psi$ ensure that the algorithm explores the policy space effectively, converging to solutions on the Pareto front [48].
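The notion of Pareto optimality used here can be illustrated with a small non-dominated filter. This sketch is not part of MOSEAC itself: the candidate scores (task reward, negated elapsed time) are hypothetical numbers, and both objectives are assumed to be maximized.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical policy evaluations: (task reward, -elapsed time).
candidates = [(10.0, -5.0), (8.0, -3.0), (7.0, -6.0), (10.0, -4.0)]
front = pareto_front(candidates)
print(front)
```

In this toy set, a policy is kept only if no other policy is at least as good on both objectives and strictly better on one.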

Appendix B Real Environment

Table III displays the coordinate information of the enclosed regions in the real environment; Table IV shows the specifications of the Agilex Limo; Table V shows the key performance metrics of the OptiTrack system.

TABLE III: Enclosed regions (arrays of yellow line segments) in the real environment, in meters

| Zone | Coordinates |
| --- | --- |
| zone_left_down | [[0.0, 0.0], [-1.0, 0.0]], [[-1.0, 0.0], [-1.0, -1.0]], [[-1.0, -1.0], [0.0, -1.0]], [[0.0, -1.0], [0.0, -0.75]], [[0.0, -0.75], [-0.57, -0.75]], [[-0.57, -0.75], [-0.57, -0.3]], [[-0.57, -0.3], [0.0, -0.3]], [[0.0, -0.3], [0.0, 0.0]] |
| zone_left_up | [[0.0, 0.4], [0.0, 1.0]], [[0.0, 1.0], [-1.0, 1.0]], [[-1.0, 1.0], [-1.0, 0.4]], [[-1.0, 0.4], [0.0, 0.4]] |
| zone_right_up | [[0.5, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [0.5, 1.0]], [[0.5, 1.0], [0.5, 0.0]] |
| zone_right_down | [[0.5, -1.0], [0.5, -0.5]], [[0.5, -0.5], [1.0, -0.5]], [[1.0, -0.5], [1.0, -1.0]], [[1.0, -1.0], [0.5, -1.0]] |

TABLE IV: AgileX Limo Specifications [11]

| Category | Specifications |
| --- | --- |
| Dimensions | 322 mm x 220 mm x 251 mm |
| Weight | 4.2 kg |
| Maximum Speed | 1 m/s |
| Maximum Climbing Capacity | 25° (omni-wheel, differential, Ackermann steering), 40° (tracked mode) |
| Ground Clearance | 24 mm |
| Battery | 12 V Li-ion, 5600 mAh |
| Run Time | 40 minutes of continuous operation |
| Standby Time | 2 hours |
| Charging Time | 2 hours |
| Operating Temperature | -10°C to 40°C |
| IP Rating | IP22 (splash-proof, dust-proof) |
| Sensors | IMU: MPU6050; LiDAR: EAI X2L; Depth Camera: ORBBEC DaBai |
| CPU | ARM64 Quad-Core 1.43 GHz (Cortex-A57) |
| GPU | 128-core NVIDIA Maxwell™ @ 921 MHz |
| Communication | Wi-Fi, Bluetooth 5.0 |
| Operating System | Ubuntu 18.04, ROS1 Melodic |

TABLE V: Key Performance Metrics of the OptiTrack System [32]

| Performance Metric | Value | Description |
| --- | --- | --- |
| Camera Resolution | Up to 4096x2160 | High resolution for detailed tracking |
| Frame Rate | Up to 360 FPS | Ensures smooth motion capture |
| Latency | As low as 3 ms | Provides real-time feedback |
| Field of View (FOV) | 56° to 100° | Suitable for various capture needs |
| Synchronization Precision | < 1 µs | Multi-camera synchronized capture |
| Marker Tracking Accuracy | < 0.5 mm | Precise 3D spatial positioning |
| Operating Range | Up to 30 m | Suitable for large-scale capture environments |
| Optical Resolution | 12 MP | High-quality image data |
| Ambient Light Suppression | Strong | Reduces interference from ambient light |
| Data Interface | Gigabit Ethernet | Fast data transmission |

Appendix C Simulation Environment

In each episode, the goal position is drawn at random from the eight points listed in Table VI. Setting multiple endpoints offers several advantages. It enhances the navigation policy’s generalization capability by training the agent to adapt to various goals rather than a single target. This promotes extensive exploration, prevents the agent from getting stuck in local optima, and increases robustness by preparing the agent to handle dynamic or uncertain target positions in real-world deployments [49].

Additionally, it improves data efficiency by allowing the agent to learn from diverse experiences in a single training session. Multiple endpoints provide richer reward signals, accelerating learning through varied success and failure contexts. Moreover, it simulates realistic scenarios where multiple destinations are common, thus increasing the practical value of the trained model. Lastly, this approach raises task complexity, challenging the agent to develop more sophisticated strategies and ultimately enhancing overall performance [50].

TABLE VI: List of goal positions with their coordinates

| Goal Position | Coordinates |
| --- | --- |
| Goal Position 1 | [1.2, 1.2] |
| Goal Position 2 | [1.2, -1.2] |
| Goal Position 3 | [-1.2, 1.2] |
| Goal Position 4 | [-1.2, -1.2] |
| Goal Position 5 | [1.2, 0] |
| Goal Position 6 | [0, 1.2] |
| Goal Position 7 | [-1.2, 0] |
| Goal Position 8 | [0, -1.2] |

The nominal start point is fixed at [-0.2, -0.5]. To make the agent less sensitive to small errors in its initial placement, and thus reduce the risk of misalignment in real-world deployment, we perturb the starting coordinates with uniform noise. This approach aims to minimize data uncertainty caused by position discrepancies. The noise is drawn uniformly from the range $[-0.05, 0.05]$ and added independently to each coordinate. Let:

  • $\mathbf{a}_{\text{location}}$ be the agent’s position

  • $U(a, b)$ denote a uniform distribution in the range $[a, b]$

The updated position formula is:

$$\mathbf{a}_{\text{location}} = \begin{bmatrix} -0.2 + U(-0.05, 0.05) \\ -0.5 + U(-0.05, 0.05) \end{bmatrix} \tag{40}$$

This method ensures that the uniform noise maintains the desired variability without introducing bias [51].
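The episode initialization described in this appendix can be sketched as follows. This is an illustrative snippet rather than the authors' environment code: the function name `sample_episode` is hypothetical, and the goal is assumed to be chosen with equal probability among the eight points, which the text does not state explicitly.

```python
import random
from typing import List, Tuple

# Goal positions from Table VI (meters).
GOALS: List[List[float]] = [
    [1.2, 1.2], [1.2, -1.2], [-1.2, 1.2], [-1.2, -1.2],
    [1.2, 0.0], [0.0, 1.2], [-1.2, 0.0], [0.0, -1.2],
]
START: Tuple[float, float] = (-0.2, -0.5)  # nominal start point
NOISE: float = 0.05                        # half-width of the uniform noise, Eq. (40)


def sample_episode(rng: random.Random) -> Tuple[List[float], List[float]]:
    """Return (start, goal): a noisy start position and a randomly chosen goal."""
    goal = list(rng.choice(GOALS))  # assumption: uniform choice over the eight goals
    start = [c + rng.uniform(-NOISE, NOISE) for c in START]
    return start, goal


if __name__ == "__main__":
    rng = random.Random(0)  # seeded only to make the illustration repeatable
    print(sample_episode(rng))
```

Because the noise is symmetric around zero, the expected start position remains the nominal point.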

In total, the state has 49 dimensions; the shape and range of each component are listed in Table VII. Table VIII lists the action components and Table IX the reward settings.

TABLE VII: State details. Note that the Limo cannot execute a control duration correctly if it lies outside the listed range.

| Name | Shape | Space | Annotation |
| --- | --- | --- | --- |
| Limo Position | [2,] | [-1.5, 1.5] | in (X, Y), meters |
| Goal Position | [2,] | [-1.5, 1.5] | in (X, Y), meters |
| Linear Velocity | [1,] | [-1.0, 1.0] | |
| Steering Angle | [1,] | [-1.0, 1.0] | |
| Control duration | [1,] | [0.02, 0.5] | in seconds |
| Previous linear velocity | [1,] | [-1.0, 1.0] | |
| Angular velocity | [1,] | [-1.0, 1.0] | |
| 20 radar point positions | [40,] | [-1.5, 1.5] | reshaped from [20, 2], in (X, Y) |

TABLE VIII: Action details. Note that the Limo cannot execute a control duration correctly if it lies outside the listed range.

| Name | Shape | Space | Annotation |
| --- | --- | --- | --- |
| Control duration | [1,] | [0.02, 0.5] | in seconds |
| Linear velocity | [1,] | [-1.0, 1.0] | |
| Angular velocity | [1,] | [-1.0, 1.0] | |

TABLE IX: Reward Value Settings for Limo Environment

| Name | Value | Annotation |
| --- | --- | --- |
| cross_punish | -30.0 | crossing an enclosed region |
| success_reward | 500.0 | reaching the goal |
| dead_punish | -100.0 | leaving the map |
| $\delta$ | 0.2 | distance within which the Limo counts as close enough to a point, in meters |
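The state and action layouts in Tables VII and VIII can be written down as a small specification. The sketch below checks the total state dimension and keeps an action inside the ranges of Table VIII. The identifier names are hypothetical, and clipping is only one possible way of handling out-of-range commands; the tables do not say how the authors handle them.

```python
from typing import List, Sequence, Tuple

# (name, size, low, high) for each state component, following Table VII.
STATE_SPEC: List[Tuple[str, int, float, float]] = [
    ("limo_position", 2, -1.5, 1.5),
    ("goal_position", 2, -1.5, 1.5),
    ("linear_velocity", 1, -1.0, 1.0),
    ("steering_angle", 1, -1.0, 1.0),
    ("control_duration", 1, 0.02, 0.5),
    ("previous_linear_velocity", 1, -1.0, 1.0),
    ("angular_velocity", 1, -1.0, 1.0),
    ("radar_points", 40, -1.5, 1.5),  # 20 (X, Y) points, flattened
]

# (name, low, high) for each action component, following Table VIII.
ACTION_SPEC: List[Tuple[str, float, float]] = [
    ("control_duration", 0.02, 0.5),
    ("linear_velocity", -1.0, 1.0),
    ("angular_velocity", -1.0, 1.0),
]

STATE_DIM = sum(size for _, size, _, _ in STATE_SPEC)


def clip_action(action: Sequence[float]) -> List[float]:
    """Clamp each action component to its range (an assumed handling strategy)."""
    if len(action) != len(ACTION_SPEC):
        raise ValueError("action must have one value per entry in ACTION_SPEC")
    return [min(max(v, lo), hi) for v, (_, lo, hi) in zip(action, ACTION_SPEC)]


if __name__ == "__main__":
    print(STATE_DIM)
    print(clip_action([1.0, 2.0, -3.0]))
```

Summing the component sizes gives the 49-dimensional state stated in the text.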

Appendix D System details

TABLE X: Training PC details (key software versions)

| Name | Value |
| --- | --- |
| Nvidia Driver Version | 450.67 |
| Cuda Version | 11.8 |
| cuDNN Version | 8.9.7 |
| Python Version | 3.8 |
| Torch Version | 2.1.0 |
| ONNX Version | 1.13.1 |
| Gymnasium Version | 0.29.1 |

TABLE XI: Agilex Limo details (key software versions)

| Name | Value |
| --- | --- |
| Nvidia JetPack Version | 4.6.4 |
| Cuda Version | 10.2_r440 |
| cuDNN Version | 8.2.1 |
| TensorRT Version | 8.2.1.3 |
| Python Version | 3.6 |
| Pycuda Version | 2020.1 |
| ONNX Version | 1.13.1 |
| Limo Controller Version | 2.4 |

Our experiments used the Agilex Limo equipped with a Jetson Nano running JetPack version 4.6.4. While newer versions such as 6.0+ are available, compatibility and support limitations necessitated our use of JetPack 4.6.4. Specifically, the older versions of cuDNN and TensorRT required for our setup are no longer available for direct download, which may affect reproducibility. To assist others in replicating our results, we have provided custom-built support packages on our GitHub.

Appendix E Statistics for Energy and Time Consumptions

Descriptive statistics for the energy and time consumption of MOSEAC and SEAC are shown in Table XII; Tables XIII and XIV report the corresponding normality and paired-samples tests.

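TABLE XII: Descriptives

| Measure | N | Mean | SD | SE | COV |
| --- | --- | --- | --- | --- | --- |
| MOSEAC_Energy | 100 | 10.130 | 4.733 | 0.473 | 0.467 |
| SEAC_Energy | 100 | 11.660 | 4.740 | 0.474 | 0.407 |
| MOSEAC_Time | 100 | 4.341 | 2.038 | 0.204 | 0.469 |
| SEAC_Time | 100 | 4.456 | 2.044 | 0.204 | 0.459 |

TABLE XIII: Test of Normality (Shapiro-Wilk)

| Measure 1 | Measure 2 | W | p |
| --- | --- | --- | --- |
| MOSEAC_Energy | SEAC_Energy | 0.916 | < 0.001 |
| MOSEAC_Time | SEAC_Time | 0.993 | 0.894 |

TABLE XIV: Paired Samples T-Test (Wilcoxon signed-rank statistics)

| Measure 1 | Measure 2 | W | z | p |
| --- | --- | --- | --- | --- |
| MOSEAC_Energy | SEAC_Energy | 35.000 | -7.904 | < 0.001 |
| MOSEAC_Time | SEAC_Time | 502.000 | -6.956 | < 0.001 |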
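The SE and COV columns of Table XII are consistent with the usual definitions $\text{SE} = \text{SD}/\sqrt{N}$ and $\text{COV} = \text{SD}/\text{Mean}$. The short sketch below recomputes them from the reported N, Mean, and SD values; that these are the definitions used by the statistics software is an assumption, although the recomputed values match the table to three decimals.

```python
import math
from typing import Dict, Tuple


def standard_error(sd: float, n: int) -> float:
    """Standard error of the mean: SD / sqrt(N)."""
    return sd / math.sqrt(n)


def coefficient_of_variation(sd: float, mean: float) -> float:
    """Coefficient of variation: SD / Mean."""
    return sd / mean


# (N, Mean, SD) as reported in Table XII.
TABLE_XII: Dict[str, Tuple[int, float, float]] = {
    "MOSEAC_Energy": (100, 10.130, 4.733),
    "SEAC_Energy": (100, 11.660, 4.740),
    "MOSEAC_Time": (100, 4.341, 2.038),
    "SEAC_Time": (100, 4.456, 2.044),
}

if __name__ == "__main__":
    for name, (n, mean, sd) in TABLE_XII.items():
        se = standard_error(sd, n)
        cov = coefficient_of_variation(sd, mean)
        print(f"{name}: SE={se:.3f} COV={cov:.3f}")
```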

Appendix F Statistics for Compute Resources

TABLE XV: Descriptives

| Measure | N | Mean | SD | SE | COV |
| --- | --- | --- | --- | --- | --- |
| MOSEAC CPU usage (%) | 8 | 11.400 | 0.370 | 0.131 | 0.032 |
| SAC 10 Hz CPU usage (%) | 8 | 16.800 | 0.400 | 0.141 | 0.024 |
| SAC 60 Hz CPU usage (%) | 8 | 31.413 | 1.298 | 0.459 | 0.041 |
| MOSEAC GPU usage (%) | 8 | 2.800 | 0.283 | 0.100 | 0.101 |
| SAC 10 Hz GPU usage (%) | 8 | 13.787 | 0.242 | 0.085 | 0.018 |
| SAC 60 Hz GPU usage (%) | 8 | 27.863 | 0.463 | 0.164 | 0.017 |

TABLE XVI: Test of Normality (Shapiro-Wilk)

| Measure 1 | Measure 2 | W | p |
| --- | --- | --- | --- |
| MOSEAC CPU usage (%) | SAC 10 Hz CPU usage (%) | 0.947 | 0.685 |
| MOSEAC CPU usage (%) | SAC 60 Hz CPU usage (%) | 0.960 | 0.806 |
| MOSEAC GPU usage (%) | SAC 10 Hz GPU usage (%) | 0.835 | 0.067 |
| MOSEAC GPU usage (%) | SAC 60 Hz GPU usage (%) | 0.901 | 0.296 |

TABLE XVII: Paired Samples T-Test

| Measure 1 | Measure 2 | W | z | p |
| --- | --- | --- | --- | --- |
| MOSEAC CPU usage (%) | SAC 10 Hz CPU usage (%) | - | -2.521 | 0.004 |
| MOSEAC CPU usage (%) | SAC 60 Hz CPU usage (%) | - | -2.521 | 0.004 |
| MOSEAC GPU usage (%) | SAC 10 Hz GPU usage (%) | - | -2.521 | 0.007 |
| MOSEAC GPU usage (%) | SAC 60 Hz GPU usage (%) | - | -2.521 | 0.007 |

References

  • [1] R. Liu, F. Nageotte, P. Zanne, M. de Mathelin, and B. Dresp-Langley, “Deep reinforcement learning for the control of robotic manipulation: a focussed mini-review,” Robotics, vol. 10, no. 1, p. 22, 2021.
  • [2] J. Ibarz, J. Tan, C. Finn, M. Kalakrishnan, P. Pastor, and S. Levine, “How to train your robot with deep reinforcement learning: lessons we have learned,” The International Journal of Robotics Research, vol. 40, no. 4-5, pp. 698–721, 2021.
  • [3] N. Akalin and A. Loutfi, “Reinforcement learning approaches in social robotics,” Sensors, vol. 21, no. 4, p. 1292, 2021.
  • [4] B. Singh, R. Kumar, and V. P. Singh, “Reinforcement learning in robotic applications: a comprehensive survey,” Artificial Intelligence Review, vol. 55, no. 2, pp. 945–990, 2022.
  • [5] R. Majumdar, A. Mathur, M. Pirron, L. Stegner, and D. Zufferey, “Paracosm: A test framework for autonomous driving simulations,” in International Conference on Fundamental Approaches to Software Engineering.   Springer International Publishing Cham, 2021, pp. 172–195.
  • [6] E. Bregu, N. Casamassima, D. Cantoni, L. Mottola, and K. Whitehouse, “Reactive control of autonomous drones,” in Proceedings of the 14th Annual International Conference on Mobile Systems, Applications, and Services, 2016, pp. 207–219.
  • [7] A. Karimi, J. Jin, J. Luo, A. R. Mahmood, M. Jagersand, and S. Tosatto, “Dynamic decision frequency with continuous options,” in 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2023, pp. 7545–7552.
  • [8] D. Wang and G. Beltrame, “Deployable reinforcement learning with variable control rate,” arXiv preprint arXiv:2401.09286, 2024.
  • [9] ——, “Reinforcement learning with elastic time steps,” arXiv preprint arXiv:2402.14961, 2024.
  • [10] ——, “Moseac: Streamlined variable time step reinforcement learning,” arXiv preprint arXiv:2406.01521, 2024.
  • [11] AgileX Robotics, “Agilex limo - multi-modal mobile robot with ai modules,” https://www.globenewswire.com/, accessed: [date].
  • [12] P.-Y. Lajoie, B. Ramtoula, Y. Chang, L. Carlone, and G. Beltrame, “Door-slam: Distributed, online, and outlier resilient slam for robotic teams,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 1656–1663, 2020.
  • [13] S. Wan, Z. Gu, and Q. Ni, “Cognitive computing and wireless communications on the edge for healthcare service robots,” Computer Communications, vol. 149, pp. 99–106, 2020.
  • [14] Y. Bouteiller, S. Ramstedt, G. Beltrame, C. Pal, and J. Binas, “Reinforcement learning with random delays,” in International conference on learning representations, 2021.
  • [15] S. Amin, M. Gomrokchi, H. Aboutalebi, H. Satija, and D. Precup, “Locally persistent exploration in continuous control tasks with sparse rewards,” arXiv preprint arXiv:2012.13658, 2020.
  • [16] S. Park, J. Kim, and G. Kim, “Time discretization-invariant safe action repetition for policy gradient methods,” Advances in Neural Information Processing Systems, vol. 34, pp. 267–279, 2021.
  • [17] S. Sharma, A. Srinivas, and B. Ravindran, “Learning to repeat: Fine grained action repetition for deep reinforcement learning,” arXiv preprint arXiv:1702.06054, 2017.
  • [18] A. M. Metelli, F. Mazzolini, L. Bisi, L. Sabbioni, and M. Restelli, “Control frequency adaptation via action persistence in batch reinforcement learning,” in International Conference on Machine Learning.   PMLR, 2020, pp. 6862–6873.
  • [19] J. Lee, B.-J. Lee, and K.-E. Kim, “Reinforcement learning for control with multiple frequencies,” Advances in Neural Information Processing Systems, vol. 33, pp. 3254–3264, 2020.
  • [20] Y. Chen, H. Wu, Y. Liang, and G. Lai, “Varlenmarl: A framework of variable-length time-step multi-agent reinforcement learning for cooperative charging in sensor networks,” in 2021 18th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON).   IEEE, 2021, pp. 1–9.
  • [21] E. Even-Dar, S. Mannor, and Y. Mansour, “Action elimination and stopping conditions for reinforcement learning,” in Proceedings of the 20th International Conference on Machine Learning (ICML-03), 2003, pp. 162–169.
  • [22] S. Zhao, T. Zheng, D. Sui, J. Zhao, and Y. Zhu, “Reinforcement learning based variable damping control of wearable robotic limbs for maintaining astronaut pose during extravehicular activity,” Frontiers in Neurorobotics, vol. 17, p. 1093718, 2023.
  • [23] N. Gottipati et al., “To the max: Reinventing reward in reinforcement learning,” arXiv preprint arXiv:2402.01361, 2020. [Online]. Available: https://ar5iv.labs.arxiv.baiduqq.workers.dev/html/2402.01361
  • [24] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [25] T. G. Dietterich, “Hierarchical reinforcement learning with the maxq value function decomposition,” Journal of artificial intelligence research, vol. 13, pp. 227–303, 2000.
  • [26] S. Li, R. Wang, M. Tang, and C. Zhang, “Hierarchical reinforcement learning with advantage-based auxiliary rewards,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [27] I. Kacem, S. Hammadi, and P. Borne, “Pareto-optimality approach for flexible job-shop scheduling problems: hybridization of evolutionary algorithms and fuzzy logic,” Mathematics and computers in simulation, vol. 60, no. 3-5, pp. 245–276, 2002.
  • [28] M. S. Monfared, S. E. Monabbati, and A. R. Kafshgar, “Pareto-optimal equilibrium points in non-cooperative multi-objective optimization problems,” Expert Systems with Applications, vol. 178, p. 114995, 2021.
  • [29] T. Haarnoja, A. Zhou, K. Hartikainen, G. Tucker, S. Ha, J. Tan, V. Kumar, H. Zhu, A. Gupta, P. Abbeel et al., “Soft actor-critic algorithms and applications,” arXiv preprint arXiv:1812.05905, 2018.
  • [30] S. Banach, “Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales,” Fundamenta mathematicae, vol. 3, no. 1, pp. 133–181, 1922.
  • [31] V. Konda and J. Tsitsiklis, “Actor-critic algorithms,” Advances in neural information processing systems, vol. 12, 1999.
  • [32] OptiTrack, “Optitrack - motion capture systems,” https://www.optitrack.com, accessed: [date].
  • [33] M. Rybczak, N. Popowniak, and A. Lazarowska, “A survey of machine learning approaches for mobile robot control,” Robotics, vol. 13, no. 1, p. 12, 2024.
  • [34] Authors, “Supervised and unsupervised deep learning applications for visual slam in robotics,” in IMECE.   ASME, 2022, p. V003T04A010.
  • [35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” 2023.
  • [36] Authors, “Assessing the impact of distribution shift on reinforcement learning performance,” arXiv preprint arXiv:2402.03590, 2024. [Online]. Available: https://arxiv.longhoe.net/abs/2402.03590
  • [37] M. Katakura et al., “Reinforcement learning model with dynamic state space tested on target search tasks for monkeys: Extension to learning task events,” Frontiers in Robotics and AI, 2023. [Online]. Available: https://www.frontiersin.org/articles/10.3389/frobt.2023.00856/full
  • [38] “Jetson nano module,” https://developer.nvidia.com/embedded/jetson-nano, 2019.
  • [39] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, vol. 32, 2019, pp. 8024–8035. [Online]. Available: https://proceedings.neurips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html
  • [40] M. Towers, J. K. Terry, A. Kwiatkowski, J. U. Balis, G. d. Cola, T. Deleu, M. Goulão, A. Kallinteris, A. KG, M. Krimmel, R. Perez-Vicente, A. Pierré, S. Schulhoff, J. J. Tai, A. T. J. Shen, and O. G. Younis, “Gymnasium,” Mar. 2023. [Online]. Available: https://zenodo.org/record/8127025
  • [41] “Tensorrt,” https://developer.nvidia.com/tensorrt, 2021.
  • [42] M. Quigley, K. Conley, B. Gerkey, J. Faust, T. Foote, J. Leibs, R. Wheeler, and A. Y. Ng, “Ros: an open-source robot operating system,” in ICRA workshop on open source software, vol. 3, no. 3.2.   Kobe, Japan, 2009, p. 5.
  • [43] S. Macenski, T. Foote, B. Gerkey, C. Lalancette, and W. Woodall, “Ros 2: Towards a performance-centric and real-time robotics framework,” in Robotics: Science and Systems (RSS) Workshop on Real-time and Performance in Robotic Systems, Online, 2020.
  • [44] Docker Inc., Docker: Open Platform for Developing, Shipping, and Running Applications, 2013, available at https://docs.docker.com/.
  • [45] H. Ryan, M. Paglione, and S. Green, “Review of trajectory accuracy methodology and comparison of error measurement metrics,” in AIAA Guidance, Navigation, and Control Conference and Exhibit, 2004, p. 4787.
  • [46] M. Müller, “Dynamic time warping,” Information retrieval for music and motion, pp. 69–84, 2007.
  • [47] V. Pareto, Manual of Political Economy.   New York: Augustus M. Kelley, 1971.
  • [48] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: Nsga-ii,” IEEE transactions on evolutionary computation, vol. 6, no. 2, pp. 182–197, 2002.
  • [49] M. Riedmiller, R. Hafner, T. Lampe, M. Neunert, J. Degrave, T. van de Wiele, V. Mnih, and N. Heess, “Learning by playing - solving sparse reward tasks from scratch,” in Proceedings of the 35th International Conference on Machine Learning, 2018. [Online]. Available: https://arxiv.longhoe.net/abs/1802.09464
  • [50] M. G. Bellemare, S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos, “Unifying count-based exploration and intrinsic motivation,” arXiv preprint arXiv:1704.08302, 2017. [Online]. Available: https://arxiv.longhoe.net/abs/1704.08302
  • [51] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” The Journal of Machine Learning Research, vol. 17, no. 1, pp. 1334–1373, 2016. [Online]. Available: https://arxiv.longhoe.net/abs/1604.07316
Dong Wang (Member, IEEE) received his bachelor’s degree in electronic engineering from the School of Aviation, Northwestern Polytechnical University (NWPU), Xi’an, China, in 2017. He is pursuing his Ph.D. in the Department of Software Engineering at Polytechnique Montreal, Montreal, Canada. His research interests include reinforcement learning, computer vision, and robotics.
Giovanni Beltrame (Senior Member, IEEE) received the Ph.D. degree in computer engineering from Politecnico di Milano, Milan, Italy, in 2006. He worked as a Microelectronics Engineer with the European Space Agency, Paris, France, on a number of projects, ranging from radiation-tolerant systems to computer-aided design. Since 2010, he has been a Professor with the Computer and Software Engineering Department, Polytechnique Montreal, Montreal, QC, Canada, where he directs the MIST Lab. He has authored or coauthored more than 150 papers in international journals and conferences. His research interests include modeling and design of embedded systems, artificial intelligence, and robotics.