Learning Autonomous Race Driving with Action Mapping Reinforcement Learning

Yuanda Wang, Xin Yuan, Changyin Sun. School of Automation, Southeast University, Nanjing 210096, China; School of Artificial Intelligence, Anhui University, Hefei 230039, China
Abstract

Autonomous race driving poses a complex control challenge as vehicles must be operated at the edge of their handling limits to reduce lap times while respecting physical and safety constraints. This paper presents a novel reinforcement learning (RL)-based approach, incorporating the action mapping (AM) mechanism to manage state-dependent input constraints arising from limited tire-road friction. A numerical approximation method is proposed to implement AM, addressing the complex dynamics associated with the friction constraints. The AM mechanism also allows the learned driving policy to be generalized to different friction conditions. Experimental results in our developed race simulator demonstrate that the proposed AM-RL approach achieves superior lap times and better success rates compared to the conventional RL-based approaches. The generalization capability of the driving policy with AM is also validated in the experiments.

keywords:
Autonomous race driving, reinforcement learning, safety constraint, action mapping.

1 Introduction

Autonomous driving has been a hot topic in both research and industry in recent decades [1, 2, 3, 4]. Various environmental perception [5, 6, 7], planning [8, 9], and motion control [10, 11] approaches for autonomous driving have been proposed. Many of them have been successfully applied to regular cars on the market. Current autonomous driving techniques cover most driving scenarios, including highway driving, urban driving, and autonomous parking. In this work, we study autonomous driving in the race driving scenario, where advanced driving skills are required to fully utilize the car’s handling capability and minimize the lap time. In highway and urban driving scenarios, the vehicle is operated near the equilibrium point, and the dynamics can be approximated by a linear model. In contrast, in race driving, the vehicle is operated near its physical limits, and thus the complex nonlinear dynamics and constraints should be considered [12, 13]. These factors make autonomous race driving more challenging in terms of control than other driving scenarios.

The autonomous race driving controller should respect multiple input constraints due to physical and safety limits. The control inputs, including steering, acceleration, and deceleration, have static range limits. For RL-based methods, this constraint can be easily addressed by applying a sigmoid or tanh activation function in the output layer of the policy network. Moreover, the inputs are further restricted by the limit of tire-road friction, which depends on the car’s instantaneous motion states. For this kind of state-dependent constraint, most existing RL-based control methods use penalty terms in the reward function, which punish the current policy when the constraint is violated. Although this reward-shaping solution is simple, the learned policy tends to be relatively conservative. To address this state-dependent input constraint in RL-based control, we introduce the action mapping (AM) mechanism [14], which converts the raw actions output by the neural network into real control inputs that satisfy the state-dependent constraints through a pre-defined mapping function. This mechanism can effectively address the conservatism associated with penalty-based solutions.

In this paper, we develop a novel numerical AM-RL framework for autonomous race driving. The state-dependent control input constraint due to the limit of tire-road friction is addressed by the AM mechanism. By establishing a mapping from unconstrained network output actions to constrained control inputs, it can be guaranteed that the vehicle is always controlled within the friction limits. However, the vehicle dynamics related to the friction limit are rather complex, and it is very difficult to find a closed-form expression for the mapping function. Therefore, we further propose a numerical approximation method to implement AM. Then, we incorporate AM with the twin delayed deep deterministic policy gradient (TD3) algorithm [15] to train the race driving policy. For the states, we use a set of forward-observation points to indicate the curvature of the race track ahead. The reward function is specially designed to encourage the driving policy to maximize the car’s velocity along the race track while avoiding driving off the track or driving in the wrong direction. Finally, the proposed race control approach is evaluated in our developed race simulator. The race driving policy trained with TD3 and AM obtains shorter lap times and higher success rates than the other approaches in the comparison. The main contributions of this paper are summarized as follows:

  1. The AM mechanism is introduced to the RL-based autonomous race driving control problem to address the state-dependent input constraints arising from limited tire-road friction.

  2. The AM mechanism enables the RL-based controller to better utilize the maximum tire-road friction, addressing the conservatism often associated with conventional reward-shaping solutions. The AM mechanism also allows the learned driving policy to be generalized to different friction conditions by adapting the friction constraint in the action mapping function.

  3. A numerical approximation method is proposed to implement the AM mechanism, overcoming the challenges of dealing with complex nonlinear dynamics with constraints. This numerical method further extends the original AM mechanism, enabling it to address constrained RL problems of a more general form.

The remainder of this paper is organized as follows. Section 2 reviews related work on autonomous racing and safe RL. In Section 3, we briefly introduce the race vehicle’s dynamic model and the constraint of tire friction. In Section 4, the developed RL algorithm with the AM mechanism for race driving is explained in detail. The simulation experiments, results, and discussions are presented in Section 5. Finally, Section 6 concludes this paper.

2 Related Work

In recent decades, model predictive control (MPC) has been one of the major methods used to address the challenge of autonomous racing for both trajectory planning and motion control. In [16], the time-optimal driving control task with the constraint of race track boundaries is formulated as a nonlinear model predictive control (NMPC) problem using a generalized Gauss-Newton method. However, the tire-road friction constraint in race driving dynamics is not considered. This NMPC-based method is further extended in [17] where the lateral tire-road friction limit is described by a Pacejka model. Then, a Hessian sequential quadratic programming optimization algorithm is employed to solve the NMPC problem. Similar MPC-based methods are also presented in [18, 19, 20]. In those methods, the control strategies are developed based on the vehicle’s dynamic model and the race track’s geometrical model. To reduce the computational complexity of the optimization problems, the models are usually simplified and linearized. This could seriously affect the performance of the controller if applied to a real vehicle. To address this issue, the MPC is incorporated with some data-driven and learning-based approaches. The measurement data from real-world experiments are used to refine the dynamic model and optimize the controller. In [21], a learning-based MPC approach is proposed for autonomous racing. A relatively simple nominal vehicle model is built first. Then, the model is improved by Gaussian process regression based on the measurement data. Similarly, in [22], the affine time-varying prediction model is used to approximate the vehicle model. Moreover, the model predictive path integral (MPPI) control is proposed for autonomous racing in [23, 24]. 
With the help of the model learning ability, many control strategies have achieved success in real-world experiments with a variety of race cars, including a full-size electric formula race car [21], a 1:10 scale RC car [22], and a 1:5 scale rally race car [23, 24, 25].

More recently, the rapid advancement of machine learning has introduced novel solutions to complex control problems. Reinforcement learning (RL), a significant type of machine learning approach, has found extensive applications in addressing continuous control problems. An RL-based control algorithm directly trains a neural network-based control policy that maximizes a reward function by interacting online with the real system or by using saved trajectories. Therefore, prior knowledge of the system dynamics might not be needed, and the reward function is fully flexible. Many recently proposed RL algorithms, such as deep deterministic policy gradient (DDPG) [26], twin delayed deep deterministic policy gradient (TD3) [15], proximal policy optimization (PPO) [27], and soft actor-critic (SAC) [28], have shown notable capability in the control of complex nonlinear dynamic systems, like quadrotor helicopters and multi-joint manipulators [29, 30, 31].

In many physical control applications utilizing RL, the control policy must adhere to safety constraints. In the aforementioned RL algorithms, one could impose penalties on actions violating constraints to derive a safe policy. However, this approach is often inefficient and does not ensure safety throughout the training process. To address this issue, several safe RL algorithms have been proposed. In [32], the MPC method is combined with the RL algorithm to guarantee safe exploration in training. In [33], a general-purpose RL policy search algorithm named constrained policy optimization (CPO) is proposed based on the trust region method. Reward constrained policy optimization (RCPO) [34] is a similar safe RL algorithm that uses a penalty signal to guide the policy toward a constraint-satisfying solution. There is another kind of ‘plug-and-play’ style safety mechanism that works directly on top of the policy to correct the actions without changing the original RL algorithm, such as the safety layer (SL) technique [35] and the AM mechanism [14]. Although SL and AM share a similar structure, their specific approaches to building the constraint model and correcting unsafe actions are significantly different. A detailed comparison and discussion of the two methods, along with experiment results, are provided in Section 5. Moreover, in contrast to model-free RL algorithms, many safe RL algorithms or safety mechanisms require some prior knowledge of the environment, such as the dynamic model and the constraint function model. If these models are not directly accessible, they can be acquired through a model identification process guided by safety rules or human demonstration.

The RL-based control methods have also been widely applied in the fields of autonomous driving and racing. In [36], a rally race driving policy is learned with the A3C algorithm in an end-to-end way. The image from a forward-facing camera is directly fed into the policy network without any mediated perception. Similarly, a vision-based lateral control strategy for autonomous driving on a race track is developed in [11]. A convolutional neural network is built to extract track features from driver-view images. Then, a DDPG-based control policy gives control commands based on the track features and the car’s speed. Furthermore, a race driving agent named ‘GT Sophy’ has achieved super-human performance in the Gran Turismo game [37, 38]. The agent is trained using an improved SAC algorithm. Instead of using images, this agent’s observation of the track ahead is represented by a series of points along the centerline and each edge of the track. Similarly, Remonda et al. [39] propose to use the look-ahead curvature to represent the upcoming shape of the track, and train policies with diverse variants of DDPG (with long short-term memory, prioritized experience replay, multi-step target, etc.). Their proposed approach outperforms the state-of-the-art bots in the TORCS simulator [40] and even surpasses professional drivers in qualifying sessions in a professional simulator [41].

In the aforementioned RL-based autonomous racing approaches, the safety constraint is not explicitly considered. Due to the complexity of autonomous racing tasks, general-purpose safe RL algorithms are difficult to apply. For example, implementing the CPO algorithm in autonomous racing is quite challenging as it involves evaluating constraint functions to determine the feasibility of a certain control policy. Consequently, safe RL methods for autonomous racing are usually specially designed. Niu et al. [42] propose a two-stage safe RL racing approach. In the first stage, a rule-based safeguard module is employed to enforce the constraint during policy training at low speed. Then, in the second stage, the rule-based module is replaced with a data-driven module to develop a closed-form analytical safety solution at high speed. This approach is validated with the TORCS simulator and achieves zero safety violations. In [43], a viability theory-based safety supervisory architecture is proposed. The supervisor is built with a viability kernel based on the car’s dynamic model. It ensures the vehicle stays within the friction limit while maintaining recursive feasibility during the training process. Among the approaches discussed above, incorporating an additional safety module emerges as a common and logical solution for safe RL in autonomous racing. Our proposed approach, featuring the AM mechanism, also adopts this strategy.

3 Vehicle Model

In this section, the single-track vehicle model and the friction constraint for autonomous race driving are presented. The single-track model, also known as the ‘bicycle model’, is commonly used in car handling studies [12]. As shown in Fig. 1, based on the assumption that both front wheel steering systems have equal gear ratio, the dynamics and kinematics of the two front wheels and two rear wheels can be represented by two center wheels located at the centers of the front and rear axles. The pose and motion of the race car are defined under two coordinate frames: the earth-fixed frame ($o_e$-$X_eY_eZ_e$) and the vehicle body-fixed frame ($o_b$-$X_bY_bZ_b$).

The notations used to describe the vehicle model in Fig. 1 are listed as follows:

  • $v$: vehicle velocity at the center of gravity in the body-fixed frame.
  • $v_x, v_y$: vehicle longitudinal and lateral velocities in the body-fixed frame.
  • $\psi$: vehicle heading angle (yaw angle).
  • $\omega$: vehicle turning rate.
  • $\delta$: front wheel steering angle.
  • $\beta$: vehicle sideslip angle.
  • $l_f, l_r$: distances from the center of gravity to the front/rear axle.
  • $\alpha_f, \alpha_r$: tire sideslip angles of the front/rear wheels.
  • $F_{yf}, F_{yr}$: lateral tire forces on the front/rear wheels.
  • $F_{mx}$: traction force on the driven wheels.
  • $F_{bx}$: braking force on the wheels.

Figure 1: Single track vehicle model in earth-fixed frame and vehicle-fixed frame.

Moreover, the effects of longitudinal and lateral load transfer are omitted. The race track is assumed to be flat with a uniform friction coefficient. In the following, we present the vehicle’s longitudinal dynamics, lateral dynamics, and the tire friction constraint.

3.1 Longitudinal Dynamics

The longitudinal model describes the vehicle’s motion along the $X_b$-axis of the body-fixed frame. A force balance along the vehicle’s longitudinal direction yields:

$$m\dot{v}_x = F_{tx} - F_{aero} - F_{roll} \tag{1}$$

where $m$ is the total mass of the vehicle, $F_{tx}$ is the longitudinal tire force, $F_{aero}$ is the aerodynamic drag force, and $F_{roll}$ is the rolling resistance force. The longitudinal tire force is the friction force from the track that acts on the tire. It is the reaction force to the traction force or braking force.

3.1.1 Traction Force

The traction force is generated by the vehicle’s powertrain and acts on the driven wheels. Unlike a gasoline engine with a gearbox, an electric motor has special characteristics when generating torque at different speeds. If the motor speed is no greater than the base speed (rated speed), that is, $n_m \le n_b$, the motor works in the constant torque region, and the output torque can be directly controlled by the motor control signal. The traction force is

$$F_{mx} = \frac{K_m u_m}{R_w} \tag{2}$$

where $K_m$ is the motor torque coefficient, which is determined by the transmission efficiency and the reduction gear ratio, $R_w$ is the radius of the wheels, and $u_m \in [0, 1]$ is the normalized motor control input. If the motor speed is greater than the base speed, that is, $n_m > n_b$, the motor works in the constant power region. The output torque is limited by the maximum power $P_{\max}$. The traction force is

$$F_{mx} = \min\left[\frac{K_m u_m}{R_w},\ \frac{P_{\max}}{n_m R_w}\right] \tag{3}$$
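The two operating regions in Eqs. (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration rather than the simulator's implementation: the function name and the numerical values in the usage note are hypothetical, and the motor speed is assumed to be expressed in rad/s so that $P_{\max}/n_m$ is a torque.

```python
def traction_force(u_m, n_m, K_m, R_w, P_max, n_b):
    """Traction force F_mx on the driven wheels, following Eqs. (2)-(3).

    u_m   : normalized motor control input in [0, 1]
    n_m   : motor speed (assumed to be in rad/s for this sketch)
    K_m   : motor torque coefficient
    R_w   : wheel radius
    P_max : maximum motor power
    n_b   : base (rated) motor speed, same unit as n_m
    """
    if not 0.0 <= u_m <= 1.0:
        raise ValueError("u_m must lie in [0, 1]")
    # Constant torque region: force is proportional to the control input.
    torque_limited = K_m * u_m / R_w
    if n_m <= n_b:
        return torque_limited
    # Constant power region: force is additionally capped by P_max.
    power_limited = P_max / (n_m * R_w)
    return min(torque_limited, power_limited)
```

With the hypothetical values `K_m=400`, `R_w=0.3`, `P_max=80000`, and `n_b=200`, the two branches meet at the base speed, since $K_m/R_w = P_{\max}/(n_b R_w)$; above the base speed, the available force at full throttle decays as $1/n_m$.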

3.1.2 Braking Force

The braking force is generated by the vehicle’s braking system and acts on all wheels. We assume the braking force is proportional to the braking control signal, that is,

$$F_{bx} = K_b u_b \tag{4}$$

where $u_b \in [0, 1]$ is the normalized braking control signal and $K_b$ is the coefficient of the braking system.

3.1.3 Aerodynamic Drag Force

The equivalent aerodynamic drag force on a vehicle in a windless environment can be given by

$$F_{aero} = \frac{1}{2}\rho_A C_d A_f v_x^2 \tag{5}$$

where $\rho_A$ is the density of air, $C_d$ is the aerodynamic drag coefficient, and $A_f$ is the frontal area of the vehicle.

3.1.4 Rolling Resistance Force

The rolling resistance is roughly proportional to the normal load on the tires, that is,

$$F_{roll} = f_r m g \tag{6}$$

where $f_r$ is the rolling resistance coefficient, and $g$ is the acceleration due to gravity.

The longitudinal motion of the car is controlled by the motor control signal $u_m$ and the brake control signal $u_b$. However, the two control inputs cannot be applied simultaneously in real situations. Therefore, we combine them into a single control signal $u_x \in [-1, 1]$, where $u_x = 1$ denotes full motor power and $u_x = -1$ denotes full brake.
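As a rough illustration, the longitudinal force balance can be assembled into a single function of the combined input $u_x$. The sketch below rests on several assumptions that are not stated in the text: positive $u_x$ is interpreted as $u_m$ and negative $u_x$ as $u_b$, the longitudinal tire force is taken as $F_{tx} = F_{mx} - F_{bx}$, the drag is taken as quadratic in $v_x$ (the standard form of the drag law), the vehicle moves forward ($v_x > 0$), and all parameter values are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass
class LongitudinalParams:
    # All values are hypothetical placeholders, not the paper's parameters.
    m: float = 1200.0       # vehicle mass [kg]
    K_m: float = 400.0      # motor torque coefficient [N m]
    K_b: float = 12000.0    # braking system coefficient [N]
    R_w: float = 0.3        # wheel radius [m]
    P_max: float = 80000.0  # maximum motor power [W]
    n_b: float = 200.0      # base motor speed [rad/s] (unit is an assumption)
    rho_A: float = 1.2      # air density [kg/m^3]
    C_d: float = 0.3        # aerodynamic drag coefficient
    A_f: float = 2.0        # frontal area [m^2]
    f_r: float = 0.015      # rolling resistance coefficient
    g: float = 9.81         # gravitational acceleration [m/s^2]


def longitudinal_accel(u_x, v_x, n_m, p):
    """Return dv_x/dt from the force balance m*dv_x/dt = F_tx - F_aero - F_roll."""
    u_x = max(-1.0, min(1.0, u_x))
    # Assumed split of the combined signal: throttle and brake never overlap.
    u_m = max(u_x, 0.0)
    u_b = max(-u_x, 0.0)

    # Traction force: constant torque below base speed, power-capped above it.
    F_mx = p.K_m * u_m / p.R_w
    if n_m > p.n_b:
        F_mx = min(F_mx, p.P_max / (n_m * p.R_w))

    F_bx = p.K_b * u_b                                    # braking force
    F_tx = F_mx - F_bx                                    # assumed net tire force
    F_aero = 0.5 * p.rho_A * p.C_d * p.A_f * v_x ** 2     # aerodynamic drag
    F_roll = p.f_r * p.m * p.g                            # rolling resistance
    return (F_tx - F_aero - F_roll) / p.m
```

For example, `longitudinal_accel(1.0, 10.0, 100.0, LongitudinalParams())` evaluates full throttle at 10 m/s in the constant torque region, while `u_x = -1.0` gives a deceleration dominated by the braking force.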

3.2 Lateral Dynamics

The lateral model describes the vehicle’s translational motion along the $Y_b$-axis and rotational motion around the $Z_b$-axis. A force balance along the vehicle’s lateral direction yields

$$m\dot{v}_y = F_{yf} + F_{yr} \tag{7}$$

A moment balance around the $Z_b$-axis yields the rotational dynamics as

$$I_z\dot{\omega} = F_{yf} l_f - F_{yr} l_r \tag{8}$$

where $I_z$ is the vehicle body’s moment of inertia about the body-fixed $Z_b$-axis. The lateral tire forces $F_{yf}$ and $F_{yr}$ are proportional to their corresponding tire sideslip angles, that is

\begin{align}
F_{yf} &= 2C_{\alpha_f}\alpha_f = 2C_{\alpha_f}(\delta - \theta_{v_f}) \tag{9}\\
F_{yr} &= 2C_{\alpha_r}\alpha_r = 2C_{\alpha_r}(-\theta_{v_r}) \tag{10}
\end{align}

where $C_{\alpha_f}$ and $C_{\alpha_r}$ are the cornering stiffnesses of the front and rear tires, and $\theta_{v_f}$ and $\theta_{v_r}$ denote the directions of the velocity of the front and rear wheels, which can be obtained by

\begin{align}
\theta_{v_f} &= \arctan\frac{v_y + \omega l_f}{v_x} \tag{11}\\
\theta_{v_r} &= \arctan\frac{v_y - \omega l_r}{v_x} \tag{12}
\end{align}
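The chain from motion states to lateral accelerations in Eqs. (7)-(12) can be traced with a short Python sketch. The function names and the numbers in the usage note are hypothetical, and the sketch assumes forward motion ($v_x > 0$) so that the arctangent arguments are well defined.

```python
import math


def lateral_tire_forces(v_x, v_y, omega, delta, l_f, l_r, C_af, C_ar):
    """Lateral tire forces (F_yf, F_yr) from the linear tire model, Eqs. (9)-(12)."""
    if v_x <= 0.0:
        raise ValueError("this sketch assumes forward motion, v_x > 0")
    # Directions of the velocity vectors at the front and rear wheels.
    theta_vf = math.atan((v_y + omega * l_f) / v_x)
    theta_vr = math.atan((v_y - omega * l_r) / v_x)
    # Tire sideslip angles: only the front wheels are steered.
    alpha_f = delta - theta_vf
    alpha_r = -theta_vr
    # The factor 2 accounts for the two tires lumped into each center wheel.
    return 2.0 * C_af * alpha_f, 2.0 * C_ar * alpha_r


def lateral_derivatives(F_yf, F_yr, m, I_z, l_f, l_r):
    """Return (dv_y/dt, d omega/dt) from the force and moment balances, Eqs. (7)-(8)."""
    v_y_dot = (F_yf + F_yr) / m
    omega_dot = (F_yf * l_f - F_yr * l_r) / I_z
    return v_y_dot, omega_dot
```

As a usage example with hypothetical values, a car driving straight at `v_x = 20.0` with `delta = 0.05` rad and `C_af = 60000.0` produces a front lateral force of about 6000 N and no rear lateral force, so the initial response is a lateral acceleration together with a positive yaw acceleration.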

Finally, the vehicle’s velocity in the earth-fixed frame can be expressed with the following kinematic equations:

\begin{align}
\dot{X} &= v\cos(\psi + \beta) \tag{13}\\
\dot{Y} &= v\sin(\psi + \beta) \tag{14}
\end{align}

Moreover, the lateral motion of the vehicle can also be approximated by a simpler kinematic model with the assumption that the slip angles of the front and rear wheels are both zero. Then, the vehicle sideslip angle and turning rate are expressed by:

\begin{align}
\beta &= \arctan\Big(\frac{l_r\tan(\delta)}{l_f + l_r}\Big) \tag{15}\\
\dot{\psi} &= \omega = \frac{v\tan(\delta)\cos(\beta)}{l_f + l_r} \tag{16}
\end{align}

For an autonomous racing vehicle, the steering angle of the front wheels $\delta$ is actuated by the electric power steering system. The input can be regarded as a normalized steering angular speed signal $u_y \in [-1, 1]$: $u_y = 1$ or $-1$ makes the power steering system turn the steered wheels to the left or right at the maximum angular speed $\dot{\delta}_{\max}$, and $u_y = 0$ maintains the current steering angle.
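To make the kinematic model and the steering-rate input concrete, the following Python sketch advances the pose by one time step. It is only an illustration under stated assumptions: the forward Euler discretization, the time step `dt`, the function name, and the numerical values are not taken from the paper, and the speed $v$ is held fixed within the step.

```python
import math


def kinematic_step(X, Y, psi, delta, v, u_y, dt, l_f, l_r, delta_dot_max):
    """One forward Euler step of the kinematic single-track model, Eqs. (13)-(16).

    u_y is the normalized steering-rate input in [-1, 1]; the Euler scheme
    and the time step dt are assumptions made for this sketch.
    """
    u_y = max(-1.0, min(1.0, u_y))
    # Steering actuator: u_y commands the steering angular speed.
    delta = delta + u_y * delta_dot_max * dt

    wheelbase = l_f + l_r
    beta = math.atan(l_r * math.tan(delta) / wheelbase)       # Eq. (15)
    omega = v * math.tan(delta) * math.cos(beta) / wheelbase  # Eq. (16)

    # Earth-fixed position update uses the heading before this step's rotation.
    X = X + v * math.cos(psi + beta) * dt                     # Eq. (13)
    Y = Y + v * math.sin(psi + beta) * dt                     # Eq. (14)
    psi = psi + omega * dt
    return X, Y, psi, delta
```

With `u_y = 0` and `delta = 0`, the car simply travels `v * dt` along its heading; a positive `u_y` increases the steering angle and yields positive sideslip and yaw rate in this sign convention.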

3.3 Constraint of Tire Friction

The longitudinal and lateral forces that control the vehicle rely on the friction force or grip between the tire and the road. In real conditions, the friction force has an upper limit, which depends on a number of physical factors, including road surface material, tire size, tire pressure, tire temperature, etc. If we assume these physical factors remain the same during the race, we could use a friction circle model to describe the constraint of tire friction. As shown in Fig. 2, the friction circle model indicates that the total resultant force Fxysubscript𝐹𝑥𝑦\vec{F}_{xy}over→ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_x italic_y end_POSTSUBSCRIPT applied on the vehicle cannot exceed the maximum friction force, which is

|Fx+Fy|=|Fxy|<μmaxFzsubscript𝐹𝑥subscript𝐹𝑦subscript𝐹𝑥𝑦subscript𝜇subscript𝐹𝑧|\vec{F}_{x}+\vec{F}_{y}|=|\vec{F}_{xy}|<\mu_{\max}F_{z}| over→ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_x end_POSTSUBSCRIPT + over→ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_y end_POSTSUBSCRIPT | = | over→ start_ARG italic_F end_ARG start_POSTSUBSCRIPT italic_x italic_y end_POSTSUBSCRIPT | < italic_μ start_POSTSUBSCRIPT roman_max end_POSTSUBSCRIPT italic_F start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT (17)

where $\vec{F}_x$ and $\vec{F}_y$ are the longitudinal and lateral resultant force vectors, $F_z$ is the total vertical load on the tires, and $\mu_{\max}$ is the maximum tire-road friction coefficient. The aerodynamic lift force and downforce are omitted in the vehicle model, so we have $F_z = mg$. Dividing both sides by the vehicle mass $m$, the constraint can also be represented by $|\vec{a}_{xy}| < \mu_{\max} g$, where $\vec{a}_{xy}$ is the resultant acceleration measured from the vehicle. During the race, sudden and large braking or steering input, especially at high speed, could easily exceed the upper limit of grip, which could result in side slip or tail slip and bring the vehicle into an uncontrollable spin. To guarantee that the vehicle is controllable and stable in a race, we should always consider the constraint of tire friction when designing an autonomous race driving controller.
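As a minimal illustration of the acceleration form of the constraint, the following Python sketch checks whether a measured resultant acceleration lies inside the friction circle. The function names and the value $g = 9.81~\text{m/s}^2$ are assumptions made for illustration and do not come from the race simulator.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2 (assumed value)


def within_friction_circle(a_x, a_y, mu_max, g=G):
    """Return True if |a_xy| < mu_max * g, i.e. the grip limit is respected."""
    return math.hypot(a_x, a_y) < mu_max * g


def friction_usage(a_x, a_y, mu_max, g=G):
    """Fraction of the available grip in use; 1.0 lies exactly on the circle."""
    return math.hypot(a_x, a_y) / (mu_max * g)
```

For example, with $\mu_{\max} = 1$ a combined acceleration of $(3, 4)~\text{m/s}^2$ has magnitude $5~\text{m/s}^2$ and stays inside the circle, whereas $(6, 8)~\text{m/s}^2$ has magnitude $10~\text{m/s}^2$ and exceeds it.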

Figure 2: Constraint of tire friction described by the friction circle

4 Race Driving with AM-RL

In this section, the autonomous race driving policy is developed with an RL-based method. To guarantee that the driving policy satisfies the safety constraint of tire friction in both training and implementation, we apply our proposed AM mechanism to the RL algorithm. In the following, we first introduce the basics of RL and the race driving Markov decision process (MDP) model. Then, the AM mechanism for the tire friction constraint is described. Finally, we present the implementation process of applying AM to a specific RL algorithm to train an autonomous race driving policy.

4.1 Race Driving MDP Model

In this subsection, the race driving MDP model for RL-based autonomous racing approaches is presented. The basic idea of RL is to iteratively optimize the control policy of an agent based on the input-output experiences from a step-by-step agent-environment interaction system. The goal is to maximize the accumulated per-step reward. The agent-environment interaction system is generally modeled as an MDP, which is defined by a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$, where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $\mathcal{P}$ and $\mathcal{R}$ are the state transition model and the reward function, respectively, and $\gamma \in [0, 1]$ is the discount factor. The policy, denoted as $\pi$, maps states to actions. At each time step $t$ of the interaction, the agent in state $s_t \in \mathcal{S}$ takes action $a_t \in \mathcal{A}$ following its policy $\pi$, then receives a reward $r_{t+1}$ from the reward function $\mathcal{R}$, and finally enters a new state $s_{t+1}$ following the state transition model $\mathcal{P}$.
The accumulated reward from step $t$ onward is defined as the return, $R_t = \sum_{t'=t}^{T_{\max}} \gamma^{t'-t} r_{t'}$. The expected return following the policy $\pi$ can be represented by the state-action value function, which is defined as $Q^{\pi}(s, a) = \mathbb{E}_{\pi}[R_t \mid s_t, a_t]$.
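The return defined above can be evaluated with a simple backward recursion, $R_t = r_t + \gamma R_{t+1}$. The snippet below is a small sketch of that computation; the function name and the use of a plain Python list for the reward sequence are illustrative assumptions.

```python
def discounted_return(rewards, gamma):
    """Compute R_t = sum_{t'=t}^{T_max} gamma^(t'-t) * r_{t'}.

    `rewards` holds r_t, r_{t+1}, ..., r_{T_max} in time order.
    """
    ret = 0.0
    # Walk backward so each step applies R = r + gamma * R_next.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret
```

For three unit rewards and $\gamma = 0.5$, the result is $1 + 0.5 + 0.25 = 1.75$.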

The goal of race driving in this study is to finish a lap of the track safely in the shortest possible time. To solve the race driving task with RL, the race driving MDP model which expresses the interaction between an autonomous driving agent and a car-track environment is established first. In this MDP, the state transition model is determined by the aforementioned vehicle dynamic model. The fundamental principle of building the state and action space is to make the agent-environment interaction satisfy the Markov property to the largest extent. Therefore, all system states that could affect the race car’s motion should be included. Following this principle, the state, action, and reward function are defined and described as follows.

Figure 3: Part of the states defined in our race driving MDP.

State: The states of the race driving MDP are divided into two groups. The first group of states describes the race car’s pose and motion. In the first group of states, $v_x$, $\omega$, and $\delta$ denote the vehicle’s longitudinal velocity, turning rate, and front wheel steering angle, respectively. These states have been defined in Section 3. The car’s position and orientation with respect to the track are expressed by the relative cross-centerline distance $d_c \in [-1, 1]$ and the relative heading angle $\phi$. As shown in Fig. 3, $d_c$ is the distance from the car’s center of gravity to the centerline of the track. $\phi \in (-\pi, \pi]$ is the car’s heading angle with respect to the tangential direction of the centerline projected point (orthogonal projection of the car’s center of gravity onto the centerline). $\phi = 0$ means the car is precisely following the track direction, while $\phi > \pi/2$ or $\phi < -\pi/2$ means the car is traveling in the opposite direction of the track, also known as wrong-way driving. The second group of states is composed of $N_{\text{FO}}$ forward-observation vectors, which are used to indicate the curvature of the race track ahead. As shown in Fig. 3, the vectors $\vec{V}_{d_i}$ (red dotted arrow lines) all start from the car’s center point $(X, Y)$ and point to the forward-observation points $(X_{d_i}, Y_{d_i})$ on the centerline. The forward-observation points are determined by moving the car’s centerline projected point forward along the centerline over a distance of $d_i$. Specifically, a forward-observation vector $\vec{V}_{d_i}$ is defined as:

$$\vec{V}_{d_i} = \mathbf{R}^b_e(\psi)\,[X_{d_i} - X,~~Y_{d_i} - Y]^T \tag{18}$$

where $\mathbf{R}^b_e(\psi) \in \mathbb{R}^{2 \times 2}$ is the rotation matrix that transforms the forward-observation vector from the inertial frame to the body-fixed frame. Then, the $N_{\text{FO}}$ forward-observation vectors are concatenated to form a forward-observation feature $\mathbf{V}_{\text{FO}} = [\vec{V}_{d_1}, \vec{V}_{d_2}, \ldots, \vec{V}_{d_{N_{\text{FO}}}}] \in \mathbb{R}^{2 \times N_{\text{FO}}}$. The state vector of the race driving MDP is defined as:

$$s_t = [v_x, \omega, \delta, d_c, \phi, \mathbf{V}_{\text{FO}}] \tag{19}$$
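To make the construction of the forward-observation feature concrete, the following Python sketch rotates the inertial-frame offsets to the forward-observation points into the body-fixed frame. It assumes that $\psi$ is the yaw angle measured counter-clockwise from the inertial $X$ axis and that the body $x$ axis points forward, so that $\mathbf{R}^b_e(\psi) = \begin{bmatrix}\cos\psi & \sin\psi\\ -\sin\psi & \cos\psi\end{bmatrix}$. The function names and the list-of-pairs layout (the transpose of the $2 \times N_{\text{FO}}$ matrix) are illustrative choices.

```python
import math


def forward_observation_vector(x, y, psi, x_d, y_d):
    """Body-frame vector from the car center (x, y) to the point (x_d, y_d).

    Assumes psi is the yaw angle measured counter-clockwise from the
    inertial X axis, so the inertial-to-body rotation is
    [[cos psi, sin psi], [-sin psi, cos psi]].
    """
    dx, dy = x_d - x, y_d - y
    c, s = math.cos(psi), math.sin(psi)
    return (c * dx + s * dy, -s * dx + c * dy)


def forward_observation_feature(x, y, psi, points):
    """Stack the body-frame vectors for all forward-observation points."""
    return [forward_observation_vector(x, y, psi, xd, yd) for xd, yd in points]
```

Under these assumptions, a car at the origin heading along the inertial $+Y$ direction ($\psi = \pi/2$) sees a point at $(0, 5)$ as lying 5 m straight ahead, i.e., approximately $(5, 0)$ in the body frame.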

Action: The vehicle is controlled by the motor/brake control signal $u_x$ and steering control signal $u_y$. Therefore, we define two actions, $a_x$ and $a_y$, as the output from the race driving policy. The action vector is defined as:

$$a_t = [a_x, a_y] \tag{20}$$

The relationship between the actions and the control signals is further explained in Section 4.2.

Reward Function: The reward function guides the RL algorithm to optimize the policy toward a certain objective. To define a proper reward function for the autonomous race driving task, the objective of finishing a lap safely in the shortest time should be broken down into each time step. From the perspective of a single time step, the driving policy should maximize the car’s velocity along the track direction, which is denoted by $v_a$ as shown in Fig. 3. Therefore, our velocity reward is set as $r_{\text{vel}} = v_a = v_x \cos\phi$. Furthermore, to guide the driving policy to operate safely, we set negative rewards $r_{\text{out}} = -100$ for driving off the track, $r_{\text{ww}} = -100$ for driving in the wrong-way direction, and $r_{\text{tf}} = -100$ for violating the tire friction constraint. If the car is safely operated inside the track, $r_{\text{out}} = r_{\text{ww}} = r_{\text{tf}} = 0$. Then, the reward function is defined as:

$$r_t = r_{\text{vel}} + r_{\text{out}} + r_{\text{ww}} + r_{\text{tf}} \tag{21}$$

It is noted that the negative reward $r_{\text{tf}}$ only applies to conventional RL algorithms that are not specially designed to address the friction constraint. For the proposed AM-RL approaches, $r_{\text{tf}}$ is not needed.
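The reward in Eq. (21) can be written as a short function. The sketch below is illustrative only: the boolean flags that signal each violation and the `use_friction_penalty` switch (set to `False` to mimic the AM-RL setting in which $r_{\text{tf}}$ is not needed) are hypothetical names, and how the simulator detects each event is not specified here.

```python
import math

PENALTY = -100.0  # value of r_out, r_ww, and r_tf when the event occurs


def step_reward(v_x, phi, off_track=False, wrong_way=False,
                friction_violated=False, use_friction_penalty=True):
    """Per-step reward r_t = r_vel + r_out + r_ww + r_tf."""
    # Velocity along the track direction.
    r_vel = v_x * math.cos(phi)
    r_out = PENALTY if off_track else 0.0
    r_ww = PENALTY if wrong_way else 0.0
    # The friction penalty is only used by conventional RL baselines.
    r_tf = PENALTY if (friction_violated and use_friction_penalty) else 0.0
    return r_vel + r_out + r_ww + r_tf
```

For a car traveling at $20~\text{m/s}$ aligned with the track ($\phi = 0$), the reward is $20$ when no event occurs and $-80$ when the car leaves the track in that step.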

4.2 AM Mechanism for Tire Friction Constraint

The autonomous racing control policy should generate proper control inputs that satisfy the constraint of tire friction to prevent the car from entering uncontrollable states. According to the car’s dynamic model, the friction constraint changes dynamically and depends on the car’s speed and steering angle. In this subsection, we propose a numerical AM mechanism to tackle this type of state-dependent input constraint of tire friction.

The AM mechanism addresses the state-dependent constraint by establishing a mapping between an unconstrained virtual policy $\pi_{\text{V}}$ and a constrained real policy $\pi_{\text{R}}$. The virtual policy is represented by the neural network which directly maps states to actions, and it does not consider the constraints. Next, the virtual policy is converted to its corresponding constrained real policy that satisfies the state-dependent constraint. The real policy directly interacts with the real system, while the virtual policy is optimized with the RL algorithm using the interaction experiences.

More specifically, we define the unconstrained virtual policy as $a = \pi_{\text{V}}(s)$, where $s \in \mathcal{S}$ is the state vector. Here we define $\mathcal{S}$ as a compact subspace of $\mathbb{R}^{N_s}$, and $N_s$ is the dimension of the state space. $a \in \mathcal{A}$ is the virtual action, and $\mathcal{A}$ is the unconstrained action space. To be compatible with the policy represented by neural networks, in the following, the unconstrained action space $\mathcal{A}$ is defined as $[-1, 1]^{N_a}$, which is a compact subspace of $\mathbb{R}^{N_a}$, and $N_a$ is the dimension of the action space. Let $G$ be a compact set-valued map from $\mathcal{S}$ to the power set $P(\mathbb{R}^{N_a})$, and $G$ is characterized by its graph $\text{Graph}(G) = \{(x, y) \mid x \in \mathcal{S}, y \in G(x)\}$.
In this work, we use $G(s)$ to denote the control input space that satisfies the state-dependent constraint. Then, the constrained real policy is defined as $u = \pi_{\text{R}}(s)$, where $u \in G(s)$ is the real control input.

Let $\mathcal{F}$ be the set of all continuous functions mapping from $\mathcal{S}$ to $\mathcal{A}$. Let $\mathcal{H}$ be the set of all continuous functions $\pi: \mathcal{S} \rightarrow \bigcup_{s \in \mathcal{S}} G(s)$ satisfying $\pi(s) \in G(s)$ for every $s \in \mathcal{S}$. We assume that the union of the graphs of all maps $\pi$ in $\mathcal{H}$ is equal to $\text{Graph}(G)$, that is, $\text{Graph}(G) = \bigcup_{\pi \in \mathcal{H}} \{(s, \pi(s)) \mid s \in \mathcal{S}\}$. Then, the connection between the virtual unconstrained policy $\pi_{\text{V}} \in \mathcal{F}$ and the real constrained policy $\pi_{\text{R}} \in \mathcal{H}$ can be described as a map $T: \mathcal{H} \rightarrow \mathcal{F}$.
According to the action mapping theorem (Theorem 1 in [14]), the map $T: \mathcal{H} \rightarrow \mathcal{F}$ exists if and only if there exists a continuous map $h: \text{Graph}(G) \rightarrow \mathcal{A}$ such that, for each $s \in \mathcal{S}$, the map $h_s: G(s) \rightarrow \mathcal{A}$ is a homeomorphism of $G(s)$ with $\mathcal{A}$, where $h_s$ is defined as $h_s(a) = h(s, a)$.

In the following, the tire friction constraint in race driving is studied under the framework of AM. The unconstrained virtual policy is constructed as $a_t = \pi_{\text{V}}(s_t)$, where $a_t = [a_x, a_y]$ is the virtual action vector, and $a_x, a_y \in [-1, 1]$ denote the virtual longitudinal action and virtual lateral action, respectively. This virtual policy is represented by a neural network and optimized through RL algorithms. The constrained real policy is denoted by $u_t = \pi_{\text{R}}(s_t)$, where $u_t = [u_x, u_y]$ is the real control input vector, and $u_x$, $u_y$ are the normalized motor/brake and steering control inputs, respectively.
The constrained real policy can be obtained by the policy space mapping $T: \mathcal{H} \rightarrow \mathcal{F}$, which is realized through its corresponding continuous mapping function:

$$u_t = h(\hat{s}_t, a_t) \tag{22}$$

This continuous mapping function maps the action $a_t$ from the unconstrained virtual policy to the real control input $u_t \in G(\hat{s}_t)$ that satisfies the tire friction constraint. $\hat{s}_t = [v_x, \delta]$ is a sub-vector of the state vector that contains only the state variables related to the tire friction constraint. An illustrative example of the action mapping, with the race car’s longitudinal velocity $v_x = 15.4~\text{m/s}$ and front wheel steering angle $\delta = 7.9~\text{deg}$, is given in Fig. 4. The boundary of the unconstrained action space $\mathcal{A}$ is shown on the left, and the action $a_t$ can be freely selected inside the boundary. The boundary of the constrained control input space $G(\hat{s}_t)$ is shown on the right, and the shape of the boundary depends on the race car’s current states $\hat{s}_t$. Any control input vector located outside the boundary will violate the tire friction constraint. In Fig. 4, we give two action vector examples, $a_{t1} = [-0.75, 0.25]$ and $a_{t2} = [0.75, -0.75]$, marked with blue arrows. If we directly use those two actions as the control inputs to the real system, $a_{t2}$ satisfies the constraint while $a_{t1}$ fails. Therefore, in this task, an intuitive explanation of the action mapping is that it maps $a_t$ to its corresponding $u_t$ and guarantees that $u_t$ is inside the boundary of $G(\hat{s}_t)$.

Figure 4: Action mapping example at state $v_x = 15.4~\text{m/s}$, $\delta = 7.9~\text{deg}$. The virtual action vector examples $a_{t1}$ and $a_{t2}$ and the boundary are shown on the left. The real control input examples $u_{t1}$ and $u_{t2}$ are shown on the right.

However, due to the complexity of the vehicle dynamics and constraints, it is rather difficult to give a closed-form expression of the mapping function. Therefore, we provide a numerical approximation method to implement the action mapping mechanism for a complex dynamic system with state-dependent input constraints. The basic idea is to shorten those overrun action vectors to the boundary of $G(\hat{s})$ while keeping the same direction. For the convenience of processing the vectors, both action vectors and control input vectors are temporarily converted to polar coordinate form. The action and control inputs are expressed by $[\rho_a, \vartheta_a]$ and $[\rho_u, \vartheta_u]$, respectively. $\rho_a$ and $\rho_u$ denote the lengths; $\vartheta_a$ and $\vartheta_u$ denote the directions. Here, we define a control input space boundary function that gives the maximum admissible length along direction $\vartheta$ in the current car state, that is,

$$\bar{\rho} = \Psi(v_x, \delta, \vartheta) \tag{23}$$

In the following, we present a numerical method to determine the boundary function. Let $v_i = \{v_1, v_2, \ldots, v_{N_v}\}$ be $N_v$ evenly spaced velocity values over $[0, v_{\max}]$, where $v_{\max}$ is the maximum speed. Let $\delta_j = \{\delta_1, \delta_2, \ldots, \delta_{N_\delta}\}$ be $N_\delta$ evenly spaced front wheel steering angle values over $[-\delta_{\max}, \delta_{\max}]$, where $\delta_{\max}$ is the maximum steering angle.
Let $\vartheta_k = \{\vartheta_1, \vartheta_2, \ldots, \vartheta_{N_\vartheta}\}$ be $N_\vartheta$ evenly spaced action vector directions over $(-\pi, \pi]$. Then, we iterate over all combinations of the race car’s state $(v_i, \delta_j)$ with all control input directions and lengths, and check the car’s response to the control inputs using the dynamic model. More specifically, for each car state $(v_i, \delta_j)$, we iterate over all action directions. For a given direction $\vartheta_k$, we apply the control input vector to the dynamic system while increasing the length $\rho$ until the control input fails to satisfy the tire friction constraint, which determines the maximum length $\bar{\rho}$ at car state $(v_i, \delta_j)$ and action direction $\vartheta_k$.
Based on this sampling method, we obtain a three-dimensional look-up table to represent the boundary function, that is, $\bar{\rho}_{i,j,k} = \Psi(v_i, \delta_j, \vartheta_k)$. To demonstrate the boundary function built from sampling, we visualize the boundary function at $v_x = 15.4~\text{m/s}$ in Fig. 5. The interior of the 3D surface is the admissible space for control inputs $[u_x, u_y]$ at different steering angles $\delta$. Fig. 6 gives another view of the control input boundaries at velocity $v_x = 15.4~\text{m/s}$ with steering angle $\delta$ in the range $[4.6, 8.1]$ deg.
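The sampling procedure can be sketched in Python as three nested loops with an inner search over the length $\rho$. This is an illustration under stated assumptions: `satisfies_constraint` is a hypothetical callable standing in for the dynamic-model check, the polar convention $[u_x, u_y] = \rho[\cos\vartheta, \sin\vartheta]$ is assumed, the length is increased in fixed increments `rho_step`, and the search also stops when a control input would leave the normalized box $[-1, 1]^2$.

```python
import math


def linspace(lo, hi, n):
    """n evenly spaced values over [lo, hi], endpoints included (n >= 2)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def direction_grid(n):
    """n evenly spaced directions over (-pi, pi]."""
    return [-math.pi + 2.0 * math.pi * (k + 1) / n for k in range(n)]


def build_boundary_table(v_grid, delta_grid, theta_grid,
                         satisfies_constraint, rho_step=0.01):
    """Return table[i][j][k], the largest sampled admissible length.

    satisfies_constraint(v, delta, u_x, u_y) is a hypothetical stand-in
    for simulating the dynamic model and checking the friction limit.
    """
    table = []
    for v in v_grid:
        plane = []
        for delta in delta_grid:
            row = []
            for theta in theta_grid:
                c, s = math.cos(theta), math.sin(theta)
                n = 0  # number of admissible increments found so far
                while True:
                    rho = (n + 1) * rho_step
                    u_x, u_y = rho * c, rho * s
                    outside_box = max(abs(u_x), abs(u_y)) > 1.0
                    if outside_box or not satisfies_constraint(v, delta, u_x, u_y):
                        break
                    n += 1
                row.append(n * rho_step)
            plane.append(row)
        table.append(plane)
    return table
```

The table resolution is bounded by `rho_step` in length and by the grid spacing in $v$, $\delta$, and $\vartheta$, so finer grids trade a longer offline computation for a tighter approximation of the boundary.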

Figure 5: Admissible control input space at $v_{x}=15.4~\text{m/s}$ with the full range of steering angles.
Figure 6: Boundaries of the constrained control input space with different steering angles $\delta\in[4.6,8.1]~\text{deg}$ at velocity $v_{x}=15.4~\text{m/s}$.

Due to the discrete nature of the boundary function, we cannot directly find the maximum length $\bar{\rho}$ for arbitrary $v_{x}$, $\delta$, and $\vartheta_{a}$, which are continuous values. Therefore, we use an off-the-shelf linear multidimensional interpolation method to approximate the maximum length. The approximated boundary function is denoted by $\hat{\rho}=\hat{\Psi}(v_{x},\delta,\vartheta)$, where $\hat{\rho}$ is the approximate value of $\bar{\rho}$. Finally, we summarize the numerical action mapping procedure with the discrete boundary function in Algorithm 1. With this numerical action mapping method, an action that would violate the constraint is mapped to a safe control input just inside the boundary. This method fundamentally prevents the race car from entering uncontrollable states while making full use of the maximum tire-road friction.
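To make the interpolation step concrete, the sketch below hand-writes a trilinear interpolation on a regular grid. It is an illustration only: the paper uses an off-the-shelf routine, the function names `_bracket` and `interp3` are hypothetical, and clamping queries to the grid edges (without wrapping the direction axis around $\pm\pi$) is a simplifying assumption.

```python
from bisect import bisect_right

def _bracket(grid, x):
    """Return (i, w) with x = (1 - w) * grid[i] + w * grid[i + 1], clamped to the grid."""
    x = min(max(x, grid[0]), grid[-1])
    i = min(bisect_right(grid, x) - 1, len(grid) - 2)
    w = (x - grid[i]) / (grid[i + 1] - grid[i])
    return i, w

def interp3(v_grid, d_grid, th_grid, table, v, d, th):
    """Trilinear interpolation of table[i][j][k] at the query point (v, d, th)."""
    i, wv = _bracket(v_grid, v)
    j, wd = _bracket(d_grid, d)
    k, wt = _bracket(th_grid, th)
    rho = 0.0
    # Weighted sum over the 8 corners of the enclosing grid cell.
    for di, a in ((0, 1.0 - wv), (1, wv)):
        for dj, b in ((0, 1.0 - wd), (1, wd)):
            for dk, c in ((0, 1.0 - wt), (1, wt)):
                rho += a * b * c * table[i + di][j + dj][k + dk]
    return rho

# Toy table (assumption): a linear function, which trilinear interpolation reproduces.
v_grid = [0.0, 10.0, 20.0]
d_grid = [-0.2, 0.0, 0.2]
th_grid = [-1.0, 0.0, 1.0]
table = [[[v + 2.0 * d + 3.0 * th for th in th_grid] for d in d_grid] for v in v_grid]
rho_hat = interp3(v_grid, d_grid, th_grid, table, 5.0, 0.1, -0.5)
```

Because the toy table is linear in all three variables, the interpolated value matches the underlying function at any query point inside the grid.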

Algorithm 1 Numerical Action Mapping with Discrete Boundary Function
1:  Load discrete boundary function $\bar{\rho}_{i,j,k}=\Psi(v_{i},\delta_{j},\vartheta_{k})$
2:  Input car speed $v_{x}$ and steering angle $\delta$
3:  Input unconstrained virtual action $a_{t}=[a_{x},a_{y}]$
4:  Convert virtual action to polar coordinate form $[a_{x},a_{y}]\rightarrow[\rho_{a},\vartheta_{a}]$
5:  Calculate maximum length $\hat{\rho}=\hat{\Psi}(v_{x},\delta,\vartheta_{a})$ using a multidimensional interpolation method on the discrete boundary function
6:  if $\rho_{a}\leq\hat{\rho}$ then
7:     constrained control $u_{t}=[\rho_{a},\vartheta_{a}]$
8:  else
9:     constrained control $u_{t}=[\hat{\rho},\vartheta_{a}]$
10:  end if
11:  Convert to Cartesian coordinate form $u_{t}=[\hat{\rho},\vartheta_{a}]~\text{or}~[\rho_{a},\vartheta_{a}]\rightarrow[u_{x},u_{y}]$
12:  Output constrained control $u_{t}=[u_{x},u_{y}]$
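The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal sketch, assuming the interpolated boundary function is available as a callable `psi_hat(v_x, delta, theta)`; that name and the constant toy boundary used in the example are hypothetical.

```python
import math

def map_action(a_x, a_y, v_x, delta, psi_hat):
    """Map an unconstrained virtual action to a constrained control input."""
    # Step 4: Cartesian -> polar
    rho_a = math.hypot(a_x, a_y)
    theta_a = math.atan2(a_y, a_x)
    # Step 5: maximum admissible length in this direction
    rho_hat = psi_hat(v_x, delta, theta_a)
    # Steps 6-10: keep the action if it is inside the boundary, else shrink it
    rho_u = min(rho_a, rho_hat)
    # Step 11: polar -> Cartesian
    return rho_u * math.cos(theta_a), rho_u * math.sin(theta_a)

# Toy boundary (assumption): the same maximum length 0.5 in every direction.
def toy_psi_hat(v_x, delta, theta):
    return 0.5

inside = map_action(0.3, 0.0, 15.4, 0.1, toy_psi_hat)
outside = map_action(3.0, 4.0, 15.4, 0.1, toy_psi_hat)
```

The direction of the action is preserved in both cases; only the length is reduced when the virtual action lies outside the boundary.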

4.3 Implementation of RL Training with AM

The AM mechanism can address state-dependent constraints for a variety of policy gradient-based RL algorithms with a parameterized policy and a continuous action space, such as DDPG, TD3, PPO, and SAC. In this subsection, as an example, we incorporate the numerical approximation method of AM into the TD3 algorithm to train an autonomous race driving policy.

TD3 is a deterministic policy-based reinforcement learning algorithm that employs an actor-critic architecture. The actor and critic functions are represented by fully connected neural networks. A block diagram of the network structure is shown in Fig. 7. The actor network, which represents the unconstrained virtual policy, is denoted as $\pi^{\mu}(s)$, where $\mu$ denotes the network parameters. As illustrated in Fig. 7, the actor network has two hidden layers, each containing $256$ hidden nodes with the ReLU activation function. For the output layer, we use the tanh function to limit the actions to the range between $-1$ and $1$. The actor network's target network shares the same structure as the actor network, and it is denoted as $\pi^{\mu'}(s)$. The state-action value function $Q^{\pi^{\mu}}(s,a)$ is approximated by the critic networks. Unlike the conventional DDPG algorithm, the TD3 algorithm introduces a pair of critic networks to mitigate value function overestimation. The two critic networks are denoted by $Q^{w_{1}}(s,a)$ and $Q^{w_{2}}(s,a)$. They share the same structure, which is shown in Fig. 7. The input of the critic networks is the concatenation of the state vector and the action vector. The hidden layers have the same structure as those of the actor network, while the output layer uses a linear function. The target networks of the critic networks are denoted by $Q^{w'_{1}}(s,a)$ and $Q^{w'_{2}}(s,a)$.

Figure 7: Diagram of TD3 with AM for race driving policy training and the structure of the actor and critic networks.

The critic networks are trained using a batch training method with a replay buffer. At each time step $t$, a state transition experience $e_{t}=(s_{t},a_{t},r_{t+1},s_{t+1})$ is saved to the replay buffer $\mathcal{D}$. During each training iteration, a batch of experiences $(s_{i},a_{i},r_{i},s'_{i})_{i=1,2,\ldots,N}$ is randomly selected from the replay buffer. Here, $r_{i}$ and $s'_{i}$ are the reward received and the state reached after taking $a_{i}$ at state $s_{i}$. Next, the two critic networks are trained separately to minimize the TD-error using the loss function:

$$\mathcal{L}(w_{j})=\frac{1}{N}\sum^{N}_{i=1}\Big[y_{i}-Q^{w_{j}}(s_{i},a_{i})\Big]^{2}, \tag{24}$$

where $j=1,2$, and $y_{i}$ is the target value from the target critic networks, that is,

$$y_{i}=r_{i}+\gamma\min_{j=1,2}Q^{w'_{j}}\big(s'_{i},\pi^{\mu'}(s'_{i})+\epsilon\big). \tag{25}$$

The target value for the critic networks is taken from the smaller of the two target network outputs, which reduces the overestimation of the state-action value. $\epsilon$ is a small noise sampled from a clipped Gaussian distribution $\text{clip}(\mathcal{N}(0,\sigma_{\epsilon}^{2}),-c,c)$, which smooths the target policy to avoid overfitting. After the target values are determined, the gradient of the loss is given by:

$$\nabla_{w_{j}}\mathcal{L}(w_{j})=-\frac{2}{N}\sum^{N}_{i=1}\Big[y_{i}-Q^{w_{j}}(s_{i},a_{i})\Big]\nabla_{w_{j}}Q^{w_{j}}(s_{i},a_{i}). \tag{26}$$

The parameters of the two critic networks are adjusted according to:

$$w_{j}\leftarrow w_{j}-\alpha_{w}\nabla_{w_{j}}\mathcal{L}(w_{j}), \tag{27}$$

where $\alpha_{w}$ is a small updating rate.
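The critic update can be illustrated with a small sketch in plain Python. It is not the authors' PyTorch implementation: the critic here is a hypothetical linear function `q_linear` of the concatenated state and action so that the gradient of the mean squared TD error in (24) can be written out by hand, and the clipping bound `c=0.5` is an assumed value that the text does not specify.

```python
import random

def td3_target(r, s_next, target_actor, target_q1, target_q2,
               gamma=0.99, sigma_eps=0.2, c=0.5, rng=random):
    """Target value (25): reward plus discounted minimum of the two target critics."""
    # Clipped Gaussian noise is added to the target action.
    a_next = [a + max(-c, min(c, rng.gauss(0.0, sigma_eps))) for a in target_actor(s_next)]
    return r + gamma * min(target_q1(s_next, a_next), target_q2(s_next, a_next))

def q_linear(w, s, a):
    """Toy linear critic (assumption): Q = w . [s, a]."""
    return sum(wi * xi for wi, xi in zip(w, list(s) + list(a)))

def critic_loss(w, batch, targets):
    """Mean squared TD error over the batch, as in (24)."""
    return sum((y - q_linear(w, s, a)) ** 2 for (s, a), y in zip(batch, targets)) / len(batch)

def critic_step(w, batch, targets, alpha_w):
    """One gradient-descent step on the loss, as in (27)."""
    n = len(batch)
    grad = [0.0] * len(w)
    for (s, a), y in zip(batch, targets):
        x = list(s) + list(a)
        err = y - q_linear(w, s, a)
        for idx, xi in enumerate(x):
            grad[idx] -= 2.0 * err * xi / n  # d/dw of err^2 averaged over the batch
    return [wi - alpha_w * gi for wi, gi in zip(w, grad)]

# Made-up batch of (state, action) pairs with made-up targets.
batch = [((1.0,), (0.5,)), ((0.0,), (1.0,))]
targets = [2.0, 1.0]
w0 = [0.0, 0.0]
w1 = critic_step(w0, batch, targets, alpha_w=0.1)
y_demo = td3_target(1.0, (0.0,), lambda s: [0.5],
                    lambda s, a: 10.0, lambda s, a: 8.0, sigma_eps=0.0)
```

One step moves the weights from `[0, 0]` to `[0.2, 0.2]` and lowers the loss on the batch. In `y_demo`, the smaller of the two toy critic values (8.0) is the one that enters the target.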

The policy's performance is assessed based on the expected return starting from the initial time step, that is, $J(\mu)=\mathbb{E}_{\pi^{\mu}}[R_{1}]$. According to the deterministic policy gradient theorem [44], the gradient of the policy performance is given by:

$$\nabla_{\mu}J(\mu)=\mathbb{E}_{\pi^{\mu}}\Big[\nabla_{\mu}\pi^{\mu}(s)\nabla_{a}Q^{\pi^{\mu}}(s,a)\big|_{a=\pi^{\mu}(s)}\Big]. \tag{28}$$

Then, the actor network is trained using the approximated gradient of the performance function based on the critic network $Q^{w_{1}}(s,a)$ and the same batch of transition experiences. The approximation of the policy gradient is given by:

$$\nabla_{\mu}J(\mu)\approx\frac{1}{N}\sum^{N}_{i=1}\nabla_{\mu}\pi^{\mu}(s_{i})\nabla_{a}Q^{w_{1}}(s_{i},a_{i})\big|_{a_{i}=\pi^{\mu}(s_{i})}. \tag{29}$$

The virtual control policy is improved by updating the parameter $\mu$ using gradient ascent with a small updating rate $\alpha_{\mu}$:

$$\mu\leftarrow\mu+\alpha_{\mu}\nabla_{\mu}J(\mu). \tag{30}$$
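As a toy illustration of the sampled policy gradient and the gradient-ascent update, the sketch below uses a scalar linear policy $\pi^{\mu}(s)=\mu s$ and a hand-written critic gradient. Both are assumptions chosen so that the chain rule can be evaluated without an automatic differentiation library; the function names are hypothetical.

```python
def actor_step(mu, states, grad_a_q, alpha_mu):
    """One gradient-ascent step for the scalar linear policy pi(s) = mu * s."""
    n = len(states)
    grad = 0.0
    for s in states:
        a = mu * s                       # action proposed by the current policy
        grad += s * grad_a_q(s, a) / n   # d pi / d mu = s, chained with dQ/da
    return mu + alpha_mu * grad

# Toy critic (assumption): Q(s, a) = -(a - 2 s)^2, which is maximized by a = 2 s.
def toy_grad_a_q(s, a):
    return -2.0 * (a - 2.0 * s)

mu = 0.0
for _ in range(200):
    mu = actor_step(mu, [0.5, 1.0, 1.5], toy_grad_a_q, alpha_mu=0.05)
```

Repeated steps drive `mu` toward 2, the parameter for which the toy policy selects the action that maximizes the toy critic at every state.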

After each training iteration, the target networks are soft-updated with small update rates $\beta_{w}$ and $\beta_{\mu}$:

$$w'_{j}\leftarrow\beta_{w}w_{j}+(1-\beta_{w})w'_{j}, \tag{31}$$
$$\mu'\leftarrow\beta_{\mu}\mu+(1-\beta_{\mu})\mu'. \tag{32}$$
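The soft update is a per-parameter weighted average of the online and target parameters. A minimal sketch, with parameters stored as plain Python lists (an assumption made for illustration) and a hypothetical function name:

```python
def soft_update(target, online, beta):
    """Blend online parameters into target parameters with rate beta."""
    return [beta * w + (1.0 - beta) * w_t for w, w_t in zip(online, target)]

target = [0.0, 1.0]
online = [1.0, 1.0]
target = soft_update(target, online, beta=0.005)
```

With a small rate such as 0.005, the target parameters move only 0.5% of the way toward the online parameters per update, which keeps the target values in (25) slowly varying.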

The overall training procedure of the race driving policy with TD3-AM is summarized in Algorithm 2. In line 8, Gaussian noise $n_{t}\sim\mathcal{N}(0,\sigma^{2}_{\mu})$ is added to the virtual action to promote exploration. In lines 16-20, a delayed policy update method is used to further improve the stability of the training process. In particular, the actor network and the target networks are updated once after every $T_{\text{delay}}$ updates of the critic networks. Furthermore, the policy is saved and evaluated at a fixed interval of training iterations.

Algorithm 2 Race Driving Policy Training with TD3-AM
1:  Randomly initialize the parameters of the actor network, twin critic networks, and their target networks
2:  Initialize replay buffer $\mathcal{D}$
3:  Load race driving simulation environment
4:  for episode = $1$ to MaxEpisode do
5:     Set initial car state
6:     Observe initial state $s_{1}$
7:     for time step $t=1$ to MaxStep do
8:        Generate virtual action $a_{t}=\pi^{\mu}(s_{t})+n_{t}$
9:        Map to the real control input $u_{t}=h(\hat{s}_{t},a_{t})$
10:        Apply control input $u_{t}$ to car dynamic model
11:        Observe new state $s_{t+1}$ and receive reward $r_{t+1}$
12:        Store the transition $(s_{t},a_{t},r_{t+1},s_{t+1})$ in $\mathcal{D}$
13:        Select a batch of $N$ experiences randomly from $\mathcal{D}$
14:        Calculate target values according to (25)
15:        Update critic networks according to (26) and (27)
16:        if $t \bmod T_{\text{delay}}=0$ then
17:           Update actor network according to (29) and (30)
18:           Update critic target networks according to (31)
19:           Update actor target network according to (32)
20:        end if
21:        if car drives off-track or wrong-way then
22:           break
23:        end if
24:     end for
25:  end for
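The control flow of the inner loop of Algorithm 2 can be sketched as below. This is a skeleton only: all callables (`policy`, `map_action`, `env_step`, `update_critics`, `update_actor_and_targets`) are hypothetical placeholders for the components described above, and waiting until the buffer holds at least one full batch before training is an added assumption that Algorithm 2 does not state.

```python
import random
from collections import deque

def run_episode(s0, policy, map_action, env_step, update_critics,
                update_actor_and_targets, buffer, max_steps,
                batch_size=256, t_delay=2, sigma_mu=0.1, rng=random):
    """Run one training episode and return the number of steps taken."""
    s = s0
    for t in range(1, max_steps + 1):
        a = [ai + rng.gauss(0.0, sigma_mu) for ai in policy(s)]  # noisy virtual action
        u = map_action(s, a)                    # constrained control input
        s_next, r, done = env_step(s, u)
        buffer.append((s, a, r, s_next))        # the virtual action is stored
        if len(buffer) >= batch_size:           # assumption: wait for a full batch
            batch = rng.sample(buffer, batch_size)
            update_critics(batch)
            if t % t_delay == 0:                # delayed actor and target updates
                update_actor_and_targets(batch)
        if done:                                # off-track or wrong-way
            return t
        s = s_next
    return max_steps

# Dummy components (assumptions) that only count how often each update runs.
counts = {"critic": 0, "actor": 0}

def count_critic(batch):
    counts["critic"] += 1

def count_actor(batch):
    counts["actor"] += 1

def toy_env_step(s, u):
    return [s[0] + 1.0], 0.0, s[0] >= 9.0       # episode ends at the 10th step

buffer = deque(maxlen=1000)
steps = run_episode([0.0], lambda s: [0.0], lambda s, a: a, toy_env_step,
                    count_critic, count_actor, buffer, max_steps=100,
                    batch_size=4, t_delay=2)
```

In this toy run the episode lasts 10 steps; the critics are updated at steps 4 to 10, and the actor and target networks only at the even steps among them.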

5 Simulations and Results

In this section, the proposed RL-based race driving strategies are evaluated in our built race simulation environment. In the following, we first introduce the simulation environment and training details. Then, the evaluation results are presented and discussed.

5.1 Simulation Environment and Training Details

The simulation environment is developed following the vehicle model given in Section 3. The fourth-order Runge-Kutta (RK4) method is used to numerically solve the differential equations. The simulation time step is $0.01$ s. The car model used in the simulation is an all-electric mid-size sedan. The physical parameters of the car model are given in Table 1.
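For reference, one step of the classical fourth-order Runge-Kutta scheme can be written as follows. The sketch uses a toy decay equation rather than the vehicle model of Section 3, and the function names are hypothetical.

```python
import math

def rk4_step(f, t, x, h):
    """Advance the state x of dx/dt = f(t, x) by one step of size h."""
    k1 = f(t, x)
    k2 = f(t + h / 2.0, [xi + h / 2.0 * ki for xi, ki in zip(x, k1)])
    k3 = f(t + h / 2.0, [xi + h / 2.0 * ki for xi, ki in zip(x, k2)])
    k4 = f(t + h, [xi + h * ki for xi, ki in zip(x, k3)])
    return [xi + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]

def decay(t, x):
    """Toy dynamics (assumption): dx/dt = -x."""
    return [-x[0]]

x, h = [1.0], 0.01          # same step size as the simulator
for n in range(100):        # integrate over one second
    x = rk4_step(decay, n * h, x, h)
```

After 100 steps of 0.01 s, `x[0]` agrees closely with the exact solution $e^{-1}$ of the toy equation.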

Table 1: Car Model Parameters

| Parameter | Value | Unit |
|---|---|---|
| Vehicle mass $m$ | 1860 | kg |
| Front axle distance $l_{f}$ | 1.17 | m |
| Rear axle distance $l_{r}$ | 1.77 | m |
| Tire rolling radius $R_{w}$ | 0.31 | m |
| Tire cornering stiffness $C_{\alpha_{f}}, C_{\alpha_{r}}$ | 54,500 | N/rad |
| Tire rolling resistance coefficient $f_{r}$ | 0.015 | - |
| Maximum steering angle | 35 | deg |
| Yaw moment of inertia $I_{z}$ | 4000 | $\text{kg}\cdot\text{m}^{2}$ |
| Aerodynamic drag coefficient $C_{d}$ | 0.3 | - |
| Density of air $\rho_{A}$ | 1.2258 | $\text{kg}/\text{m}^{3}$ |
| Vehicle frontal area $A_{f}$ | 2.05 | $\text{m}^{2}$ |
| Maximum motor power $P_{\text{max}}$ | 125 | kW |
| Motor torque coefficient $K_{m}$ | 1,550 | $\text{N}\cdot\text{m}$ |
| Braking force coefficient $K_{b}$ | 16,422 | N |
| Maximum friction coefficient $\mu_{\max}$ | 1.15 | - |
| Acceleration due to gravity $g$ | 9.81 | $\text{m}/\text{s}^{2}$ |

We build two race tracks for race driving policy training and evaluation. The layout of track-A is given in Fig. 8(a). It is a simple testing track with only five corners. The length and width of track-A are $860$ m and $20$ m, respectively. The layout of track-B is given in Fig. 8(b). The layout is modeled after the Ruisi Circuit located in Beijing, China. Track-B has ten corners. Its total length is $1400$ m and its width is $10$ m. In Fig. 8, the finish line and running direction of the tracks are marked with a red line across the track and a white arrow, respectively.

(a) Track-A (length: 860 m; width: 20 m)
(b) Track-B (length: 1400 m; width: 10 m)
Figure 8: Layout of the two race tracks in the racing simulation environment.

A typical race track can be regarded as a sequence of basic corners connected together. Therefore, the driving skill of cornering is the key to minimizing the lap time. As shown in Fig. 9, a basic corner can be divided into three parts: in-straight (red part), curve (blue part), and out-straight (green part). The solid green line indicates the racing line, which is the theoretical fastest line through the corner. Obviously, the racing line significantly reduces the tightness of the corner by using the in-straight and the out-straight parts, which allows for the highest speed possible to run through this corner. There are four key points on the racing line. The braking point is the position to start applying the brake before the corner. The turn-in point is the position to start steering into the corner. The apex is located on the inside of a corner, and it is the aiming point after the turn-in point. The apex is also the position to start acceleration. The exit point is where the car once again reaches the outside of the corner, and it is also the aiming point after the car passes the apex. The ideal racing line and key points depend on the curvature of the corner, the condition of the previous corner and the following corner, and the handling performance of the car being driven.

Figure 9: A right angle turn corner with racing line and key points.

We conduct the following simulation experiments on a workstation running the Ubuntu 20.04 operating system, with an Intel Core i9-13900K CPU, 32 GB of RAM, and an NVIDIA GeForce RTX 4090 GPU. The neural networks for the RL algorithms are built with the PyTorch framework and run on the GPU.

The driving policy is trained following Algorithm 2. To implement the numerical AM mechanism, we first determine the discrete boundary function based on the race car's dynamics with the friction constraint. The numbers of discretization points are set as $N_{v}=N_{\delta}=N_{\vartheta}=200$. The discretization step sizes for speed, steering angle, and action vector angle are $0.15$ m/s, $0.35$ deg, and $1.8$ deg, respectively. We can then apply the AM mechanism during policy training following Algorithm 1. The training parameters are listed in Table 2. At the beginning of each training episode, we place the car at a random position on a straight part of the track with a random speed $v_{x}\in[0,30]$ m/s. The initial $\omega$, $\phi$, and $\delta$ are all zero. At each time step, the states are normalized to $[0,1]$ or $[-1,1]$ with respect to their minimum and maximum values. For the forward-observation feature state $\mathbf{V}_{\text{FO}}$, we use $12$ forward observation points at varying distances: $d_{i}\in\{10,20,30,40,60,80,100,120,140,160,180,200\}$ meters. The maximum duration of one episode is $10{,}000$ steps, equivalent to $100$ seconds.
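The min-max normalization of the states can be sketched as follows. The function name is hypothetical, clamping out-of-range values is an added assumption, and the example bounds (speed in $[0,30]$ m/s, steering angle within $\pm 35$ deg from Table 1) are used for illustration only; the text does not list which bounds are applied to each state.

```python
def normalize(x, lo, hi, symmetric=False):
    """Scale x from [lo, hi] to [0, 1], or to [-1, 1] if symmetric is True."""
    x = min(max(x, lo), hi)           # assumption: clamp out-of-range values
    unit = (x - lo) / (hi - lo)
    return 2.0 * unit - 1.0 if symmetric else unit

speed_n = normalize(15.0, 0.0, 30.0)                      # non-negative quantity
steer_n = normalize(-35.0, -35.0, 35.0, symmetric=True)   # signed quantity
```

Signed quantities such as the steering angle map naturally to $[-1,1]$, while non-negative quantities such as speed map to $[0,1]$.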

Table 2: Training Parameters

| Parameter | Value |
|---|---|
| Discount factor $\gamma$ | 0.99 |
| Initial learning rate for critic network $\alpha_{w}$ | 0.0003 |
| Initial learning rate for actor network $\alpha_{\mu}$ | 0.0003 |
| Updating rate for target critic network $\beta_{w}$ | 0.005 |
| Updating rate for target actor network $\beta_{\mu}$ | 0.005 |
| Batch size $N$ | 256 |
| Replay buffer size $M$ | 1,000,000 |
| Exploration noise variance $\sigma_{\mu}$ | 0.1 |
| Policy smoothing variance $\sigma_{\epsilon}$ | 0.2 |
| Update delay step $T_{\text{delay}}$ | 2 |

To further explore the capabilities of the proposed TD3-AM algorithm, we introduce several comparative approaches, including the conventional TD3 algorithm, the proximal policy optimization (PPO) algorithm, and the safety layer technique (SL). Given that TD3 operates as an off-policy RL algorithm, we opt for the on-policy PPO algorithm as an additional baseline RL algorithm, and we also apply AM to the PPO algorithm. In our experiments, the PPO algorithm is implemented with the Stable-Baselines3 library [45]. The structure of the actor and critic networks and the batch size used for PPO are the same as the TD3-AM. Other training parameters are also carefully tuned. The safety layer is a state-of-the-art safe exploration technique for RL with state constraints [35]. As we introduced before, SL shares a similar idea with AM, which is adjusting the original action from the actor network to satisfy the constraints. Therefore, it is also compatible with both on-policy and off-policy RL algorithms. In the following experiments, SL is built following the procedures given in [35]. We first collect transition data from the simulation environment where the race car is randomly initialized and given random actions. The transitions of constraint violations are specially marked. Then, the immediate-constraint function of SL is approximated with a two-hidden-layer neural network based on the collected data through supervised learning. After that, we apply SL to both TD3 and PPO algorithms.

5.2 Evaluation Results on Track-A

For the evaluation on track-A, we train and test all six approaches mentioned above: 1) TD3; 2) TD3-SL; 3) TD3-AM; 4) PPO; 5) PPO-SL; 6) PPO-AM. Each approach is trained for 20M iterations. After every 10k iterations, the driving policy is evaluated in an evaluation episode where the race car starts from the finish line and runs for 100 s. The accumulated reward and lap time of the episode are recorded. For each approach, we perform 10 independent training trials starting with randomly initialized actor and critic networks.

Figure 10 (first row) displays the learning progress, as indicated by the average accumulated rewards of the evaluation episodes. The solid lines and shaded areas depict the average values and standard deviations of the rewards for the same evaluation episode across the 10 trials. The curves show that the learning progress of the TD3-based approaches is more stable and consistent than that of the PPO-based approaches. Moreover, the TD3-AM approach achieves the highest average reward.

Figure 10: Evaluation results of the six approaches from 10 independent trials. First row: average rewards with standard deviations; second row: max, median, and min lap times; third row: success rate of finishing two laps.

In motorsport, the lap time of a ‘flying lap’ is a major criterion for evaluating a race car’s performance and a race driver’s skill. A ‘flying lap’ starts when the car crosses the finish line of the previous lap at high speed. In our evaluation episode, if the driving policy finishes at least two laps without any fault, the second lap is regarded as a ‘flying lap’ and the episode is identified as a ‘success episode’. To demonstrate the performance improvement of the driving policies during the learning process, we aggregate the results of every 10 evaluation episodes from the 10 trials as a group. The lap time and success rate of all groups are given in the second and third rows of Fig. 10. The median and the max/min values of the lap time in each evaluation group are shown by solid lines and shaded areas, respectively. If there is no success episode in an evaluation group, the lap time is not shown. The success rate of the TD3-AM policy finally reaches 90%, while those of the other policies are all below 50%. The PPO-based policies obtain lower success rates than the TD3-based policies, and they also take more training iterations before achieving a success episode. In terms of lap time, however, the PPO and PPO-SL policies are faster than the corresponding TD3 and TD3-SL policies. More importantly, the driving policies with AM achieve even shorter lap times than the other policies.
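
The per-group statistics described above can be sketched as follows. This is an illustrative reading of the protocol, not the authors' evaluation code: the function name `summarize_group`, the episode record format, and the sample numbers are all assumptions made for the example.

```python
from statistics import median


def summarize_group(episodes):
    """Summarize one evaluation group.

    episodes: list of (laps_completed, second_lap_time_s) tuples, where the
    lap time is None if the second lap was not finished without a fault.
    """
    # A success episode finishes at least two laps; its second lap is the flying lap.
    flying = [t for laps, t in episodes if laps >= 2 and t is not None]
    summary = {"success_rate": len(flying) / len(episodes)}
    if flying:
        summary.update(median=median(flying), best=min(flying), worst=max(flying))
    else:
        # No success episode in this group: no lap time is reported.
        summary.update(median=None, best=None, worst=None)
    return summary


# Hypothetical group of five episodes (a real group would hold 100).
group = [(2, 37.2), (3, 36.9), (1, None), (2, 38.0), (0, None)]
summary = summarize_group(group)
```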

Furthermore, we introduce other criteria for comparison, including 1) the best lap time achieved in all evaluation episodes; 2) the average cumulative number of episodes terminated due to friction constraint violation; and 3) the training speed in seconds per 1k iterations. These values are listed in Table 3. From the best lap time, we find that both TD3 and PPO improve significantly after using AM. The best lap times of TD3-AM and PPO-AM are 22% and 5% shorter than those of their corresponding baselines. In comparison, the improvements from introducing SL are smaller. The TD3-AM driving policy achieves the best lap time of all policies. From the number of constraint violations, we observe that introducing SL reduces the number of friction constraint violations by about half, while the AM mechanism avoids violations entirely. The PPO-based approaches train approximately three times faster than the TD3-based approaches. Applying AM or SL adds about 0.4 s per 1k iterations for TD3 and about 0.2 s per 1k iterations for PPO. Note that processing the action mapping or safety layer for a single step takes less than 0.2 ms.

Although both SL and AM are designed to address the constraint, only the AM mechanism achieves zero violations in all episodes. The main reason is that the numerical AM mechanism is more efficient at handling constraints with complex input coupling and nonlinear dynamics. From the vehicle model given above, the friction constraint function includes nonlinear couplings among speed, steering angle, and control inputs. To deal with the constraint, the SL technique approximates the friction constraint function with a model that is linear in the action, $g(s_c)^T [a_x, a_y]$, where the coefficient $g(s_c)$ is a function of the car motion states $s_c$, extracted with a pre-trained neural network. Then, this constraint function is solved via an analytical optimization method. This method works well for decoupled systems. However, when there is coupling among inputs, this form of linear approximation does not fit the actual dynamics. We observe that the fitting accuracy of the pre-trained network $g(s_c)$ is relatively low, which causes some of the actions to fail to satisfy the constraint. Nevertheless, SL still works in some areas of the state space and effectively reduces the number of constraint violations.
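
To make the linear correction concrete, the sketch below applies the closed-form solution of the safety layer in [35] for a single linearized constraint $c + g^T a \le C$: the action is shifted along $-g$ by the smallest multiplier that restores feasibility. The function name `safety_layer` and all numeric values are illustrative assumptions; in the experiments, $g(s_c)$ is produced by a trained network rather than fixed by hand.

```python
def safety_layer(action, g, c_now, c_limit):
    """Correct `action` so that the linearized constraint c_now + g . a <= c_limit holds.

    Closed form for one constraint: a* = a - lam * g,
    with lam = max(0, (c_now + g . a - c_limit) / (g . g)).
    """
    g_dot_a = sum(gi * ai for gi, ai in zip(g, action))
    g_dot_g = sum(gi * gi for gi in g)
    if g_dot_g == 0.0:
        return list(action)  # the action has no modeled effect on the constraint
    lam = max(0.0, (c_now + g_dot_a - c_limit) / g_dot_g)
    return [ai - lam * gi for ai, gi in zip(action, g)]


# Hypothetical numbers: the raw action overshoots the limit in the linear model.
corrected = safety_layer(action=[1.0, 0.5], g=[2.0, 1.0], c_now=8.0, c_limit=9.0)
# An action that is already feasible in the linear model is returned unchanged.
unchanged = safety_layer(action=[0.1, 0.1], g=[2.0, 1.0], c_now=8.0, c_limit=9.0)
```

The correction is only as good as the linear model: the corrected action satisfies $c + g^T a \le C$ exactly, but if the true constraint is nonlinear in the inputs, the real system may still violate it.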

Table 3: Best lap time, constraint violations, and training speed of the six approaches on track-A
| Driving Policy | Best Lap Time (s) | Number of Constraint Violations | Training Speed (s/1k iterations) |
| --- | --- | --- | --- |
| TD3 | 47.28 | 252 | 1.186 |
| TD3-SL | 44.26 | 123 | 1.572 |
| TD3-AM | 36.94 | 0 | 1.579 |
| PPO | 39.67 | 304 | 0.441 |
| PPO-SL | 39.54 | 135 | 0.644 |
| PPO-AM | 37.76 | 0 | 0.620 |
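
The relative improvements quoted for Table 3 can be reproduced with a few lines of arithmetic. The snippet below is only a check on the table values; the dictionary and the helper name `improvement_pct` are introduced here for illustration. Rounded to whole percent, it gives the 22% (TD3-AM) and 5% (PPO-AM) reductions stated in the text, and shows that the SL variants improve by less.

```python
# Best lap times in seconds, copied from Table 3.
best_lap = {
    "TD3": 47.28, "TD3-SL": 44.26, "TD3-AM": 36.94,
    "PPO": 39.67, "PPO-SL": 39.54, "PPO-AM": 37.76,
}


def improvement_pct(baseline, variant):
    """Percentage reduction in best lap time of `variant` relative to `baseline`."""
    return 100.0 * (best_lap[baseline] - best_lap[variant]) / best_lap[baseline]


for baseline in ("TD3", "PPO"):
    for suffix in ("SL", "AM"):
        variant = f"{baseline}-{suffix}"
        print(f"{variant}: {improvement_pct(baseline, variant):.1f}% shorter than {baseline}")
```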

The trajectories of the fastest flying lap by all six driving policies are shown in Fig. 11. The speed is illustrated by the color of the trajectory. The speed at the exit point of corners C1-C5 is marked on the figure. From the trajectories and speeds, we find that all driving policies have mastered some basic skills of race driving. The policies have learned to decelerate upon approaching a corner and accelerate after exiting a corner. They have also learned to use the racing line to drive at a higher speed through a corner. Comparing these trajectories reveals that the PPO-AM and TD3-AM policies exhibit smoother and more extended trajectories than the others, and their speeds at the exit points are consistently higher. Both the PPO-AM and TD3-AM policies showcase excellent driving skills, with highly similar trajectories.

Figure 11: Best flying lap trajectories of six race driving policies in track-A. The speed is illustrated by the color of the trajectory.

To comprehensively demonstrate the driving behavior of the learned policies, we show the speed, throttle/brake control, steering control, and resultant acceleration in Fig. 12. Note that the $x$-axis denotes the track distance, which is the distance from the finish line to the vehicle’s position along the centerline of the race track. The exit points of corners C1-C5 are marked in Fig. 12. From the speed and throttle/brake curves, the TD3-AM policy tends to maintain a proper speed in corners. In contrast, the TD3 policy continuously applies the brake, which makes the car run unnecessarily slowly in corners. Furthermore, in corners C2, C3, and C4, the TD3 policy outputs fast-changing throttle/brake and steering control signals, which would upset the car’s balance in real driving. In comparison, the control signals from the TD3-AM policy are much smoother. From the resultant acceleration, both driving policies satisfy the friction constraint, which is indicated by a red horizontal line ($1.15g$) in the figure. In each corner, the acceleration of the TD3-AM policy is closer to the limit, which means that the TD3-AM policy makes better use of the vehicle's maneuverability.

For conventional RL-based approaches (TD3 and PPO), constraint violations are penalized in the reward function, discouraging control policies from approaching the friction limit. This reward-shaping solution transforms the original constrained optimization problem into a non-equivalent multi-objective optimization problem in which the constraint conditions become penalty terms, which makes the control policy conservative. In comparison, the AM mechanism directly prevents the policy from exceeding the friction limit at the level of the action space, without changing the objective of the optimization problem. Our method thus reconciles the conflict between the need for higher speed and the need to satisfy the friction constraint in turning maneuvers. In summary, the AM mechanism significantly improves the efficiency of RL algorithms in optimizing the race driving policy.

Figure 12: Speed, throttle/brake control, steering control, and resultant acceleration of TD3-AM and TD3 driving policies in track-A.

5.3 Evaluation Results on Track-B

The capability of the proposed TD3-AM driving policy is further evaluated on track-B. The generalization ability of the AM mechanism is also demonstrated and discussed. The corners in track-B are more complex and varied than those in track-A, and the track is only half as wide. The six approaches are trained and evaluated on track-B with identical training configurations, except that each approach is trained for 50M iterations instead of 20M. The learned driving policies are also evaluated every 1k iterations. However, both TD3 and TD3-SL fail to learn a driving policy capable of completing a lap. The best lap times of the other four policies are listed in Table 4. Similar to the results on track-A, the introduction of the AM mechanism also significantly enhances performance on track-B. Given the similarities in driving behavior between PPO-AM and TD3-AM, we focus on the TD3-AM driving policy in the following.

Table 4: Best lap time comparison on track-B
| Driving Policy | Best Lap Time (min:sec) |
| --- | --- |
| PPO | 1:26.23 |
| PPO-SL | 1:18.38 |
| PPO-AM | 1:09.55 |
| TD3-AM | 1:04.16 |

The trajectory of the fastest flying lap by the trained TD3-AM driving policy is shown in Fig. 13. The color of the trajectory illustrates the speed. The speeds at the exit points of corners C1-C10 are marked in the figure. Although track-B is much more difficult than track-A, the TD3-AM algorithm has still successfully mastered the driving skills for track-B. In consecutive corners C3-C7 and hairpin corners C8 and C9, the driving agent utilizes the full width of the track to minimize corner curvature, resulting in a remarkably smooth trajectory. The overall trajectory closely aligns with the theoretical optimal racing line.

Figure 13: Flying lap trajectory by TD3-AM driving policy in track-B. The speed is illustrated by the color of the trajectory.

In real race situations, the friction constraint frequently changes due to tire wear, tire replacement, or a wet track. If the maximum friction decreases while the original driving policy is in use, there is a higher risk of violating the friction constraint, particularly in sharp corners. The proposed TD3-AM driving policy can easily adapt to different friction constraints by adjusting the friction limit in the action mapping function. To demonstrate this feature, we compare the TD3-AM driving policy using two action mapping functions whose friction limits $\mu_{\max}$ are $1.15g$ and $1.0g$, respectively. We use the most difficult part of track-B, consecutive corners C3-C7, to demonstrate the performance. The trajectories are compared in Fig. 14. The network output action $a_t$, the throttle/brake and steering control signals $u_t$, and the resultant acceleration are compared in Fig. 15, where the exit points of corners C3-C7 are marked. The trajectories of the driving policies with the constraints of $1.15g$ and $1.0g$ are similar, yet the policy with the $1.0g$ limit is clearly slower. Since the maximum friction is lower, driving through a corner at a lower speed is the only way to avoid losing grip.

Figure 14: Flying lap trajectory comparison by TD3-AM driving policy with different friction constraints ($1.15g$ and $1.0g$) in track-B corners C3 to C7.
Figure 15: Speed, throttle/brake action and control, steering action and control, and resultant acceleration by TD3-AM driving policy with different friction constraints ($1.15g$ and $1.0g$) in track-B corners C3 to C7.

From the resultant acceleration curves in Fig. 15, both driving policies satisfy their corresponding constraints. The acceleration curves are very close to the upper limit in the corners, which means that the policies make the most of the tire grip to pass the corners at high speed. This ability is mostly attributed to the AM mechanism. Specifically, from Fig. 15, when the resultant acceleration approaches the limit in sharp corners, the constrained control input space becomes smaller. At this stage, whenever the network outputs an action that is outside the constrained space, the action mapping function gives the corresponding control signal that barely complies with the constraint. For example, in corners C3 and C4, the actor networks of both policies output the full-throttle action, while the acceleration curves indicate that turning at the current speed is very close to the friction limit. The action mapping function then gives a lower throttle signal to maintain the maximum possible speed through the corner. It should be noted that directly using an action mapping function with a lower friction limit only guarantees that the driving policy satisfies the new friction constraint. The resulting driving policy does not achieve the near-optimal performance of the original one, and it may not successfully finish a lap if the new friction limit is much lower. However, it can be quickly improved to a near-optimal level with a few episodes of training in the lower-friction situation. In this way, with the AM mechanism, the learned driving policy can quickly generalize to lower-friction conditions without re-training the whole driving policy.
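
The throttle-reduction behavior described above can be illustrated with a toy example. The sketch below is not the paper's numerical AM approximation or its vehicle model: it assumes a point-mass model in which the lateral acceleration is $v^2 \kappa$ and the longitudinal acceleration is proportional to the throttle, and it uses bisection to scale an infeasible throttle request back to the friction boundary. The names `resultant_accel` and `map_throttle`, the maximum longitudinal acceleration, and the speed and curvature values are all hypothetical.

```python
import math

G = 9.81  # gravitational acceleration, m/s^2


def resultant_accel(throttle, speed, curvature, max_long_accel=8.0):
    """Toy point-mass model: longitudinal part from throttle, lateral part v^2 * kappa."""
    a_long = throttle * max_long_accel
    a_lat = speed ** 2 * curvature
    return math.hypot(a_long, a_lat)


def map_throttle(throttle, speed, curvature, mu_max_g, iters=50):
    """Scale the requested throttle toward zero until |a| <= mu_max_g * G."""
    limit = mu_max_g * G
    if resultant_accel(throttle, speed, curvature) <= limit:
        return throttle  # already feasible: pass through unchanged
    if resultant_accel(0.0, speed, curvature) > limit:
        return 0.0  # cornering alone exceeds the limit; throttle cannot fix it
    lo, hi = 0.0, 1.0  # feasible / infeasible scale factors on the request
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if resultant_accel(mid * throttle, speed, curvature) <= limit:
            lo = mid
        else:
            hi = mid
    return lo * throttle  # largest feasible scale found by bisection


speed, curvature = 30.0, 0.01  # lateral acceleration of 9.0 m/s^2 in this toy model
high_grip = map_throttle(1.0, speed, curvature, mu_max_g=1.15)
low_grip = map_throttle(1.0, speed, curvature, mu_max_g=1.0)
```

In this toy setting, the same full-throttle request is mapped to a smaller throttle under the lower friction limit, and changing the limit requires no change to the policy that produced the request.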

Track-B is modeled after a real race track, and many cars have been tested on this track by professional drivers. Although the driving policy evaluation on track-B is performed in the simulation environment, we can make a rough comparison between our learned driving policy and real professional drivers. We use the flying lap time data from [46] and select six cars with similar lap times for comparison. The cars’ basic performance parameters and lap times are listed in Table 5. The maximum power (Max Power) and the 0 to 100 km/h acceleration time (Acc. Time) indicate the car’s acceleration performance; the 100 km/h to 0 braking distance (Brake Dist.) is associated with the maximum friction force. These data show that, although our test model has no advantage in the basic performance parameters, the learned driving policy still achieves a comparable lap time. Given the disparities between the simulation environment and the real-world scenario, we cannot directly compare the performance of the learned policy with that of a professional driver. However, the proposed TD3-AM algorithm for race driving has demonstrated its potential to acquire professional-level driving skills.

Table 5: Flying lap time comparison on Ruisi Circuit (track-B)
| Make/Model | Max Power (kW) | Acc. Time (s) | Brake Dist. (m) | Lap Time (min:sec) |
| --- | --- | --- | --- | --- |
| BMW 325Li (2021) | 135 | 7.90 | 38.7 | 1:03.28 |
| Honda Civic (2022) | 134 | 8.13 | 38.2 | 1:03.91 |
| VW Golf 8 (2021) | 110 | 8.08 | 35.9 | 1:04.12 |
| Our Test Model | 125 | 8.80 | 42.5 | 1:04.16 |
| Ford Focus (2019) | 135 | 8.90 | 37.1 | 1:04.30 |
| KIA K5 (2020) | 176 | 7.48 | 34.1 | 1:05.10 |
| Mazda Atenza (2020) | 141 | 8.03 | 37.0 | 1:05.40 |
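
As a small worked check on Table 5, the lap times can be converted from min:sec to seconds and ranked. The helper name `lap_seconds` is introduced here for illustration; the lap times are copied from the table.

```python
def lap_seconds(text):
    """Convert a 'min:sec' lap time such as '1:04.16' to seconds."""
    minutes, seconds = text.split(":")
    return 60 * int(minutes) + float(seconds)


# Lap times copied from Table 5.
laps = {
    "BMW 325Li (2021)": "1:03.28",
    "Honda Civic (2022)": "1:03.91",
    "VW Golf 8 (2021)": "1:04.12",
    "Our Test Model": "1:04.16",
    "Ford Focus (2019)": "1:04.30",
    "KIA K5 (2020)": "1:05.10",
    "Mazda Atenza (2020)": "1:05.40",
}

ranked = sorted(laps, key=lambda name: lap_seconds(laps[name]))
position = ranked.index("Our Test Model") + 1
gap_to_fastest = lap_seconds(laps["Our Test Model"]) - lap_seconds(laps[ranked[0]])
```

With these values, the simulated test model places fourth of the seven entries, within one second of the fastest listed car.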

6 Conclusions and Future Work

In this paper, we present a novel numerical AM-RL framework for autonomous race driving. The proposed numerical AM mechanism enables the RL-based driving agent to safely operate the vehicle within the friction limit while maximizing its handling capability. Leveraging the proposed TD3-AM approach, we have successfully trained a race driving agent with professional-level skills. The simulation results highlight improved race driving performance and the generalization capability to different friction conditions of the proposed approach, representing a significant advancement in addressing the friction constraint in autonomous racing. The lap time of the TD3-AM driving policy is 22% shorter than the baseline TD3 driving policy, and the success rate is 90%, which is much higher than the baseline policies.

In future work, we aim to assess the practical applicability of our AM-RL framework through real-world car race experiments. Since an accurate dynamic model of the vehicle is unavailable, we intend to first establish a conservative friction constraint function using approximate physical parameters, and then progressively improve its accuracy through online learning. We will also explore methods to handle sensor noise and other uncertainties in real-world applications. Additionally, we plan to investigate the capabilities of AM-RL in addressing other nonlinear control problems with constraints.

References

  • [1] S. Grigorescu, B. Trasnea, T. Cocias, G. Macesanu, A survey of deep learning techniques for autonomous driving, Journal of Field Robotics 37 (3) (2020) 362–386.
  • [2] E. Yurtsever, J. Lambert, A. Carballo, K. Takeda, A survey of autonomous driving: Common practices and emerging technologies, IEEE access 8 (2020) 58443–58469.
  • [3] L. Liu, S. Lu, R. Zhong, B. Wu, Y. Yao, Q. Zhang, W. Shi, Computing systems for autonomous driving: State of the art and challenges, IEEE Internet of Things Journal 8 (8) (2020) 6469–6486.
  • [4] L. Chen, Y. Li, C. Huang, B. Li, Y. Xing, D. Tian, L. Li, Z. Hu, X. Na, Z. Li, et al., Milestones in autonomous driving and intelligent vehicles: Survey of surveys, IEEE Transactions on Intelligent Vehicles 8 (2) (2022) 1046–1056.
  • [5] Y. Li, J. Ibanez-Guzman, Lidar for autonomous driving: The principles, challenges, and trends for automotive lidar and perception systems, IEEE Signal Processing Magazine 37 (4) (2020) 50–61.
  • [6] D. Feng, C. Haase-Schütz, L. Rosenbaum, H. Hertlein, C. Glaeser, F. Timm, W. Wiesbeck, K. Dietmayer, Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges, IEEE Transactions on Intelligent Transportation Systems 22 (3) (2020) 1341–1360.
  • [7] H. Fujiyoshi, T. Hirakawa, T. Yamashita, Deep learning-based image recognition for autonomous driving, IATSS research 43 (4) (2019) 244–252.
  • [8] L. Claussmann, M. Revilloud, D. Gruyer, S. Glaser, A review of motion planning for highway autonomous driving, IEEE Transactions on Intelligent Transportation Systems 21 (5) (2020) 1826–1848. doi:10.1109/TITS.2019.2913998.
  • [9] J. Chen, W. Zhan, M. Tomizuka, Autonomous driving motion planning with constrained iterative lqr, IEEE Transactions on Intelligent Vehicles 4 (2) (2019) 244–254.
  • [10] A. Amini, I. Gilitschenski, J. Phillips, J. Moseyko, R. Banerjee, S. Karaman, D. Rus, Learning robust control policies for end-to-end autonomous driving from data-driven simulation, IEEE Robotics and Automation Letters 5 (2) (2020) 1143–1150.
  • [11] D. Li, D. Zhao, Q. Zhang, Y. Chen, Reinforcement learning and deep learning based lateral control for autonomous driving [application notes], IEEE Computational Intelligence Magazine 14 (2) (2019) 83–98.
  • [12] M. Guiggiani, The science of vehicle dynamics, Pisa, Italy: Springer Netherlands (2014) 15.
  • [13] R. Rajamani, Vehicle dynamics and control, Springer Science & Business Media, 2011.
  • [14] X. Yuan, Y. Wang, J. Liu, C. Sun, Action mapping: A reinforcement learning method for constrained-input systems, IEEE Transactions on Neural Networks and Learning Systems 34 (10) (2023) 7145–7157. doi:10.1109/TNNLS.2021.3138924.
  • [15] S. Fujimoto, H. Hoof, D. Meger, Addressing function approximation error in actor-critic methods, in: International Conference on Machine Learning, 2018, pp. 1587–1596.
  • [16] R. Verschueren, S. De Bruyne, M. Zanon, J. V. Frasch, M. Diehl, Towards time-optimal race car driving using nonlinear mpc in real-time, in: 53rd IEEE conference on decision and control, IEEE, 2014, pp. 2505–2510.
  • [17] R. Verschueren, M. Zanon, R. Quirynen, M. Diehl, Time-optimal race car driving using an online exact hessian based nonlinear mpc algorithm, in: 2016 European control conference (ECC), IEEE, 2016, pp. 141–147.
  • [18] P. Scheffe, T. M. Henneken, M. Kloock, B. Alrifaee, Sequential convex programming methods for real-time optimal trajectory planning in autonomous vehicle racing, IEEE Transactions on Intelligent Vehicles 8 (1) (2022) 661–672.
  • [19] R. C. T. Novi, A. Liniger, C. Annicchiarico, Real-time control for at-limit handling driving on a predefined path, Vehicle System Dynamics 58 (7) (2020) 1007–1036.
  • [20] A. Liniger, J. Lygeros, Real-time control for autonomous racing based on viability theory, IEEE Transactions on Control Systems Technology 27 (2) (2017) 464–478.
  • [21] J. Kabzan, L. Hewing, A. Liniger, M. N. Zeilinger, Learning-based model predictive control for autonomous racing, IEEE Robotics and Automation Letters 4 (4) (2019) 3363–3370.
  • [22] U. Rosolia, F. Borrelli, Learning how to autonomously race a car: a predictive control approach, IEEE Transactions on Control Systems Technology 28 (6) (2019) 2713–2719.
  • [23] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, E. A. Theodorou, Information theoretic mpc for model-based reinforcement learning, in: 2017 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2017, pp. 1714–1721.
  • [24] G. Williams, P. Drews, B. Goldfain, J. M. Rehg, E. A. Theodorou, Information-theoretic model predictive control: Theory and applications to autonomous driving, IEEE Transactions on Robotics 34 (6) (2018) 1603–1622.
  • [25] P. Drews, G. Williams, B. Goldfain, E. A. Theodorou, J. M. Rehg, Aggressive deep driving: Model predictive control with a cnn cost model, arXiv preprint arXiv:1707.05303.
  • [26] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, D. Wierstra, Continuous control with deep reinforcement learning, in: ICLR, 2016.
  • [27] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347.
  • [28] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, in: International conference on machine learning, PMLR, 2018, pp. 1861–1870.
  • [29] Y. Wang, J. Sun, H. He, C. Sun, Deterministic policy gradient with integral compensator for robust quadrotor control, IEEE Transactions on Systems, Man, and Cybernetics: Systems 50 (10) (2019) 3713–3725.
  • [30] J. Hwangbo, I. Sa, R. Siegwart, M. Hutter, Control of a quadrotor with reinforcement learning, IEEE Robotics and Automation Letters 2 (4) (2017) 2096–2103.
  • [31] S. Gu, E. Holly, T. Lillicrap, S. Levine, Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates, in: 2017 IEEE international conference on robotics and automation (ICRA), IEEE, 2017, pp. 3389–3396.
  • [32] T. Koller, F. Berkenkamp, M. Turchetta, A. Krause, Learning-based model predictive control for safe exploration, in: 2018 IEEE conference on decision and control (CDC), IEEE, 2018, pp. 6059–6066.
  • [33] J. Achiam, D. Held, A. Tamar, P. Abbeel, Constrained policy optimization, in: International conference on machine learning, PMLR, 2017, pp. 22–31.
  • [34] C. Tessler, D. J. Mankowitz, S. Mannor, Reward constrained policy optimization, arXiv preprint arXiv:1805.11074.
  • [35] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, Y. Tassa, Safe exploration in continuous action spaces, arXiv preprint arXiv:1801.08757.
  • [36] M. Jaritz, R. De Charette, M. Toromanoff, E. Perot, F. Nashashibi, End-to-end race driving with deep reinforcement learning, in: 2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2018, pp. 2070–2075.
  • [37] F. Fuchs, Y. Song, E. Kaufmann, D. Scaramuzza, P. Dürr, Super-human performance in gran turismo sport using deep reinforcement learning, IEEE Robotics and Automation Letters 6 (3) (2021) 4257–4264.
  • [38] P. R. Wurman, S. Barrett, K. Kawamoto, J. MacGlashan, K. Subramanian, T. J. Walsh, R. Capobianco, A. Devlic, F. Eckert, F. Fuchs, et al., Outracing champion gran turismo drivers with deep reinforcement learning, Nature 602 (7896) (2022) 223–228.
  • [39] A. Remonda, S. Krebs, E. E. Veas, G. Luzhnica, R. Kern, Formula rl: Deep reinforcement learning for autonomous racing using telemetry data, in: Workshop on Scaling-Up Reinforcement Learning: SURL, 2019.
  • [40] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, A. Sumner, Torcs, the open racing car simulator, Software available at http://torcs.sourceforge.net 4 (6) (2000) 2.
  • [41] A. Remonda, E. Veas, G. Luzhnica, Comparing driving behavior of humans and autonomous driving in a professional racing simulator, PLoS one 16 (2) (2021) e0245320.
  • [42] J. Niu, Y. Hu, B. Jin, Y. Han, X. Li, Two-stage safe reinforcement learning for high-speed autonomous racing, in: 2020 IEEE international conference on Systems, Man, and Cybernetics (SMC), IEEE, 2020, pp. 3934–3941.
  • [43] B. D. Evans, H. W. Jordaan, H. A. Engelbrecht, Safe reinforcement learning for high-speed autonomous racing, Cognitive Robotics 3 (2023) 107–126.
  • [44] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, M. Riedmiller, Deterministic policy gradient algorithms, in: International conference on machine learning, PMLR, 2014, pp. 387–395.
  • [45] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, N. Dormann, Stable-baselines3: Reliable reinforcement learning implementations, Journal of Machine Learning Research 22 (268) (2021) 1–8.
    URL http://jmlr.org/papers/v22/20-1364.html
  • [46] KBRACER, Ruisi circuit lap time leaderboard (Jan. 2023).
    URL https://kbracer.github.io