Learning Autonomous Race Driving with Action Mapping Reinforcement Learning

Yuanda Wang, Xin Yuan, Changyin Sun

School of Automation, Southeast University, Nanjing 210096, China; School of Artificial Intelligence, Anhui University, Hefei 230039, China

Abstract

Autonomous race driving poses a complex control challenge as vehicles must be operated at the edge of their handling limits to reduce lap times while respecting physical and safety constraints. This paper presents a novel reinforcement learning (RL)-based approach, incorporating the action mapping (AM) mechanism to manage state-dependent input constraints arising from limited tire-road friction. A numerical approximation method is proposed to implement AM, addressing the complex dynamics associated with the friction constraints. The AM mechanism also allows the learned driving policy to be generalized to different friction conditions. Experimental results in our developed race simulator demonstrate that the proposed AM-RL approach achieves superior lap times and better success rates compared to the conventional RL-based approaches. The generalization capability of the driving policy with AM is also validated in the experiments.

keywords:

Autonomous race driving, reinforcement learning, safety constraint, action mapping.

1 Introduction

Autonomous driving has been a hot topic in both research and industry in recent decades [1, 2, 3, 4]. Various environmental perception [5, 6, 7], planning [8, 9], and motion control [10, 11] approaches for autonomous driving have been proposed. Many of them have been successfully applied to regular cars on the market. Current autonomous driving techniques cover most driving scenarios, including highway driving, urban driving, and autonomous parking. In this work, we study autonomous driving in the race driving scenario, where advanced driving skills are required to fully utilize the car’s handling capability and minimize the lap time. In highway and urban driving scenarios, the vehicle is operated near the equilibrium point, and the dynamics can be approximated by a linear model. In contrast, in race driving, the vehicle is operated near its physical limits, and thus the complex nonlinear dynamics and constraints should be considered [12, 13]. These factors make autonomous race driving more challenging in terms of control than other driving scenarios.

The autonomous race driving controller should respect multiple input constraints due to physical and safety limits. The control inputs, including steering, acceleration, and deceleration, have static range limits. For RL-based methods, this constraint can be easily addressed by applying a sigmoid or tanh activation function in the output layer of the policy network. Moreover, the inputs are further restricted by the limit of tire-road friction, which depends on the car’s instantaneous motion states. For this kind of state-dependent constraint, most existing RL-based control methods use penalty terms in the reward function, which penalize the current policy when the constraint is violated. Although this reward-shaping solution is rather simple, the learned policy tends to be relatively conservative. To address this state-dependent input constraint in RL-based control, we introduce the action mapping (AM) mechanism [14], which converts the actions output directly by the neural network into real control inputs that satisfy the state-dependent constraints using a pre-defined mapping function. This mechanism could effectively address the conservatism associated with penalty-based solutions.

In this paper, we develop a novel numerical AM-RL framework for autonomous race driving. The state-dependent control input constraint due to the limit of tire-road friction is addressed by the AM mechanism. By establishing a mapping from unconstrained network output actions to constrained control inputs, it can be guaranteed that the vehicle is always controlled within the friction limits. However, the vehicle dynamics related to the friction limit are rather complex, and it is very difficult to find a closed-form expression for the mapping function. Therefore, we further propose a numerical approximation method to implement AM. Then, we incorporate AM with the twin delayed deep deterministic policy gradient (TD3) algorithm [15] to train the race driving policy. For the states, we use a set of forward-observation points to indicate the curvature of the race track ahead. The reward function is specially designed to encourage the driving policy to maximize the car’s velocity along the race track while avoiding driving off the track or driving in the wrong direction. Finally, the proposed race control approach is evaluated in our developed race simulator. The race driving policy trained with TD3 and AM obtains shorter lap times and higher success rates than the compared approaches. The main contributions of this paper are summarized as follows:

1. The AM mechanism is introduced to the RL-based autonomous race driving control problem to address the state-dependent input constraints arising from limited tire-road friction.

2. The AM mechanism enables the RL-based controller to better utilize the maximum tire-road friction, addressing the conservatism often associated with conventional reward-shaping solutions. The AM mechanism also allows the learned driving policy to be generalized to different friction conditions by adapting the friction constraint in the action mapping function.

3. A numerical approximation method is proposed to implement the AM mechanism, overcoming the challenges of dealing with complex nonlinear dynamics with constraints. This numerical method further extends the original AM mechanism, enabling it to address constrained RL problems of a more general form.

The remainder of this paper is organized as follows. Section 2 reviews related work on autonomous racing and safe RL. In Section 3, we briefly introduce the race vehicle’s dynamic model and the constraint of tire friction. In Section 4, the developed RL algorithm with the AM mechanism for race driving is explained in detail. The simulation experiments, results, and discussions are presented in Section 5. Finally, Section 6 concludes this paper.

2 Related Work

In recent decades, model predictive control (MPC) has been one of the major methods used to address the challenge of autonomous racing for both trajectory planning and motion control. In [16], the time-optimal driving control task with the constraint of race track boundaries is formulated as a nonlinear model predictive control (NMPC) problem using a generalized Gauss-Newton method. However, the tire-road friction constraint in race driving dynamics is not considered. This NMPC-based method is further extended in [17] where the lateral tire-road friction limit is described by a Pacejka model. Then, a Hessian sequential quadratic programming optimization algorithm is employed to solve the NMPC problem. Similar MPC-based methods are also presented in [18, 19, 20]. In those methods, the control strategies are developed based on the vehicle’s dynamic model and the race track’s geometrical model. To reduce the computational complexity of the optimization problems, the models are usually simplified and linearized. This could seriously affect the performance of the controller if applied to a real vehicle. To address this issue, the MPC is incorporated with some data-driven and learning-based approaches. The measurement data from real-world experiments are used to refine the dynamic model and optimize the controller. In [21], a learning-based MPC approach is proposed for autonomous racing. A relatively simple nominal vehicle model is built first. Then, the model is improved by Gaussian process regression based on the measurement data. Similarly, in [22], the affine time-varying prediction model is used to approximate the vehicle model. Moreover, the model predictive path integral (MPPI) control is proposed for autonomous racing in [23, 24]. 
With the help of the model learning ability, many control strategies have achieved success in real-world experiments with a variety of race cars, including a full-size electric formula race car [21], a 1:10 scale RC car [22], and a 1:5 scale rally race car [23, 24, 25].

More recently, the rapid advancement of machine learning has introduced novel solutions to complex control problems. Reinforcement learning (RL), a significant type of machine learning approach, has found extensive applications in addressing continuous control problems. The RL-based control algorithm directly trains a neural network-based control policy that maximizes a reward function by interacting online with the real system or by using saved trajectories. Therefore, prior knowledge of the system dynamics might not be needed, and the reward function is fully flexible. Many newly proposed RL algorithms, such as deep deterministic policy gradient (DDPG) [26], twin delayed deep deterministic policy gradient (TD3) [15], proximal policy optimization (PPO) [27], and soft actor-critic (SAC) [28], have shown notable capability in the control of complex nonlinear dynamic systems, like quadrotor helicopters and multi-joint manipulators [29, 30, 31].

In many physical control applications utilizing RL, the control policy must adhere to safety constraints. In the aforementioned RL algorithms, one could impose penalties on actions violating constraints to derive a safe policy. However, this approach is often inefficient and does not ensure safety throughout the training process. To address this issue, several safe RL algorithms have been proposed. In [32], the MPC method is combined with the RL algorithm to guarantee safe exploration in training. In [33], a general-purpose RL policy search algorithm named constrained policy optimization (CPO) is proposed based on the trust region method. Reward constrained policy optimization (RCPO) [34] is a similar safe RL algorithm that uses a penalty signal to guide the policy toward a constraint-satisfying solution. Another kind of ‘plug-and-play’ style safe mechanism works directly on top of the policy to correct the actions without changing the original RL algorithm, such as the safety layer (SL) technique [35] and the AM mechanism [14]. Although SL and AM share a similar structure, their specific approaches to building the constraint model and correcting unsafe actions are significantly different. A detailed comparison and discussion of the two methods, along with experiment results, are provided in Section 5. Moreover, in contrast to model-free RL algorithms, many safe RL algorithms or safe mechanisms require some prior knowledge of the environment, such as the dynamic model and constraint function model. If these models are not directly accessible, they can be acquired through a model identification process guided by safe rules or human demonstration.

The RL-based control methods have also been widely applied in the fields of autonomous driving and racing. In [36], a rally race driving policy is learned with the A3C algorithm in an end-to-end way. The image from a forward-facing camera is directly fed into the policy network without any mediated perception. Similarly, a vision-based lateral control strategy for autonomous driving on a race track is developed in [11]. A convolutional neural network is built to extract track features from driver-view images. Then, a DDPG-based control policy gives control commands based on the track features and the car’s speed. Furthermore, a race driving agent named ‘GT Sophy’ has achieved super-human performance in the Gran Turismo game [37, 38]. The agent is trained using an improved SAC algorithm. Instead of using images, this agent’s observation of the track ahead is represented by a series of points along the centerline and each edge of the track. Similarly, Remonda et al. [39] propose to use the look-ahead curvature to represent the upcoming shape of the track, and train policies with diverse variants of DDPG (with long short-term memory, prioritized experience replay, multi-step target, etc.). Their proposed approach outperforms the state-of-the-art bots in the TORCS simulator [40] and even surpasses professional drivers in qualifying sessions in a professional simulator [41].

In the aforementioned RL-based autonomous racing approaches, the safety constraint is not explicitly considered. Due to the complexity of autonomous racing tasks, general-purpose safe RL algorithms are difficult to apply. For example, implementing the CPO algorithm in autonomous racing is quite challenging as it involves evaluating constraint functions to determine the feasibility of a certain control policy. Consequently, safe RL methods for autonomous racing are usually specially designed. Niu et al. [42] propose a two-stage safe RL racing approach. In the first stage, a rule-based safeguard module is employed to enforce the constraint during the policy training at low speed. Then, in the second stage, the rule-based module is replaced with a data-driven module to develop a closed-form analytical safety solution at high speed. This approach is validated with the TORCS simulator and achieves zero safety violations. In [43], a viability theory-based safety supervisory architecture is proposed. The supervisor is built with a viability kernel based on the car’s dynamic model. It ensures the vehicle stays within the friction limit while maintaining recursive feasibility during the training process. Among the approaches discussed above, incorporating an additional safety module emerges as a common and logical solution for safe RL in autonomous racing. Our proposed approach, featuring the AM mechanism, also adopts this strategy.

3 Vehicle Model

In this section, the single track vehicle model and friction constraint for autonomous race driving are presented. The single track model, also known as the ‘bicycle model’, is commonly used in car handling studies [12]. As shown in Fig. 1, based on the assumption that both front wheel steering systems have equal gear ratios, the dynamics and kinematics of the two front wheels and two rear wheels can be represented by two center wheels located at the centers of the front and rear axles. The pose and motion of the race car are defined under two coordinate frames: earth-fixed frame ( $o_{e}$ - $X_{e}Y_{e}Z_{e}$ ) and vehicle body-fixed frame ( $o_{b}$ - $X_{b}Y_{b}Z_{b}$ ).

The notations used to describe the vehicle model in Fig. 1 are listed as follows:

1. $v$ : vehicle velocity at center of gravity in body-fixed frame.
2. $v_{x},v_{y}$ : vehicle longitudinal and lateral velocity in body-fixed frame.
3. $\psi$ : vehicle heading angle (yaw angle).
4. $\omega$ : vehicle turning rate.
5. $\delta$ : front wheel steering angle.
6. $\beta$ : vehicle sideslip angle.
7. $l_{f},l_{r}$ : distance from center of gravity to front/rear axle.
8. $\alpha_{f},\alpha_{r}$ : tire sideslip angles of front/rear wheels.
9. $F_{yf},F_{yr}$ : lateral tire force on front/rear wheels.
10. $F_{mx}$ : traction force on driven wheel.
11. $F_{bx}$ : braking force on wheels.

Figure 1: Single track vehicle model in earth-fixed frame and vehicle-fixed frame.

Moreover, the effects of longitudinal and lateral load transfer are omitted. The race track is assumed to be flat and to have a uniform friction coefficient. In the following, the vehicle’s longitudinal dynamics, lateral dynamics, and the tire friction constraint are presented.

3.1 Longitudinal Dynamics

The longitudinal model describes the vehicle’s motion along $X_{b}$ -axis of the body-fixed frame. A force balance along the vehicle’s longitudinal direction yields:

m\dot{v}_{x}=F_{tx}-F_{aero}-F_{roll}

(1)

where $m$ is the total mass of the vehicle, $F_{tx}$ is the longitudinal tire force, $F_{aero}$ is the aerodynamic drag force, and $F_{roll}$ is the rolling resistance force. The longitudinal tire force is the friction force from the track that acts on the tire. It is the reaction force to the traction force or braking force.

3.1.1 Traction Force

The traction force is generated by the vehicle’s powertrain and acts on the driven wheels. Unlike a gasoline engine with a gearbox, the electric motor has special characteristics when generating torque at different speeds. If the motor speed is no greater than the base speed or rated speed, that is, $n_{\text{m}}\leq n_{\text{b}}$ , the motor works in the constant torque region, and the output torque can be directly controlled by the motor control signal. The traction force is

F_{mx}=\frac{K_{m}u_{m}}{R_{w}}

(2)

where $K_{m}$ is the motor torque coefficient, which is determined by the transmission efficiency and the reduction gear ratio, $R_{w}$ is the radius of wheels, $u_{m}\in[0,1]$ is the normalized motor control input. If the motor speed is greater than the base speed, that is, $n_{\text{m}}>n_{\text{b}}$ , the motor works in the constant power region. The output torque is limited by the maximum power $P_{\text{max}}$ . The traction force is

F_{mx}=\min\Big[\frac{K_{m}u_{m}}{R_{w}},~\frac{P_{\text{max}}}{n_{\text{m}}R_{w}}\Big]

(3)

3.1.2 Braking Force

The braking force is generated by the vehicle’s braking system and acts on all wheels. We assume the braking force is proportional to the braking control signal, that is,

F_{bx}=K_{\text{b}}u_{b}

(4)

where $u_{b}\in[0,1]$ is the normalized braking control signal, $K_{\text{b}}$ is the coefficient of the braking system.

3.1.3 Aerodynamic Drag Force

The equivalent aerodynamic drag force on a vehicle in a windless environment can be given by

F_{aero}=\frac{1}{2}\rho_{A}C_{d}A_{F}v_{x}^{2}

(5)

where $\rho_{A}$ is the density of air, $C_{d}$ is the aerodynamic drag coefficient, $A_{F}$ is the frontal area of the vehicle.

3.1.4 Rolling Resistance Force

The rolling resistance is roughly proportional to the down force on the tires, that is,

F_{roll}=f_{r}mg

(6)

where $f_{r}$ is the rolling resistance coefficient, and $g$ is the acceleration due to gravity.

The longitudinal motion of the car is controlled by the motor control signal $u_{m}$ and the brake control signal $u_{b}$ . However, the two control inputs cannot be applied simultaneously in real situations. Therefore, we combine them into a single control signal $u_{x}\in[-1,1]$ , where $u_{x}=1$ denotes full motor power and $u_{x}=-1$ denotes full braking.
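As an illustration, the longitudinal model of Eqs. (1)-(6) can be sketched in Python as follows. This is a minimal sketch and not the simulator implementation: the function names and the parameter dictionary `p` are hypothetical, the net tire force is assumed to be $F_{tx}=F_{mx}-F_{bx}$, the drag term assumes the standard quadratic dependence on $v_{x}$, and the motor speed `n_m` is assumed to be given in units consistent with $P_{\max}$.

```python
def traction_force(u_m, n_m, K_m, R_w, P_max, n_b):
    """Traction force, Eqs. (2)-(3): torque-limited below base speed n_b,
    additionally power-limited above it."""
    torque_limited = K_m * u_m / R_w
    if n_m <= n_b:
        return torque_limited
    return min(torque_limited, P_max / (n_m * R_w))


def longitudinal_accel(u_x, v_x, n_m, p):
    """Longitudinal acceleration from Eq. (1) for the combined input u_x in [-1, 1].

    p is a hypothetical dict of vehicle parameters (illustrative only).
    """
    u_m = max(u_x, 0.0)   # positive part of u_x drives the motor
    u_b = max(-u_x, 0.0)  # negative part of u_x applies the brake
    f_mx = traction_force(u_m, n_m, p["K_m"], p["R_w"], p["P_max"], p["n_b"])
    f_bx = p["K_b"] * u_b                                          # Eq. (4)
    f_aero = 0.5 * p["rho_A"] * p["C_d"] * p["A_F"] * v_x ** 2     # Eq. (5), quadratic drag assumed
    f_roll = p["f_r"] * p["m"] * p["g"]                            # Eq. (6)
    return (f_mx - f_bx - f_aero - f_roll) / p["m"]
```

Splitting $u_{x}$ into its positive and negative parts guarantees that the motor and brake commands are never active at the same time, which matches the single-signal convention described above.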

3.2 Lateral Dynamics

The lateral model describes the vehicle’s translational motion along $Y_{b}$ -axis and rotational motion around $Z_{b}$ -axis. A force balance along the vehicle’s lateral direction yields

m\dot{v}_{y}=F_{yf}+F_{yr}

(7)

A moment balance around $Z_{b}$ -axis yields the rotational dynamics as

I_{z}\dot{\omega}=F_{yf}l_{f}-F_{yr}l_{r}

(8)

where $I_{z}$ is the vehicle body’s moment of inertia about the body-fixed $Z_{b}$ -axis. The lateral tire forces $F_{yf}$ and $F_{yr}$ are proportional to their corresponding tire sideslip angles, that is,

	$\displaystyle F_{yf}$	$\displaystyle=2C_{\alpha_{f}}\alpha_{f}=2C_{\alpha_{f}}(\delta-\theta_{v_{f}})$		(9)
	$\displaystyle F_{yr}$	$\displaystyle=2C_{\alpha_{r}}\alpha_{r}=2C_{\alpha_{r}}(-\theta_{v_{r}})$		(10)

where $C_{\alpha_{f}}$ and $C_{\alpha_{r}}$ are cornering stiffness of the front and rear tires, $\theta_{v_{f}}$ and $\theta_{v_{r}}$ denote the direction of the velocity of front and rear wheels, and they can be obtained by

	$\displaystyle\theta_{v_{f}}=\arctan\frac{v_{y}+\omega l_{f}}{v_{x}}$		(11)
	$\displaystyle\theta_{v_{r}}=\arctan\frac{v_{y}-\omega l_{r}}{v_{x}}$		(12)
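The linear tire model of Eqs. (9)-(12) can be evaluated as in the following minimal Python sketch. The function name and argument names are hypothetical, and the sketch assumes $v_{x}>0$ so that the velocity direction angles are well defined.

```python
import math


def lateral_tire_forces(v_x, v_y, omega, delta, l_f, l_r, C_af, C_ar):
    """Front and rear lateral tire forces from the linear tire model.

    Assumes v_x > 0; C_af and C_ar are the per-tire cornering stiffnesses.
    """
    theta_vf = math.atan((v_y + omega * l_f) / v_x)  # Eq. (11)
    theta_vr = math.atan((v_y - omega * l_r) / v_x)  # Eq. (12)
    F_yf = 2.0 * C_af * (delta - theta_vf)           # Eq. (9)
    F_yr = 2.0 * C_ar * (-theta_vr)                  # Eq. (10)
    return F_yf, F_yr
```

The factor of two accounts for the two tires on each axle being lumped into one center wheel in the single track model.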

Finally, the vehicle’s velocity in earth-fixed frame can be expressed with the following kinematic equations:

	$\displaystyle\dot{X}$	$\displaystyle=v\cos(\psi+\beta)$		(13)
	$\displaystyle\dot{Y}$	$\displaystyle=v\sin(\psi+\beta)$		(14)

Moreover, the lateral motion of the vehicle can also be approximated by a simpler kinematic model under the assumption that the slip angles of the front and rear wheels are both zero. Then, the vehicle sideslip angle and turning rate are expressed by:

	$\displaystyle\beta$	$\displaystyle=\arctan\Big{(}\frac{l_{r}\tan{(\delta)}}{l_{f}+l_{r}}\Big{)}$		(15)
	$\displaystyle\dot{\psi}$	$\displaystyle=\omega=\frac{v\tan(\delta)\cos(\beta)}{l_{f}+l_{r}}$		(16)

For an autonomous racing vehicle, the steering angle of the front wheels $\delta$ is actuated by the electric power steering system. The input can be regarded as a normalized steering angular speed signal $u_{y}\in{[-1,1]}$ . Setting $u_{y}=1$ or $-1$ makes the power steering system turn the steered wheels to the left or right at the maximum angular speed $\dot{\delta}_{\max}$ , and $u_{y}=0$ maintains the current steering angle.
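To make the kinematic model concrete, one integration step of Eqs. (13)-(16) together with the steering-rate input can be sketched as below. This is only an illustration: the forward-Euler discretization, the function name, and the steering angle limit `delta_max` are assumptions that do not come from the model description above.

```python
import math


def kinematic_step(X, Y, psi, delta, v, u_y, dt,
                   l_f, l_r, delta_rate_max, delta_max):
    """One forward-Euler step of the kinematic single track model.

    delta_max is a hypothetical mechanical steering limit (assumption).
    """
    beta = math.atan(l_r * math.tan(delta) / (l_f + l_r))        # Eq. (15)
    omega = v * math.tan(delta) * math.cos(beta) / (l_f + l_r)   # Eq. (16)
    X_next = X + v * math.cos(psi + beta) * dt                   # Eq. (13)
    Y_next = Y + v * math.sin(psi + beta) * dt                   # Eq. (14)
    psi_next = psi + omega * dt
    # u_y in [-1, 1] commands the steering angular speed.
    delta_next = delta + u_y * delta_rate_max * dt
    delta_next = max(-delta_max, min(delta_max, delta_next))
    return X_next, Y_next, psi_next, delta_next
```

Because $u_{y}$ commands the steering rate rather than the angle, the steering angle $\delta$ becomes part of the vehicle state, which is why it appears in the state vector used later.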

3.3 Constraint of Tire Friction

The longitudinal and lateral forces that control the vehicle rely on the friction force or grip between the tire and the road. In real conditions, the friction force has an upper limit, which depends on a number of physical factors, including road surface material, tire size, tire pressure, tire temperature, etc. If we assume these physical factors remain the same during the race, we could use a friction circle model to describe the constraint of tire friction. As shown in Fig. 2, the friction circle model indicates that the total resultant force $\vec{F}_{xy}$ applied on the vehicle cannot exceed the maximum friction force, which is

|\vec{F}_{x}+\vec{F}_{y}|=|\vec{F}_{xy}|<\mu_{\max}F_{z}

(17)

where $\vec{F}_{x}$ and $\vec{F}_{y}$ are the longitudinal and lateral resultant force vectors, $F_{z}$ is the total vertical load on the tires. $\mu_{\text{max}}$ is the maximum tire-road friction coefficient. The aerodynamic lift force and downforce are omitted in the vehicle model, and we have $F_{z}=mg$ . Then, the constraint can also be represented by $|\vec{a}_{xy}|<\mu_{\max}g$ , where $\vec{a}_{xy}$ is the resultant acceleration measured from the vehicle. During the race, sudden and large braking or steering input, especially at high speed, could easily exceed the upper limit of grip, which could result in side slip or tail slip, and bring the vehicle into an uncontrollable spin. To guarantee the vehicle is controllable and stable in a race, we should always consider the constraint of tire friction when designing an autonomous race driving controller.
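The acceleration form of the friction circle constraint, $|\vec{a}_{xy}|<\mu_{\max}g$ , can be checked with a few lines of Python. The function names below are hypothetical, and the default value of `g` is an assumed constant.

```python
import math


def within_friction_limit(a_x, a_y, mu_max, g=9.81):
    """True if the resultant acceleration lies strictly inside the friction circle."""
    return math.hypot(a_x, a_y) < mu_max * g


def friction_utilization(a_x, a_y, mu_max, g=9.81):
    """Fraction of the available grip in use (values >= 1 violate the constraint)."""
    return math.hypot(a_x, a_y) / (mu_max * g)
```

For example, with $\mu_{\max}=1$ a combined acceleration of $(3, 4)~\text{m/s}^{2}$ has magnitude $5~\text{m/s}^{2}$ and satisfies the constraint, while $(6, 8)~\text{m/s}^{2}$ has magnitude $10~\text{m/s}^{2}$ and violates it.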

4 Race Driving with AM-RL

In this section, the autonomous race driving policy is developed with an RL-based method. To guarantee the driving policy satisfies the safety constraint of tire friction in both training and implementation, we apply our proposed AM mechanism to the RL algorithm. In the following, we first introduce the basics of RL and the race driving Markov decision process (MDP) model. Then, the AM mechanism for the tire friction constraint is described. Finally, we present the implementation process of applying AM to a specific RL algorithm and train an autonomous race driving policy.

4.1 Race Driving MDP Model

In this subsection, the race driving MDP model for RL-based autonomous racing approaches is presented. The basic idea of RL is iteratively optimizing the control policy of an agent based on the input-output experiences from a step-by-step agent-environment interaction system. The goal is to maximize the accumulated every-step reward. The agent-environment interaction system is generally modeled as an MDP, which is defined by a tuple $\langle\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma\rangle$ , where $\mathcal{S}$ is the set of states, $\mathcal{A}$ is the set of actions, $\mathcal{P}$ and $\mathcal{R}$ are the state transition model and the reward function respectively, $\gamma\in[0,1]$ is the discount factor. The policy is denoted as $\pi$ which maps states to actions. For each step of the interaction at time step $t$ , the agent in state $s_{t}\in\mathcal{S}$ , takes action $a_{t}\in\mathcal{A}$ following its policy $\pi$ , then receives a reward $r_{t+1}$ with the reward function $\mathcal{R}$ , and finally enters a new state $s_{t+1}$ following the state transition model $\mathcal{P}$ . The accumulated reward from step $t$ onward is defined as the return, $R_{t}=\sum^{T_{\max}}_{t^{\prime}=t}\gamma^{t^{\prime}-t}r_{t^{\prime}}$ . The expected return following the policy $\pi$ can be represented by the state-action value function, which is defined as $Q^{\pi}(s,a)=\mathbb{E}_{\pi}[R_{t}|s_{t},a_{t}]$ .
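As a small worked illustration of the return defined above, the discounted sum of a finite reward sequence can be computed with a backward recursion. The function name is hypothetical; `rewards` is assumed to hold the rewards collected from step $t$ to the end of the episode, in order.

```python
def discounted_return(rewards, gamma):
    """Discounted sum: rewards[0] + gamma * rewards[1] + gamma^2 * rewards[2] + ..."""
    ret = 0.0
    for r in reversed(rewards):
        # Backward recursion: return at step k = r_k + gamma * return at step k+1.
        ret = r + gamma * ret
    return ret
```

For instance, three unit rewards with $\gamma=0.5$ give $1+0.5+0.25=1.75$ .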

The goal of race driving in this study is to finish a lap of the track safely in the shortest possible time. To solve the race driving task with RL, the race driving MDP model which expresses the interaction between an autonomous driving agent and a car-track environment is established first. In this MDP, the state transition model is determined by the aforementioned vehicle dynamic model. The fundamental principle of building the state and action space is to make the agent-environment interaction satisfy the Markov property to the largest extent. Therefore, all system states that could affect the race car’s motion should be included. Following this principle, the state, action, and reward function are defined and described as follows.

State: The states of the race driving MDP are divided into two groups. The first group of states describes the race car’s pose and motion. In the first group of states, $v_{x}$ , $\omega$ , and $\delta$ denote the vehicle’s longitudinal velocity, turning rate, and front wheel steering angle, respectively. These states have been defined in Section 3. The car’s position and orientation with respect to the track are expressed by the relative cross-centerline distance $d_{c}\in[-1,1]$ and the relative heading angle $\phi$ . As shown in Fig. 3, $d_{c}$ is the distance from the car’s center of gravity to the centerline of the track. $\phi\in(-\pi,\pi]$ is the car’s heading angle with respect to the tangential direction of the centerline projected point (orthogonal projection of the car’s center of gravity onto the centerline). $\phi=0$ means the car is precisely following the track direction, while $\phi>\pi/2$ or $\phi<-\pi/2$ means the car is traveling in the opposite direction of the track, also known as wrong-way driving. The second group of states is composed of $N_{\text{FO}}$ forward-observation vectors, which are used to indicate the curvature of the race track ahead. As shown in Fig. 3, the vectors $\vec{V}_{d_{i}}$ (red dotted arrow lines) all start from the car’s center point $(X,Y)$ and point to the forward-observation points $(X_{d_{i}},Y_{d_{i}})$ on the centerline. The forward-observation points are determined by moving the car’s centerline projected point forward along the centerline over a distance of $d_{i}$ . Specifically, a forward-observation vector $\vec{V}_{d_{i}}$ is defined as:

\vec{V}_{d_{i}}=\mathbf{R}^{b}_{e}(\psi)[X_{d_{i}}-X,~Y_{d_{i}}-Y]^{T}

(18)

where $\mathbf{R}^{b}_{e}(\psi)\in\mathbb{R}^{2\times 2}$ is the rotation matrix that transforms the forward-observation vector from the earth-fixed frame to the body-fixed frame. Then, the $N_{\text{FO}}$ forward-observation vectors are concatenated to form a forward-observation feature $\mathbf{V}_{\text{FO}}=[\vec{V}_{d_{1}},\vec{V}_{d_{2}},\ldots,\vec{V}_{d_{N_{\text{FO}}}}]\in\mathbb{R}^{2\times N_{\text{FO}}}$ . The state vector of the race driving MDP is defined as:

s_{t}=[v_{x},\omega,\delta,d_{c},\phi,\mathbf{V}_{\text{FO}}]

(19)
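The construction of the forward-observation feature in Eq. (18) and the state vector in Eq. (19) can be sketched as follows. This is a minimal sketch under stated assumptions: the forward-observation points are assumed to be already available as a list of $(X_{d_{i}},Y_{d_{i}})$ pairs in the earth-fixed frame, the rotation matrix is taken in the standard planar form, the feature is flattened into a one-dimensional list, and the function names are hypothetical.

```python
import math


def forward_observation_feature(X, Y, psi, points):
    """Express each forward-observation point in the body-fixed frame, Eq. (18).

    points: iterable of (X_d, Y_d) centerline points in the earth-fixed frame.
    """
    c, s = math.cos(psi), math.sin(psi)
    feature = []
    for X_d, Y_d in points:
        dx, dy = X_d - X, Y_d - Y
        # Earth-to-body rotation: [[cos, sin], [-sin, cos]] applied to (dx, dy).
        feature.append((c * dx + s * dy, -s * dx + c * dy))
    return feature


def state_vector(v_x, omega, delta, d_c, phi, feature):
    """Assemble the state of Eq. (19) as a flat list of length 5 + 2 * N_FO."""
    flat = [coord for vec in feature for coord in vec]
    return [v_x, omega, delta, d_c, phi] + flat
```

Expressing the points in the body-fixed frame makes the feature independent of the car’s absolute position and heading, so the same track shape ahead produces the same observation anywhere on the track.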

Action: The vehicle is controlled by the motor/brake control signal $u_{x}$ and steering control signal $u_{y}$ . Therefore, we define two actions, $a_{x}$ and $a_{y}$ , as the output from the race driving policy. The action vector is defined as:

a_{t}=[a_{x},a_{y}]

(20)

The relationship between the actions and the control signals is further explained in Section 4.2.

Reward Function: The reward function guides the RL algorithm to optimize the policy toward a certain objective. To define a proper reward function for the autonomous race driving task, the objective of finishing a lap safely in the shortest time should be broken down into each time step. From the perspective of a single time step, the driving policy should maximize the car’s velocity along the track direction, which is denoted by $v_{a}$ as shown in Fig. 3. Therefore, our velocity reward is set as: $r_{\text{vel}}=v_{a}=v_{x}\cos{\phi}$ . Furthermore, to guide the driving policy to operate safely, we set negative rewards $r_{\text{out}}=-100$ for driving off the track, $r_{\text{ww}}=-100$ for driving in the wrong direction, and $r_{\text{tf}}=-100$ for violating the tire friction constraint. If the car is safely operated inside the track, $r_{\text{out}}=r_{\text{ww}}=r_{\text{tf}}=0$ . Then, the reward function is defined as:

r_{t}=r_{\text{vel}}+r_{\text{out}}+r_{\text{ww}}+r_{\text{tf}}

(21)

It is noted that the negative reward $r_{\text{tf}}$ only applies to conventional RL algorithms that are not specially designed to address the friction constraint. For the proposed AM-RL approaches, $r_{\text{tf}}$ is not needed.
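The reward of Eq. (21) can be written compactly as in the sketch below. The function name, the boolean flags, and the `use_friction_penalty` switch are hypothetical; the switch is set to `False` to mimic the AM-RL setting in which $r_{\text{tf}}$ is not needed. Wrong-way driving is detected with the condition $|\phi|>\pi/2$ given in the state definition.

```python
import math


def step_reward(v_x, phi, off_track, friction_violated, use_friction_penalty=True):
    """Per-step reward, Eq. (21): progress along the track plus safety penalties."""
    r_vel = v_x * math.cos(phi)                         # velocity along the track direction
    r_out = -100.0 if off_track else 0.0                # driving off the track
    r_ww = -100.0 if abs(phi) > math.pi / 2 else 0.0    # wrong-way driving
    r_tf = -100.0 if (use_friction_penalty and friction_violated) else 0.0
    return r_vel + r_out + r_ww + r_tf
```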

4.2 AM Mechanism for Tire Friction Constraint

The autonomous racing control policy should generate proper control inputs that satisfy the constraint of tire friction to prevent the car from entering uncontrollable states. Based on the car’s dynamic model, the friction constraint changes dynamically and depends on the car’s speed and steering angle. In this subsection, we propose a numerical AM mechanism to tackle this type of state-dependent input constraint of tire friction.

The AM mechanism addresses the state-dependent constraint by establishing a mapping between an unconstrained virtual policy $\pi_{\text{V}}$ and a constrained real policy $\pi_{\text{R}}$ . The virtual policy is represented by the neural network which directly maps states to actions, and it does not consider the constraints. Next, the virtual policy is converted to its corresponding constrained real policy that satisfies the state-dependent constraint. The real policy directly interacts with the real system, while the virtual policy is optimized with the RL algorithm using the interaction experiences.

More specifically, we define the unconstrained virtual policy as $a=\pi_{\text{V}}(s)$ , where $s\in\mathcal{S}$ is the state vector. Here we define $\mathcal{S}$ as a compact subspace of $\mathbb{R}^{N_{s}}$ , and $N_{s}$ is the dimension of the state space. $a\in\mathcal{A}$ is the virtual action, and $\mathcal{A}$ is the unconstrained action space. To be compatible with a policy represented by a neural network, in the following, the unconstrained action space $\mathcal{A}$ is defined as $[-1,1]^{N_{a}}$ which is a compact subspace of $\mathbb{R}^{N_{a}}$ , and ${N_{a}}$ is the dimension of the action space. Let $G$ be a compact set-valued map from $\mathcal{S}$ to the power set $P(\mathbb{R}^{N_{a}})$ , and $G$ is characterized by its graph $\text{Graph}(G)=\{(x,y)|x\in\mathcal{S},y\in G(x)\}$ . In this work, we use $G(s)$ to denote the control input space that satisfies the state-dependent constraint. Then, the constrained real policy is defined as $u=\pi_{\text{R}}(s)$ , where $u\in G(s)$ is the real control input.

Let $\mathcal{F}$ be the set of all continuous functions mapping from $\mathcal{S}$ to $\mathcal{A}$ . Let $\mathcal{H}$ be the set of all continuous functions $\pi:\mathcal{S}\rightarrow\bigcup_{s\in\mathcal{S}}G(s)$ satisfying $\pi(s)\in G(s)$ for every $s\in\mathcal{S}$ . We assume that the union of the graphs of all maps $\pi$ in $\mathcal{H}$ is equal to $\text{Graph}(G)$ , that is, $\text{Graph}(G)=\bigcup_{\pi\in\mathcal{H}}\{(s,\pi(s))|s\in\mathcal{S}\}$ . Then, the connection between the virtual unconstrained policy $\pi_{\text{V}}\in\mathcal{F}$ and the real constrained policy $\pi_{\text{R}}\in\mathcal{H}$ can be described as a map $T:\mathcal{H}\rightarrow\mathcal{F}$ . According to the action mapping theorem (Theorem 1 in [14]), the map $T:\mathcal{H}\rightarrow\mathcal{F}$ exists if and only if there exists a continuous map $h:\text{Graph}(G)\rightarrow\mathcal{A}$ such that, for each $s\in\mathcal{S}$ , the map $h_{s}:G(s)\rightarrow\mathcal{A}$ is a homeomorphism of $G(s)$ with $\mathcal{A}$ , where $h_{s}$ is defined as $h_{s}(a)=h(s,a)$ .

In the following, the tire friction constraint in race driving is studied under the framework of AM. The unconstrained virtual policy is constructed as $a_{t}=\pi_{\text{V}}(s_{t})$, where $a_{t}=[a_{x},a_{y}]$ is the virtual action vector, and $a_{x},a_{y}\in[-1,1]$ denote the virtual longitudinal action and virtual lateral action, respectively. This virtual policy is represented by a neural network and optimized through RL algorithms. The constrained real policy is denoted by $u_{t}=\pi_{\text{R}}(s_{t})$, where $u_{t}=[u_{x},u_{y}]$ is the real control input vector, and $u_{x}$, $u_{y}$ are the normalized motor/brake and steering control inputs, respectively. The constrained real policy can be obtained by the policy space mapping $T:\mathcal{H}\rightarrow\mathcal{F}$, which is realized through its corresponding continuous mapping function:

u_{t}=h(\hat{s}_{t},a_{t})

(22)

This continuous mapping function maps the action $a_{t}$ from the unconstrained virtual policy to the real control input $u_{t}\in G(\hat{s}_{t})$ that satisfies the tire friction constraint. $\hat{s}_{t}=[v_{x},\delta]$ is a subset of the state vector that contains only the state variables related to the tire friction constraint. An illustrative example of the action mapping for a longitudinal velocity of $v_{x}=15.4~{}\text{m/s}$ and a front wheel steering angle of $\delta=7.9~{}\text{deg}$ is given in Fig. 4. The boundary of the unconstrained action space $\mathcal{A}$ is shown on the left, and the action $a_{t}$ can be freely selected inside the boundary. The boundary of the constrained control input space $G(\hat{s}_{t})$ is shown on the right, and the shape of the boundary depends on the race car’s current states $\hat{s}_{t}$. Any control input vector located outside the boundary violates the tire friction constraint. In Fig. 4, we give two example action vectors, $a_{t1}=[-0.75,0.25]$ and $a_{t2}=[0.75,-0.75]$, marked with blue arrows. If we directly use those two actions as the control inputs to the real system, $a_{t2}$ satisfies the constraint while $a_{t1}$ fails. Therefore, in this task, an intuitive explanation of the action mapping is that it maps $a_{t}$ to its corresponding $u_{t}$ and guarantees that $u_{t}$ lies inside the boundary of $G(\hat{s}_{t})$.

However, due to the complexity of the vehicle dynamics and constraints, it is rather difficult to give a closed-form expression of the mapping function. Therefore, we provide a numerical approximation method to implement the action mapping mechanism for a complex dynamic system with state-dependent input constraints. The basic idea is to shorten any action vector that overruns the boundary of $G(\hat{s})$ so that it ends on the boundary while keeping the same direction. For convenience in processing the vectors, both action vectors and control input vectors are temporarily converted to polar coordinate form. The action and control inputs are expressed by $[\rho_{a},\vartheta_{a}]$ and $[\rho_{u},\vartheta_{u}]$, respectively. $\rho_{a}$ and $\rho_{u}$ denote the lengths; $\vartheta_{a}$ and $\vartheta_{u}$ denote the directions. Here, we define a control input space boundary function that gives the maximum admissible length along direction $\vartheta$ in the current car state, that is,

\bar{\rho}=\Psi(v_{x},\delta,\vartheta)

(23)

In the following, we present a numerical method to determine the boundary function. Let $\{v_{1},v_{2},\ldots,v_{N_{v}}\}$ be $N_{v}$ evenly spaced velocity values over $[0,v_{\max}]$, where $v_{\max}$ is the maximum speed. Let $\{\delta_{1},\delta_{2},\ldots,\delta_{N_{\delta}}\}$ be $N_{\delta}$ evenly spaced front wheel steering angle values over $[-\delta_{\max},\delta_{\max}]$, where $\delta_{\max}$ is the maximum steering angle. Let $\{\vartheta_{1},\vartheta_{2},\ldots,\vartheta_{N_{\vartheta}}\}$ be $N_{\vartheta}$ evenly spaced action vector directions over $(-\pi,\pi]$. Then, we iterate over all combinations of the race car’s state $(v_{i},\delta_{j})$ with all control input directions and lengths, and check the car’s response to the control inputs using the dynamic model. More specifically, for each car state $(v_{i},\delta_{j})$, we iterate over all action directions. For a given direction $\vartheta_{k}$, we apply the control input vector to the dynamic system while increasing its length $\rho$ until the control input fails to satisfy the tire friction constraint, which determines the maximum length $\bar{\rho}$ at car state $(v_{i},\delta_{j})$ and action direction $\vartheta_{k}$. Based on this sampling method, we obtain a three-dimensional look-up table that represents the boundary function, that is, $\bar{\rho}_{i,j,k}=\Psi(v_{i},\delta_{j},\vartheta_{k})$. To demonstrate the boundary function built from sampling, we visualize it for $v_{x}=15.4~{}\text{m/s}$ in Fig. 5. The interior of the 3D surface is the admissible space for control inputs $[u_{x},u_{y}]$ at different steering angles $\delta$. Fig. 6 gives another view of the control input boundaries at velocity $v_{x}=15.4~{}\text{m/s}$ with steering angle $\delta$ in the range $[4.6,8.1]$ deg.
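The sampling procedure above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `feasible` stands in for the friction check performed with the dynamic model, `toy_feasible` is a made-up constraint (a disc that shrinks with speed), and the step size `d_rho`, the cap `rho_max`, and the tiny grids are hypothetical choices, not values from the paper.

```python
import math


def build_boundary_table(feasible, v_grid, delta_grid, theta_grid,
                         d_rho=0.01, rho_max=math.sqrt(2.0)):
    """Sample the boundary so that table[i][j][k] approximates Psi(v_i, delta_j, theta_k)."""
    table = []
    for v in v_grid:
        plane = []
        for delta in delta_grid:
            row = []
            for theta in theta_grid:
                rho = 0.0
                # Grow the control vector along theta until the next
                # increment would violate the constraint.
                while rho + d_rho <= rho_max and feasible(
                        v, delta,
                        (rho + d_rho) * math.cos(theta),
                        (rho + d_rho) * math.sin(theta)):
                    rho += d_rho
                row.append(rho)
            plane.append(row)
        table.append(plane)
    return table


def toy_feasible(v, delta, u_x, u_y):
    """Hypothetical stand-in for the friction check (not the paper's model)."""
    radius = 1.0 - 0.5 * v / 30.0
    return math.hypot(u_x, u_y) <= radius


v_grid = [0.0, 15.0, 30.0]
delta_grid = [-0.6, 0.0, 0.6]
# Eight evenly spaced directions over (-pi, pi].
theta_grid = [-math.pi + 2.0 * math.pi * (k + 1) / 8 for k in range(8)]
table = build_boundary_table(toy_feasible, v_grid, delta_grid, theta_grid)
```

With the toy constraint, the sampled lengths are close to 1.0 at zero speed and close to 0.5 at 30 m/s, each accurate to within one increment `d_rho`.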

Due to the discrete nature of the boundary function, we cannot directly find the maximum length $\bar{\rho}$ for any $v_{x}$, $\delta$, and $\vartheta_{a}$, which are continuous values. Therefore, we use an off-the-shelf linear multidimensional interpolation method to approximate the maximum length. The approximated boundary function is denoted by $\hat{\rho}=\hat{\Psi}(v_{x},\delta,\vartheta)$, where $\hat{\rho}$ is the approximate value of $\bar{\rho}$. Finally, we summarize the numerical action mapping procedure with the discrete boundary function in Algorithm 1. With the help of this numerical action mapping method, an action that could violate the constraint is mapped to a safe control input just inside the boundary. This method fundamentally prevents the race car from entering uncontrollable states while making full use of the maximum tire-road friction.
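As an illustration of linear multidimensional interpolation on a three-dimensional look-up table, the following sketch implements plain trilinear interpolation in pure Python. The function names (`interp3`, `_bracket`) and the small grids are hypothetical; the paper uses an off-the-shelf routine and does not state which one. Queries outside the grid are clamped to its range here, and the wrap-around of the direction $\vartheta$ at $\pm\pi$ is not handled, both of which are simplifying assumptions.

```python
import bisect


def _bracket(grid, x):
    """Return (i, w) such that x is approximated by (1 - w) * grid[i] + w * grid[i + 1]."""
    x = min(max(x, grid[0]), grid[-1])  # clamp the query to the grid range
    i = min(bisect.bisect_right(grid, x) - 1, len(grid) - 2)
    w = (x - grid[i]) / (grid[i + 1] - grid[i])
    return i, w


def interp3(table, v_grid, d_grid, t_grid, v, d, t):
    """Trilinear interpolation of table[i][j][k] at the point (v, d, t)."""
    i, wv = _bracket(v_grid, v)
    j, wd = _bracket(d_grid, d)
    k, wt = _bracket(t_grid, t)
    total = 0.0
    # Weighted sum over the eight corners of the enclosing grid cell.
    for di, a in ((0, 1.0 - wv), (1, wv)):
        for dj, b in ((0, 1.0 - wd), (1, wd)):
            for dk, c in ((0, 1.0 - wt), (1, wt)):
                total += a * b * c * table[i + di][j + dj][k + dk]
    return total


# Toy table sampled from a linear function, which trilinear
# interpolation reproduces up to rounding.
v_grid = [0.0, 1.0, 2.0]
d_grid = [0.0, 1.0]
t_grid = [0.0, 1.0, 2.0]
table = [[[v + 2.0 * d + 3.0 * t for t in t_grid] for d in d_grid] for v in v_grid]
rho_hat = interp3(table, v_grid, d_grid, t_grid, 0.5, 0.25, 1.5)
```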

Algorithm 1 Numerical Action Mapping with Discrete Boundary Function

1: Load discrete boundary function $\bar{\rho}_{i,j,k}=\Psi(v_{i},\delta_{j},\vartheta_{k})$
2: Input car speed $v_{x}$ and steering angle $\delta$
3: Input unconstrained virtual action $a_{t}=[a_{x},a_{y}]$
4: Convert virtual action to polar coordinate form $[a_{x},a_{y}]\rightarrow[\rho_{a},\vartheta_{a}]$
5: Calculate maximum length $\hat{\rho}=\hat{\Psi}(v_{x},\delta,\vartheta_{a})$ using a multidimensional interpolation method on the discrete boundary function
6: if $\rho_{a}\leq\hat{\rho}$ then
7:   constrained control $u_{t}=[\rho_{a},\vartheta_{a}]$
8: else
9:   constrained control $u_{t}=[\hat{\rho},\vartheta_{a}]$
10: end if
11: Convert to Cartesian coordinate form $u_{t}=[\hat{\rho},\vartheta_{a}]~\text{or}~[\rho_{a},\vartheta_{a}]\rightarrow[u_{x},u_{y}]$
12: Output constrained control $u_{t}=[u_{x},u_{y}]$
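A minimal Python sketch of the steps in Algorithm 1 is given below. The boundary function is passed in as a callable `boundary_fn(v_x, delta, theta)`, which is an assumed interface for the interpolated $\hat{\Psi}$; `toy_boundary` is a hypothetical constant boundary used only to make the example runnable.

```python
import math


def action_map(a_x, a_y, v_x, delta, boundary_fn):
    """Shorten the virtual action to the boundary if needed, keeping its direction."""
    rho_a = math.hypot(a_x, a_y)                 # step 4: polar length
    theta_a = math.atan2(a_y, a_x)               # step 4: polar direction
    rho_hat = boundary_fn(v_x, delta, theta_a)   # step 5: maximum admissible length
    rho_u = min(rho_a, rho_hat)                  # steps 6-10: keep or shorten
    # Step 11: back to Cartesian control inputs [u_x, u_y].
    return rho_u * math.cos(theta_a), rho_u * math.sin(theta_a)


def toy_boundary(v_x, delta, theta):
    """Hypothetical boundary: a circle of radius 0.5 for every state and direction."""
    return 0.5


u_inside = action_map(0.3, 0.0, 15.4, 0.1, toy_boundary)   # already admissible
u_clipped = action_map(0.6, 0.8, 15.4, 0.1, toy_boundary)  # length 1.0, shortened to 0.5
```

In this toy setting the first action is returned unchanged, while the second is scaled from length 1.0 to length 0.5 along the same direction, giving approximately `(0.3, 0.4)`.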

4.3 Implementation of RL Training with AM

The AM mechanism can address the state-dependent constraint for a variety of policy gradient-based RL algorithms with parameterized policy and continuous action space, such as DDPG, TD3, PPO, SAC, etc. In this subsection, we incorporate the numerical approximation method of AM to the TD3 algorithm to train an autonomous race driving policy as an example.

TD3 is a deterministic policy-based reinforcement learning algorithm that employs an actor-critic architecture. The actor and critic functions are represented by fully connected neural networks. A block diagram of the network structure is shown in Fig. 7. The actor network, which represents the unconstrained virtual policy, is denoted as $\pi^{\mu}(s)$, where $\mu$ denotes the network parameters. As illustrated in Fig. 7, the actor network has two hidden layers, and each hidden layer contains $256$ hidden nodes with the ReLU activation function. For the output layer, we use the tanh function to limit the actions between $-1$ and $1$. The actor network’s target network shares the same structure as the actor network, and it is denoted as $\pi^{\mu^{\prime}}(s)$. The state-action value function $Q^{\pi^{\mu}}(s,a)$ is approximated by the critic networks. Different from the conventional DDPG algorithm, the TD3 algorithm introduces a pair of critic networks to mitigate the value function overestimation. The two critic networks are denoted by $Q^{w_{1}}(s,a)$ and $Q^{w_{2}}(s,a)$. They share the same structure, which is shown in Fig. 7. The input of the critic networks is the concatenation of the state vector and the action vector. The hidden layers have the same structure as those of the actor network, while the output layer uses a linear function. The target networks of the critic networks are denoted by $Q^{w^{\prime}_{1}}(s,a)$ and $Q^{w^{\prime}_{2}}(s,a)$.

The critic networks are trained using a batch training method with a replay buffer. For each time step $t$ , a state transition experience $e_{t}=(s_{t},a_{t},r_{t+1},s_{t+1})$ is saved to the replay buffer $\mathcal{D}$ . During each training iteration, a batch of experiences $(s_{i},a_{i},r_{i},s^{\prime}_{i})_{i=1,2,\ldots,N}$ is randomly selected from the replay buffer. Here, $r_{i}$ and $s^{\prime}_{i}$ are the reward received and state reached after taking $a_{i}$ at state $s_{i}$ . Next, two critic networks are trained separately to minimize the TD-error using the loss function:

\mathcal{L}(w_{j})=\frac{1}{N}\sum^{N}_{i=1}\Big[y_{i}-Q^{w_{j}}(s_{i},a_{i})\Big]^{2},

(24)

where $j=1,2$ , and $y_{i}$ is the target value from the target critic networks, which is,

y_{i}=r_{i}+\gamma\min_{j=1,2}Q^{w^{\prime}_{j}}(s^{\prime}_{i},\pi^{\mu^{\prime}}(s^{\prime}_{i})+\epsilon).

(25)

The target value for the critic networks is taken as the smaller of the two target network outputs, which reduces the overestimation of the state-action value. $\epsilon$ is a small noise term sampled from a clipped Gaussian distribution $\text{clip}(\mathcal{N}(0,\sigma_{\epsilon}^{2}),-c,c)$, which is used to avoid overfitting. After the target values are determined, the gradient is given by:

\nabla_{w_{j}}\mathcal{L}(w_{j})=-\frac{2}{N}\sum^{N}_{i=1}\Big[y_{i}-Q^{w_{j}}(s_{i},a_{i})\Big]\nabla_{w_{j}}Q^{w_{j}}(s_{i},a_{i}).

(26)

The parameters of two critic networks are adjusted according to:

w_{j}\leftarrow w_{j}-\alpha_{w}\nabla_{w_{j}}\mathcal{L}(w_{j}),

(27)

where $\alpha_{w}$ is a small updating rate.
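To make the critic update concrete, the sketch below computes the clipped double-Q target and performs one gradient-descent step on the mean squared TD error. It is a toy illustration, not the paper's implementation: states and actions are scalars, the critic is a hypothetical linear model $Q^{w}(s,a)=w_{0}s+w_{1}a$, and the clipping bound `c=0.5` is an assumed value because the text does not specify $c$.

```python
import random


def td3_target(r, s_next, actor_target, q1_target, q2_target,
               gamma=0.99, sigma_eps=0.2, c=0.5, rng=random):
    """Target value: reward plus the discounted minimum of the two target critics."""
    eps = max(-c, min(c, rng.gauss(0.0, sigma_eps)))  # clipped smoothing noise
    a_next = actor_target(s_next) + eps
    return r + gamma * min(q1_target(s_next, a_next), q2_target(s_next, a_next))


def td_loss(w, batch, targets):
    """Mean squared TD error for the linear critic Q(s, a) = w[0] * s + w[1] * a."""
    return sum((y - (w[0] * s + w[1] * a)) ** 2
               for (s, a), y in zip(batch, targets)) / len(batch)


def critic_step(w, batch, targets, alpha_w=0.1):
    """One gradient-descent step on td_loss."""
    n = len(batch)
    grad = [0.0, 0.0]
    for (s, a), y in zip(batch, targets):
        err = y - (w[0] * s + w[1] * a)
        grad[0] += -2.0 * err * s / n  # d(err^2)/dw0
        grad[1] += -2.0 * err * a / n  # d(err^2)/dw1
    return [w[0] - alpha_w * grad[0], w[1] - alpha_w * grad[1]]


# Noise disabled (sigma_eps=0) so that the target is deterministic here.
y = td3_target(1.0, 0.0, lambda s: 0.0, lambda s, a: 2.0, lambda s, a: 1.0,
               sigma_eps=0.0)

batch = [(1.0, 0.5), (0.5, -0.5)]   # made-up (state, action) pairs
targets = [1.0, 0.0]                # made-up target values
w_old = [0.0, 0.0]
w_new = critic_step(w_old, batch, targets)
```

The descent step moves the weights against the gradient of the loss, so the TD loss evaluated at `w_new` is smaller than at `w_old` for this batch.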

The policy’s performance is assessed based on the expected return starting from the initial time step, that is, $J(\mu)=\mathbb{E}_{\pi^{\mu}}[R_{1}]$ . According to the deterministic policy gradient theorem [44], the gradient of the policy performance is given by:

\nabla_{\mu}J(\mu)=\mathbb{E}_{\pi^{\mu}}\Big[\nabla_{\mu}\pi^{\mu}(s)\nabla_{a}Q^{\pi^{\mu}}(s,a)|_{a=\pi^{\mu}(s)}\Big].

(28)

Then, the actor network is trained using the approximated gradient of the performance function based on the critic network $Q^{w_{1}}(s,a)$ and the same batch of transition experiences. The approximation of the policy gradient is given by:

\nabla_{\mu}J(\mu)\approx\frac{1}{N}\sum^{N}_{i=1}\nabla_{\mu}\pi^{\mu}(s_{i})\nabla_{a}Q^{w_{1}}(s_{i},a_{i})|_{a_{i}=\pi^{\mu}(s_{i})}.

(29)

The virtual control policy is improved by updating the parameter $\mu$ using gradient ascent with a small updating rate $\alpha_{\mu}$ :

\mu\leftarrow\mu+\alpha_{\mu}\nabla_{\mu}J(\mu)

(30)
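The actor update can be illustrated with a one-parameter toy problem. In the sketch below, the policy is $\pi^{\mu}(s)=\tanh(\mu s)$ and the critic is a hypothetical function $Q(s,a)=-(a-0.5)^{2}$ whose gradient with respect to the action is known in closed form; both are assumptions chosen so the chain rule in the sampled policy gradient can be written out by hand.

```python
import math


def policy(mu, s):
    """Toy deterministic policy with a tanh output, as in the actor's output layer."""
    return math.tanh(mu * s)


def q_value(s, a):
    """Hypothetical critic that prefers actions near 0.5 in every state."""
    return -(a - 0.5) ** 2


def dq_da(s, a):
    """Gradient of the toy critic with respect to the action."""
    return -2.0 * (a - 0.5)


def actor_step(mu, states, alpha_mu=0.1):
    """One gradient-ascent step using the sampled deterministic policy gradient."""
    grad = 0.0
    for s in states:
        a = policy(mu, s)
        dpi_dmu = s * (1.0 - a * a)  # derivative of tanh(mu * s) w.r.t. mu
        grad += dpi_dmu * dq_da(s, a) / len(states)
    return mu + alpha_mu * grad


def mean_q(mu, states):
    """Average critic value of the policy's actions over the batch."""
    return sum(q_value(s, policy(mu, s)) for s in states) / len(states)


states = [0.5, 1.0, 2.0]  # made-up batch of scalar states
mu_old = 0.0
mu_new = actor_step(mu_old, states)
```

After one ascent step the average critic value over the batch increases, which is the behaviour the update rule is designed to produce.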

After each training iteration, the target networks are soft-updated with small update rates $\beta_{w}$ and $\beta_{\mu}$ :

w^{\prime}_{j}\leftarrow\beta_{w}w_{j}+(1-\beta_{w})w^{\prime}_{j}

(31)

\mu^{\prime}\leftarrow\beta_{\mu}\mu+(1-\beta_{\mu})\mu^{\prime}

(32)
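The soft update is a simple exponential moving average of the online parameters. A minimal sketch, with parameters stored as plain lists of floats (an assumption made for illustration only):

```python
def soft_update(target, online, beta):
    """Move each target parameter a fraction beta toward its online counterpart."""
    return [beta * w + (1.0 - beta) * w_t for w, w_t in zip(online, target)]


# With beta = 0.005, the target moves 0.5% of the way toward the online value.
new_target = soft_update([0.0, 1.0], [1.0, 1.0], beta=0.005)
```

A small rate such as $0.005$ makes the target networks change slowly, which keeps the target values in the critic loss from shifting abruptly between iterations.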

The overall training procedure of the race driving policy with TD3-AM is summarized in Algorithm 2. In line 8, Gaussian noise $n_{t}\sim\mathcal{N}(0,\sigma^{2}_{\mu})$ is added to the virtual action to promote exploration. In lines 16-20, a delayed policy update is used to further improve the stability of the training process: the actor network and the target networks are updated only once every $T_{\text{delay}}$ updates of the critic networks. Furthermore, the policy is saved and evaluated at a fixed interval of training iterations.

Algorithm 2 Race Driving Policy Training with TD3-AM

1: Randomly initialize the parameters of the actor network, twin critic networks, and their target networks
2: Initialize replay buffer $\mathcal{D}$
3: Load race driving simulation environment
4: for episode $=1$ to MaxEpisode do
5:   Set initial car state
6:   Observe initial state $s_{1}$
7:   for time step $t=1$ to MaxStep do
8:     Generate virtual action $a_{t}=\pi^{\mu}(s_{t})+n_{t}$
9:     Map to real control input $u_{t}=h(\hat{s}_{t},a_{t})$
10:     Apply control input $u_{t}$ to car dynamic model
11:     Observe new state $s_{t+1}$ and receive reward $r_{t+1}$
12:     Store the transition $(s_{t},a_{t},r_{t+1},s_{t+1})$ in $\mathcal{D}$
13:     Select a batch of $N$ experiences randomly from $\mathcal{D}$
14:     Calculate target values according to (25)
15:     Update critic networks according to (26) and (27)
16:     if $t \bmod T_{\text{delay}}=0$ then
17:       Update actor network according to (29) and (30)
18:       Update critic target networks according to (31)
19:       Update actor target network according to (32)
20:     end if
21:     if car drives off-track or wrong-way then
22:       break
23:     end if
24:   end for
25: end for
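The control flow of the inner loop in Algorithm 2 can be sketched as follows. Every callable here (`env_step`, `actor`, `action_map`, `update_critics`, `update_actor_and_targets`) is a hypothetical placeholder supplied by the caller, and the toy stubs below only count calls; the sketch shows the ordering of the steps and the delayed update, not an actual learner. Note that the replay buffer stores the virtual action $a_{t}$ rather than the mapped control input $u_{t}$, as in line 12.

```python
import random
from collections import deque


def run_episode(s, env_step, actor, action_map, buffer, update_critics,
                update_actor_and_targets, max_steps, batch_size=4,
                t_delay=2, sigma_mu=0.1, rng=random):
    """Run one episode of the inner loop and return the number of steps taken."""
    t = 0
    for t in range(1, max_steps + 1):
        a = [p + rng.gauss(0.0, sigma_mu) for p in actor(s)]  # line 8
        u = action_map(s, a)                                  # line 9
        s_next, r, done = env_step(s, u)                      # lines 10-11
        buffer.append((s, a, r, s_next))                      # line 12
        batch = rng.sample(list(buffer), min(batch_size, len(buffer)))  # line 13
        update_critics(batch)                                 # lines 14-15
        if t % t_delay == 0:                                  # line 16
            update_actor_and_targets(batch)                   # lines 17-19
        if done:                                              # lines 21-23
            break
        s = s_next
    return t


calls = {"critic": 0, "actor": 0}


def toy_env_step(s, u):
    """Made-up dynamics: never terminates, constant reward."""
    return [s[0] + 0.01 * u[0], s[1] + 0.01 * u[1]], 1.0, False


def count_critic(batch):
    calls["critic"] += 1


def count_actor(batch):
    calls["actor"] += 1


buffer = deque(maxlen=1000)
steps = run_episode([0.0, 0.0], toy_env_step, lambda s: [0.0, 0.0],
                    lambda s, a: a, buffer, count_critic, count_actor,
                    max_steps=10)
```

With `t_delay=2` and ten steps, the critics are updated ten times and the actor and target networks five times.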

5 Simulations and Results

In this section, the proposed RL-based race driving strategies are evaluated in our custom-built race simulation environment. We first introduce the simulation environment and training details. Then, the evaluation results are presented and discussed.

5.1 Simulation Environment and Training Details

The simulation environment is developed following the vehicle model given in Section 3. The fourth-order Runge-Kutta (RK4) method is used to numerically solve the differential equations. The simulation time step is $0.01$ s. The car model used in the simulation is an all-electric mid-size sedan. The physical parameters of the car model are given in Table 1.
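For reference, a single RK4 step for a system $\dot{x}=f(t,x)$ can be sketched as below. The test system $\dot{x}=-x$ is a stand-in chosen because its exact solution is known; it is not the vehicle model, and the function name `rk4_step` is hypothetical.

```python
import math


def rk4_step(f, x, t, h):
    """Advance the state list x by one classical fourth-order Runge-Kutta step of size h."""
    k1 = f(t, x)
    k2 = f(t + h / 2.0, [xi + h / 2.0 * ki for xi, ki in zip(x, k1)])
    k3 = f(t + h / 2.0, [xi + h / 2.0 * ki for xi, ki in zip(x, k2)])
    k4 = f(t + h, [xi + h * ki for xi, ki in zip(x, k3)])
    return [xi + h / 6.0 * (a + 2.0 * b + 2.0 * c + d)
            for xi, a, b, c, d in zip(x, k1, k2, k3, k4)]


def decay(t, x):
    """Toy dynamics dx/dt = -x with exact solution x(t) = x(0) * exp(-t)."""
    return [-x[0]]


x, t, h = [1.0], 0.0, 0.01  # same 0.01 s step as the simulator
for _ in range(100):        # integrate over one second
    x = rk4_step(decay, x, t, h)
    t += h
```

After one simulated second the numerical state agrees closely with $e^{-1}$, which gives a feel for the accuracy of RK4 at this step size on a smooth system.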

Table 1: Car Model Parameters

Parameter	Value	Unit
Vehicle mass $m$	$1860$	kg
Front axle distance $l_{f}$	$1.17$	m
Rear axle distance $l_{r}$	$1.77$	m
Tire rolling radius $R_{w}$	$0.31$	m
Tire corner stiffness $C_{\alpha_{f}},C_{\alpha_{r}}$	$54,500$	N/rad
Tire rolling resistance coefficient $f_{r}$	$0.015$	-
Maximum steering angle	$35$	deg
Yaw moment of inertia $I_{z}$	$4000$	kg $\cdot\text{m}^{2}$
Aerodynamic drag coefficient $C_{d}$	$0.3$	-
Density of air $\rho_{A}$	$1.2258$	kg/ $\text{m}^{3}$
Vehicle frontal area $A_{f}$	2.05	$\text{m}^{2}$
Maximum motor power $P_{\text{max}}$	$125$	kW
Motor torque coefficient $K_{m}$	$1,550$	N $\cdot\text{m}$
Braking force coefficient $K_{b}$	$16,422$	N
Maximum friction coefficient $\mu_{\max}$	$1.15$	-
Acceleration due to gravity $g$	$9.81$	$\text{m}/\text{s}^{2}$

We build two race tracks for race driving policy training and evaluation. The layout of track-A is given in Fig. 8(a). It is a simple testing track with only five corners. The length and width of track-A are $860$ m and $20$ m, respectively. The layout of track-B is given in Fig. 8(b). The layout is modeled after the Ruisi Circuit located in Beijing, China. Track-B has ten corners. The total length is $1400$ m and the width is $10$ m. In Fig. 8, the finish line and running direction of the tracks are marked with a red line across the track and a white arrow, respectively.

A typical race track can be regarded as a sequence of basic corners connected together. Therefore, the driving skill of cornering is the key to minimizing the lap time. As shown in Fig. 9, a basic corner can be divided into three parts: in-straight (red part), curve (blue part), and out-straight (green part). The solid green line indicates the racing line, which is the theoretical fastest line through the corner. Obviously, the racing line significantly reduces the tightness of the corner by using the in-straight and the out-straight parts, which allows for the highest speed possible to run through this corner. There are four key points on the racing line. The braking point is the position to start applying the brake before the corner. The turn-in point is the position to start steering into the corner. The apex is located on the inside of a corner, and it is the aiming point after the turn-in point. The apex is also the position to start acceleration. The exit point is where the car once again reaches the outside of the corner, and it is also the aiming point after the car passes the apex. The ideal racing line and key points depend on the curvature of the corner, the condition of the previous corner and the following corner, and the handling performance of the car being driven.

We conduct the following simulation experiments on a workstation running Ubuntu 20.04 with an Intel Core i9-13900K CPU, 32 GB of RAM, and an NVIDIA GeForce RTX 4090 GPU. The neural networks for the RL algorithms are built with the PyTorch framework and run on the GPU.

The driving policy is trained following Algorithm 2. To implement the numerical AM mechanism, we first determine the discrete boundary function based on the race car’s dynamics with the friction constraint. The numbers of discretization points are set as $N_{v}=N_{\delta}=N_{\vartheta}=200$. The discretization step sizes for speed, steering angle, and action vector angle are $0.15$ m/s, $0.35$ deg, and $1.8$ deg, respectively. Then, the AM mechanism is applied during policy training following Algorithm 1. The training parameters are listed in Table 2. At the beginning of each training episode, we place the car at a random position in the straight part of the track with a random speed $v_{x}\in[0,30]$ m/s. The initial $\omega$, $\phi$, and $\delta$ are all zero. For each time step, the states are normalized to $[0,1]$ or $[-1,1]$ with respect to their minimum and maximum values. For the forward-observation feature state $\textbf{V}_{\text{FO}}$, we utilize $12$ forward observation points at varying distances: $d_{i}=\{10,20,30,40,60,80,100,120,140,160,180,200\}$ meters. The maximum duration of one episode is $10,000$ steps, equivalent to $100$ seconds.

Table 2: Training Parameters

Parameter	Value
Discount factor $\gamma$	$0.99$
Initial learning rate for critic network $\alpha_{w}$	$0.0003$
Initial learning rate for actor network $\alpha_{\mu}$	$0.0003$
Updating rate for target critic network $\beta_{w}$	$0.005$
Updating rate for target actor network $\beta_{\mu}$	$0.005$
Batch size $N$	$256$
Replay buffer size $M$	$1,000,000$
Exploration noise standard deviation $\sigma_{\mu}$	$0.1$
Policy smoothing noise standard deviation $\sigma_{\epsilon}$	$0.2$
Update delay step $T_{\text{delay}}$	$2$

To further explore the capabilities of the proposed TD3-AM algorithm, we introduce several comparative approaches, including the conventional TD3 algorithm, the proximal policy optimization (PPO) algorithm, and the safety layer technique (SL). Given that TD3 operates as an off-policy RL algorithm, we opt for the on-policy PPO algorithm as an additional baseline RL algorithm, and we also apply AM to the PPO algorithm. In our experiments, the PPO algorithm is implemented with the Stable-Baselines3 library [45]. The structure of the actor and critic networks and the batch size used for PPO are the same as the TD3-AM. Other training parameters are also carefully tuned. The safety layer is a state-of-the-art safe exploration technique for RL with state constraints [35]. As we introduced before, SL shares a similar idea with AM, which is adjusting the original action from the actor network to satisfy the constraints. Therefore, it is also compatible with both on-policy and off-policy RL algorithms. In the following experiments, SL is built following the procedures given in [35]. We first collect transition data from the simulation environment where the race car is randomly initialized and given random actions. The transitions of constraint violations are specially marked. Then, the immediate-constraint function of SL is approximated with a two-hidden-layer neural network based on the collected data through supervised learning. After that, we apply SL to both TD3 and PPO algorithms.

5.2 Evaluation Results on Track-A

For the evaluation on track-A, we train and test all six approaches mentioned above: 1) TD3; 2) TD3-SL; 3) TD3-AM; 4) PPO; 5) PPO-SL; 6) PPO-AM. Each approach is trained for $20$ M iterations. After every $10$ k iterations, the driving policy is evaluated in an evaluation episode where the race car starts from the finish line and runs for $100$ s. The accumulated reward and lap time of the episode are recorded. For each approach, we perform $10$ independent training trials starting with randomly initialized actor and critic networks.

Figure 10 (first row) displays the learning progress, as indicated by the average accumulated rewards of the evaluation episodes. The solid lines and shaded areas depict the average values and standard deviations of the rewards for the identical evaluation episode across 10 trials. The curves show that the learning progress of the TD3-based approaches is more stable and consistent, while the PPO-based approaches are not very stable. Moreover, the TD3-AM approach achieves the highest average reward.

In motorsport, the lap time of a ‘flying lap’ is a major criterion for evaluating a race car’s performance and the skill of a race driver. The ‘flying lap’ starts when the car crosses the finish line of the previous lap at high speed. In our evaluation episode, if the driving policy finishes at least two laps without any fault, the second lap can be regarded as a ‘flying lap’ and the episode is identified as a ‘success episode’. To demonstrate the performance improvement of the driving policies during the learning process, we aggregate the results of every $10$ evaluation episodes from $10$ trials as a group. The lap time and success rate of all groups are given in the second and third rows of Fig. 10. The median values and the max/min values of the lap time in each evaluation group are shown by solid lines and shaded areas, respectively. If there is no success episode in an evaluation group, the lap time is not shown. The success rate of the TD3-AM policy finally reaches $90\%$, while those of the other policies are all below $50\%$. The PPO-based policies obtain lower success rates than the TD3-based policies, and they also take more training iterations before achieving a success episode. However, in terms of lap time, the PPO and PPO-SL policies obtain shorter times than their corresponding TD3 and TD3-SL policies. More importantly, the driving policies with AM achieve even shorter lap times than the other policies.
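The per-group statistics described above (success rate, median lap time, and min/max lap time) can be computed as in the following sketch. The representation of a failed episode as `None` and the lap times in `group` are made-up illustration choices, not data from the experiments.

```python
from statistics import median


def summarize_group(lap_times):
    """Return (success rate, median, min, max) for one evaluation group.

    lap_times holds the flying-lap time in seconds for each episode,
    or None when the episode was not a success episode.
    """
    ok = [t for t in lap_times if t is not None]
    rate = len(ok) / len(lap_times)
    if not ok:
        # No success episode in this group: no lap time is reported.
        return rate, None, None, None
    return rate, median(ok), min(ok), max(ok)


group = [37.2, None, 36.9, 38.1, None, None, 37.5, None, None, None]  # made-up values
summary = summarize_group(group)
```

For this made-up group, four of ten episodes succeed, so the success rate is 0.4 and the median is taken over the four recorded lap times only.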

Furthermore, we introduce other criteria for comparison, including 1) the best lap time achieved in all evaluation episodes; 2) the average cumulative number of episodes terminated due to the friction constraint violation; and 3) training speed in seconds per 1k iterations. These values are listed in Table 3. From the best lap time, we find that the TD3 and PPO approaches improve significantly after using AM. The best lap times of TD3-AM and PPO-AM are $22\%$ and $5\%$ shorter than those of their corresponding baselines. In comparison, the improvements from introducing SL are relatively smaller. The TD3-AM driving policy achieves the best lap time of all policies. From the number of constraint violations, we observe that introducing SL reduces the number of friction constraint violations by about half, while the AM mechanism avoids violations entirely. The training speed of the PPO-based approaches is approximately three times faster than that of the TD3-based approaches. Applying AM or SL increases the training time by about $0.4$ s per $1$ k iterations for TD3 and by about $0.2$ s per $1$ k iterations for PPO. Note that processing the action mapping or safety layer for a single step takes less than $0.2$ ms.

Although both SL and AM are designed to address the constraint, only the AM mechanism achieves zero violation in all episodes. The main reason is that the numerical AM mechanism is more efficient at handling constraints with complex input coupling and nonlinear dynamics. From the vehicle model given above, the friction constraint function includes nonlinear couplings among speed, steering angle, and control inputs. To deal with the constraint, the SL technique approximates the friction constraint function with a linear model with respect to $g(s_{c})^{T}[a_{x},a_{y}]$ , where the coefficient $g(s_{c})$ is a function of car motion states $s_{c}$ , extracted with a pre-trained neural network. Then, this constraint function is solved via an analytical optimization method. This method works well for decoupled systems. However, when there is coupling among inputs, this form of linear approximation does not fit the actual dynamics. We observe that the fitting accuracy of network $g(s_{c})$ is relatively low from the pre-training, and that makes some of the actions fail to satisfy the constraint. Nevertheless, SL still works in some areas of the state space and effectively reduces the number of constraint violations.

Table 3: Best lap time, constraint violation, training speed comparison of six approaches in track-A

Driving Policy	Best Lap Time (s)	Number of Constraint Violation	Training Speed (s/1k iteration)
TD3	47.28	252	1.186
TD3-SL	44.26	123	1.572
TD3-AM	36.94	0	1.579
PPO	39.67	304	0.441
PPO-SL	39.54	135	0.644
PPO-AM	37.76	0	0.620
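The relative lap-time reductions quoted for AM and SL follow directly from the best lap times in Table 3. The short check below reproduces them; the helper name `reduction_pct` is introduced here for illustration.

```python
# Best lap times in seconds, copied from Table 3.
best_lap = {"TD3": 47.28, "TD3-SL": 44.26, "TD3-AM": 36.94,
            "PPO": 39.67, "PPO-SL": 39.54, "PPO-AM": 37.76}


def reduction_pct(baseline, variant):
    """Percentage by which the variant's best lap time is shorter than the baseline's."""
    return 100.0 * (best_lap[baseline] - best_lap[variant]) / best_lap[baseline]


td3_am = reduction_pct("TD3", "TD3-AM")  # about 21.9
ppo_am = reduction_pct("PPO", "PPO-AM")  # about 4.8
td3_sl = reduction_pct("TD3", "TD3-SL")  # about 6.4
ppo_sl = reduction_pct("PPO", "PPO-SL")  # about 0.3
```

Rounded to whole percentages, the AM reductions are 22% for TD3 and 5% for PPO, and the SL reductions are smaller in both cases.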

The trajectories of the fastest flying lap by all six driving policies are shown in Fig. 11. The speed is illustrated by the color of the trajectory. The speed at the exit point of corners C1-C5 is marked on the figure. From the trajectories and speed, we find that all driving policies have mastered some basic skills of race driving. The policies have learned to decelerate upon approaching a corner and accelerate after exiting a corner. The policies also have learned to use the racing line to drive at a higher speed through a corner. Comparing these trajectories reveals that the PPO-AM and TD3-AM policies exhibit smoother and more extended trajectories than others. The speeds at the exit points are consistently higher than those achieved by other policies. Both PPO-AM and TD3-AM policies showcase excellent driving skills, with highly similar trajectories.

To comprehensively demonstrate the driving behavior of the learned policies, we show the speed, throttle/brake control, steering control, and resultant acceleration in Fig. 12. Note that the $x$ -axis denotes the track distance, which is the distance from the finish line to the vehicle’s position along the centerline of the race track. The exit points of corners C1-C5 are marked in Fig. 12. From the speed and throttle/brake curves, the TD3-AM policy tends to maintain a proper speed in corners. Differently, the TD3 policy continuously applies the brake, which makes the car run unnecessarily slow in corners. Furthermore, in corners C2, C3, and C4, the TD3 policy outputs fast-changing throttle/brake and steering control signals, which is harmful to the car’s balance in real driving. In comparison, the control signals from the TD3-AM policy are much smoother. From the resultant acceleration, both driving policies satisfy the constraint of friction, which is given by a red horizontal line ( $1.15g$ ) in the figure. In each corner, the acceleration of the TD3-AM policy is closer to the limit. That means the TD3-AM policy could better maximize the maneuverability of the vehicle.

For conventional RL-based approaches (TD3 and PPO), constraint violations are penalized in the reward function, discouraging control policies from approaching the friction limit. The reward-shaping solution transforms the original constrained optimization problem into a multi-objective optimization problem that is not equivalent to it: the constraint conditions become penalty terms, which makes the control policy conservative. In comparison, the AM mechanism directly prevents the friction limit from being exceeded at the level of the action space, without changing the objective of the optimization problem. Our method reconciles the conflict between the need for higher speed and the need to satisfy the friction constraint in turning maneuvers. In summary, the AM mechanism significantly improves the efficiency of RL algorithms in optimizing the race driving policy.

5.3 Evaluation Results on Track-B

The capability of the proposed TD3-AM driving policy is further evaluated on track-B. The generalization ability of the AM mechanism is also demonstrated and discussed. The corners in track-B are more complex and varied than those in track-A, and the track is only half as wide as track-A. The six approaches are trained and evaluated on track-B with identical training configurations, except that each approach is trained for 50M iterations instead of 20M. The learned driving policies are also evaluated every 1k iterations. However, both the TD3 and TD3-SL approaches fail to produce a driving policy capable of completing a lap. The best lap times for the other four policies are listed in Table 4. Similar to the results on track-A, the introduction of the AM mechanism also significantly enhances performance on track-B. Given the similarities in driving behavior between PPO-AM and TD3-AM, we focus on demonstrating the TD3-AM driving policy in the following.

Table 4: Best lap time comparison in Track-B

Driving Policy	Best Lap time (min:sec)
PPO	1:26.23
PPO-SL	1:18.38
PPO-AM	1:09.55
TD3-AM	1:04.16

The trajectory of the fastest flying lap by the trained TD3-AM driving policy is shown in Fig. 13. The color of the trajectory illustrates the speed. The speeds at the exit points of corners C1-C10 are marked in the figure. Although track-B is much more difficult than track-A, the TD3-AM algorithm still has successfully mastered the driving skills for track-B. In consecutive corners C3-C7 and hairpin corners C8 and C9, the driving agent utilizes the full width of the track to minimize corner curvature, resulting in a remarkably smooth trajectory. The overall trajectory closely aligns with the theoretical optimal race line.

In real race situations, the friction constraint changes frequently due to tire wear, tire replacement, or a wet track. If the maximum friction decreases while the original driving policy is in use, there is a higher risk of violating the friction constraint, particularly in sharp corners. The proposed TD3-AM driving policy can easily adapt to different friction constraints by adjusting the friction limit in the action mapping function. To demonstrate this feature, we compare the TD3-AM driving policy using two action mapping functions where the friction limits $\mu_{\max}$ are $1.15g$ and $1.0g$, respectively. We use the most difficult part of track-B, consecutive corners C3-C7, to demonstrate the performance. The trajectories are compared in Fig. 14. The network output action $a_{t}$, the throttle/brake and steering control signals $u_{t}$, and the resultant acceleration are compared in Fig. 15, where the exit points of corners C3-C7 are marked. The trajectories of the driving policies with the constraints of $1.15g$ and $1.0g$ are similar, yet the speed of the policy with the $1.0g$ limit is clearly lower. Since the maximum friction is lower, driving through a corner at a lower speed is the only way to avoid losing grip.

From the resultant acceleration curves in Fig. 15, both driving policies satisfy their corresponding constraints. The acceleration curves are very close to the upper limit in the corners, which means the policies make the most of the tire grip to pass the corners at high speeds. That ability is mostly attributed to the AM mechanism. Specifically, from Fig. 15, when the resultant acceleration approaches the limit in sharp corners, the constrained control input space becomes smaller. At this stage, whenever the network outputs an action that is outside the constrained space, the action mapping function gives the corresponding control signal that barely complies with the constraint. For example, in corners C3 and C4, the actor networks of both policies output the full-throttle action, while the acceleration curves indicate that turning at the current speed is very close to the friction limit. Then, the action mapping function gives a lower throttle signal to maintain the maximum possible speed through the corner. It should be noted that directly using an action mapping function with a lower friction limit can only guarantee that the driving policy satisfies the new friction constraint. The new driving policy cannot achieve the same near-optimal driving performance as the original one, and it may not successfully finish a lap if the new friction limit is much lower. However, the new driving policy can be quickly improved and reach the near-optimal level with a few episodes of training in the lower friction situation. In this way, with the AM mechanism, the learned driving policy can quickly generalize to lower friction conditions without re-training the whole driving policy.

Track-B is modeled after a real race track on which professional drivers have tested many cars. Although the driving policy is evaluated on track-B in simulation, we can make a rough comparison between our learned driving policy and real professional drivers. We use the flying lap time data from [46] and select six cars with similar lap times for comparison. The cars’ basic performance parameters and lap times are listed in Table 5. The maximum power (Max Power) and the 0 to 100 km/h acceleration time (Acc. Time) indicate the car’s acceleration performance; the 100 km/h to 0 braking distance (Brake Dist) is associated with the maximum friction force. These data show that, although our test model has no advantage in the basic performance parameters, the learned driving policy still achieves a comparable lap time. Given the disparities between the simulation environment and the real world, the learned policy and the professional drivers cannot be compared directly. Nevertheless, the proposed TD3-AM algorithm for race driving has demonstrated its potential to acquire professional-level driving skills.

Table 5: Flying Lap Time Comparison on Ruisi Circuit (track-B)

Make/Model	Max Power (kW)	Acc. Time (s)	Brake Dist (m)	Lap Time (min:sec)
BMW 325Li (2021)	135	7.90	38.7	1:03.28
Honda Civic (2022)	134	8.13	38.2	1:03.91
VW Golf 8 (2021)	110	8.08	35.9	1:04.12
Our Test Model	125	8.80	42.5	1:04.16
Ford Focus (2019)	135	8.90	37.1	1:04.30
KIA K5 (2020)	176	7.48	34.1	1:05.10
Mazda Atenza (2020)	141	8.03	37.0	1:05.40
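The lap times in Table 5 are easier to compare as gaps in seconds. The short sketch below converts the min:sec entries of Table 5 to seconds and reports each car's gap to our test model; the helper name `lap_seconds` is hypothetical, and the data are copied from the table without modification.

```python
def lap_seconds(lap):
    """Convert a 'min:sec' lap time string such as '1:04.16' to seconds."""
    minutes, seconds = lap.split(":")
    return int(minutes) * 60 + float(seconds)


# Lap times copied from Table 5.
LAP_TIMES = {
    "BMW 325Li (2021)": "1:03.28",
    "Honda Civic (2022)": "1:03.91",
    "VW Golf 8 (2021)": "1:04.12",
    "Our Test Model": "1:04.16",
    "Ford Focus (2019)": "1:04.30",
    "KIA K5 (2020)": "1:05.10",
    "Mazda Atenza (2020)": "1:05.40",
}

reference = lap_seconds(LAP_TIMES["Our Test Model"])

# Negative gap: the car is faster than the test model; positive: slower.
gaps = {
    name: round(lap_seconds(lap) - reference, 2)
    for name, lap in LAP_TIMES.items()
    if name != "Our Test Model"
}

if __name__ == "__main__":
    for name, gap in sorted(gaps.items(), key=lambda item: item[1]):
        print(f"{name:22s} {gap:+.2f} s")
```

Under this comparison, three of the six cars are faster than the test model and three are slower, with all gaps within about 1.3 s.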

6 Conclusions and Future Work

In this paper, we present a novel numerical AM-RL framework for autonomous race driving. The proposed numerical AM mechanism enables the RL-based driving agent to operate the vehicle safely within the friction limit while making full use of its handling capability. Leveraging the proposed TD3-AM approach, we have successfully trained a race driving agent with professional-level skills. The simulation results highlight the improved race driving performance of the proposed approach and its generalization capability to different friction conditions, representing a significant advancement in addressing the friction constraint in autonomous racing. The lap time of the TD3-AM driving policy is 22% shorter than that of the baseline TD3 driving policy, and its success rate is 90%, which is much higher than that of the baseline policies.

In future work, we aim to assess the practical applicability of our AM-RL framework through real-world car race experiments. Because an accurate dynamic model of the vehicle is unavailable, we intend to first establish a conservative friction constraint function using approximate physical parameters, and then progressively improve its accuracy through online learning. We will also explore methods to handle sensor noise and other uncertainties in real-world applications. Additionally, we plan to investigate the capabilities of AM-RL in addressing other nonlinear control problems with constraints.

References

[1] S. Grigorescu, B. Trasnea, T. Cocias, G. Macesanu, A survey of deep learning techniques for autonomous driving, Journal of Field Robotics 37 (3) (2020) 362–386.
[2] E. Yurtsever, J. Lambert, A. Carballo, K. Takeda, A survey of autonomous driving: Common practices and emerging technologies, IEEE Access 8 (2020) 58443–58469.
[3] L. Liu, S. Lu, R. Zhong, B. Wu, Y. Yao, Q. Zhang, W. Shi, Computing systems for autonomous driving: State of the art and challenges, IEEE Internet of Things Journal 8 (8) (2020) 6469–6486.
[4] L. Chen, Y. Li, C. Huang, B. Li, Y. Xing, D. Tian, L. Li, Z. Hu, X. Na, Z. Li, et al., Milestones in autonomous driving and intelligent vehicles: Survey of surveys, IEEE Transactions on Intelligent Vehicles 8 (2) (2022) 1046–1056.
[5] Y. Li, J. Ibanez-Guzman, Lidar for autonomous driving: The principles, challenges, and trends for automotive lidar and perception systems, IEEE Signal Processing Magazine 37 (4) (2020) 50–61.
[6] D. Feng, C. Haase-Schütz, L. Rosenbaum, H. Hertlein, C. Glaeser, F. Timm, W. Wiesbeck, K. Dietmayer, Deep multi-modal object detection and semantic segmentation for autonomous driving: Datasets, methods, and challenges, IEEE Transactions on Intelligent Transportation Systems 22 (3) (2020) 1341–1360.
[7] H. Fujiyoshi, T. Hirakawa, T. Yamashita, Deep learning-based image recognition for autonomous driving, IATSS research 43 (4) (2019) 244–252.
[8] L. Claussmann, M. Revilloud, D. Gruyer, S. Glaser, A review of motion planning for highway autonomous driving, IEEE Transactions on Intelligent Transportation Systems 21 (5) (2020) 1826–1848. doi:10.1109/TITS.2019.2913998.
[9] J. Chen, W. Zhan, M. Tomizuka, Autonomous driving motion planning with constrained iterative LQR, IEEE Transactions on Intelligent Vehicles 4 (2) (2019) 244–254.
[10] A. Amini, I. Gilitschenski, J. Phillips, J. Moseyko, R. Banerjee, S. Karaman, D. Rus, Learning robust control policies for end-to-end autonomous driving from data-driven simulation, IEEE Robotics and Automation Letters 5 (2) (2020) 1143–1150.
[11] D. Li, D. Zhao, Q. Zhang, Y. Chen, Reinforcement learning and deep learning based lateral control for autonomous driving [application notes], IEEE Computational Intelligence Magazine 14 (2) (2019) 83–98.
[12] M. Guiggiani, The science of vehicle dynamics, Pisa, Italy: Springer Netherlands (2014) 15.
[13] R. Rajamani, Vehicle dynamics and control, Springer Science & Business Media, 2011.
[14] X. Yuan, Y. Wang, J. Liu, C. Sun, Action mapping: A reinforcement learning method for constrained-input systems, IEEE Transactions on Neural Networks and Learning Systems 34 (10) (2023) 7145–7157. doi:10.1109/TNNLS.2021.3138924.
[15] S. Fujimoto, H. Hoof, D. Meger, Addressing function approximation error in actor-critic methods, in: International Conference on Machine Learning, 2018, pp. 1587–1596.
[16] R. Verschueren, S. De Bruyne, M. Zanon, J. V. Frasch, M. Diehl, Towards time-optimal race car driving using nonlinear MPC in real-time, in: 53rd IEEE Conference on Decision and Control, IEEE, 2014, pp. 2505–2510.
[17] R. Verschueren, M. Zanon, R. Quirynen, M. Diehl, Time-optimal race car driving using an online exact Hessian based nonlinear MPC algorithm, in: 2016 European Control Conference (ECC), IEEE, 2016, pp. 141–147.
[18] P. Scheffe, T. M. Henneken, M. Kloock, B. Alrifaee, Sequential convex programming methods for real-time optimal trajectory planning in autonomous vehicle racing, IEEE Transactions on Intelligent Vehicles 8 (1) (2022) 661–672.
[19] R. C. T. Novi, A. Liniger, C. Annicchiarico, Real-time control for at-limit handling driving on a predefined path, Vehicle System Dynamics 58 (7) (2020) 1007–1036.
[20] A. Liniger, J. Lygeros, Real-time control for autonomous racing based on viability theory, IEEE Transactions on Control Systems Technology 27 (2) (2017) 464–478.
[21] J. Kabzan, L. Hewing, A. Liniger, M. N. Zeilinger, Learning-based model predictive control for autonomous racing, IEEE Robotics and Automation Letters 4 (4) (2019) 3363–3370.
[22] U. Rosolia, F. Borrelli, Learning how to autonomously race a car: a predictive control approach, IEEE Transactions on Control Systems Technology 28 (6) (2019) 2713–2719.
[23] G. Williams, N. Wagener, B. Goldfain, P. Drews, J. M. Rehg, B. Boots, E. A. Theodorou, Information theoretic MPC for model-based reinforcement learning, in: 2017 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2017, pp. 1714–1721.
[24] G. Williams, P. Drews, B. Goldfain, J. M. Rehg, E. A. Theodorou, Information-theoretic model predictive control: Theory and applications to autonomous driving, IEEE Transactions on Robotics 34 (6) (2018) 1603–1622.
[25] P. Drews, G. Williams, B. Goldfain, E. A. Theodorou, J. M. Rehg, Aggressive deep driving: Model predictive control with a CNN cost model, arXiv preprint arXiv:1707.05303.
[26] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, D. Wierstra, Continuous control with deep reinforcement learning, in: ICLR, 2016.
[27] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347.
[28] T. Haarnoja, A. Zhou, P. Abbeel, S. Levine, Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor, in: International conference on machine learning, PMLR, 2018, pp. 1861–1870.
[29] Y. Wang, J. Sun, H. He, C. Sun, Deterministic policy gradient with integral compensator for robust quadrotor control, IEEE Transactions on Systems, Man, and Cybernetics: Systems 50 (10) (2019) 3713–3725.
[30] J. Hwangbo, I. Sa, R. Siegwart, M. Hutter, Control of a quadrotor with reinforcement learning, IEEE Robotics and Automation Letters 2 (4) (2017) 2096–2103.
[31] S. Gu, E. Holly, T. Lillicrap, S. Levine, Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates, in: 2017 IEEE international conference on robotics and automation (ICRA), IEEE, 2017, pp. 3389–3396.
[32] T. Koller, F. Berkenkamp, M. Turchetta, A. Krause, Learning-based model predictive control for safe exploration, in: 2018 IEEE conference on decision and control (CDC), IEEE, 2018, pp. 6059–6066.
[33] J. Achiam, D. Held, A. Tamar, P. Abbeel, Constrained policy optimization, in: International conference on machine learning, PMLR, 2017, pp. 22–31.
[34] C. Tessler, D. J. Mankowitz, S. Mannor, Reward constrained policy optimization, arXiv preprint arXiv:1805.11074.
[35] G. Dalal, K. Dvijotham, M. Vecerik, T. Hester, C. Paduraru, Y. Tassa, Safe exploration in continuous action spaces, arXiv preprint arXiv:1801.08757.
[36] M. Jaritz, R. De Charette, M. Toromanoff, E. Perot, F. Nashashibi, End-to-end race driving with deep reinforcement learning, in: 2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2018, pp. 2070–2075.
[37] F. Fuchs, Y. Song, E. Kaufmann, D. Scaramuzza, P. Dürr, Super-human performance in Gran Turismo Sport using deep reinforcement learning, IEEE Robotics and Automation Letters 6 (3) (2021) 4257–4264.
[38] P. R. Wurman, S. Barrett, K. Kawamoto, J. MacGlashan, K. Subramanian, T. J. Walsh, R. Capobianco, A. Devlic, F. Eckert, F. Fuchs, et al., Outracing champion Gran Turismo drivers with deep reinforcement learning, Nature 602 (7896) (2022) 223–228.
[39] A. Remonda, S. Krebs, E. E. Veas, G. Luzhnica, R. Kern, Formula RL: Deep reinforcement learning for autonomous racing using telemetry data, in: Workshop on Scaling-Up Reinforcement Learning: SURL, 2019.
[40] B. Wymann, E. Espié, C. Guionneau, C. Dimitrakakis, R. Coulom, A. Sumner, TORCS, the open racing car simulator, Software available at http://torcs.sourceforge.net 4 (6) (2000) 2.
[41] A. Remonda, E. Veas, G. Luzhnica, Comparing driving behavior of humans and autonomous driving in a professional racing simulator, PLoS one 16 (2) (2021) e0245320.
[42] J. Niu, Y. Hu, B. Jin, Y. Han, X. Li, Two-stage safe reinforcement learning for high-speed autonomous racing, in: 2020 IEEE International Conference on Systems, Man, and Cybernetics (SMC), IEEE, 2020, pp. 3934–3941.
[43] B. D. Evans, H. W. Jordaan, H. A. Engelbrecht, Safe reinforcement learning for high-speed autonomous racing, Cognitive Robotics 3 (2023) 107–126.
[44] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, M. Riedmiller, Deterministic policy gradient algorithms, in: International conference on machine learning, PMLR, 2014, pp. 387–395.
[45] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, N. Dormann, Stable-baselines3: Reliable reinforcement learning implementations, Journal of Machine Learning Research 22 (268) (2021) 1–8.
URL http://jmlr.org/papers/v22/20-1364.html
[46] KBRACER, Ruisi circuit lap time leaderboard, https://kbracer.github.io (Jan. 2023).