Generalized Multi-Objective Reinforcement Learning with Envelope Updates in URLLC-enabled Vehicular Networks

Zijiang Yan and Hina Tabassum

Z. Yan and H. Tabassum were with the Department of Electrical Engineering and Computer Science, York University, Toronto, ON, M3J 1P3, Canada (e-mail: {zjiyan,hinat}@yorku.ca). This research was supported by a Discovery Grant funded by the Natural Sciences and Engineering Research Council of Canada (NSERC). A preliminary version of this work has been presented at the IEEE Global Communications Conference (GLOBECOM), 2022 [1].

Abstract

We develop a novel multi-objective reinforcement learning (MORL) framework to jointly optimize wireless network selection and autonomous driving policies in a multi-band vehicular network operating on conventional sub-6GHz spectrum and Terahertz frequencies. The proposed framework is designed to (i) maximize the traffic flow and minimize collisions by controlling the vehicle’s motion dynamics (i.e., speed and acceleration), and (ii) enhance ultra-reliable low-latency communication (URLLC) while minimizing handoffs (HOs). We cast this problem as a multi-objective Markov Decision Process (MOMDP) and develop solutions for both predefined and unknown preferences of the conflicting objectives. Specifically, we first develop deep-Q-network and double deep-Q-network-based solutions that scalarize the transportation and telecommunication rewards using predefined preferences. We then develop a novel envelope MORL solution that learns policies addressing multiple objectives when the preferences are unknown to the agent. While this approach reduces reliance on scalar rewards, the effectiveness of a policy varies with the preferences, which is a challenge. To address this, we apply a generalized version of the Bellman equation and optimize the convex envelope of multi-objective Q values to learn a unified parametric representation capable of generating optimal policies across all possible preference configurations. Following an initial learning phase, our agent can execute optimal policies under any specified preference or infer preferences from minimal data samples. Numerical results validate the efficacy of the envelope-based MORL solution and demonstrate interesting insights related to the inter-dependency of vehicle motion dynamics, HOs, and the communication data rate. The proposed policies enable autonomous vehicles to adopt safe driving behaviors with improved connectivity.

Index Terms:

Autonomous driving, multi-objective reinforcement learning, multi-band network selection, resource allocation

I Introduction

Facilitating ultra-reliable and low-latency vehicle-to-infrastructure (V2I) communications is a fundamental prerequisite for the realization of autonomous and intelligent transportation systems. Different from throughput-oriented conventional communications, ensuring ultra-reliable low-latency communications (URLLC) is challenging as it depends jointly on the signal-to-interference-plus-noise ratio (SINR), data rate, over-the-air and queuing latency, and decoding error probability. Conventional radio frequency (RF) alone cannot efficiently meet the stringent URLLC requirements due to its limited coverage and narrow transmission bandwidths. In this context, 6G technology enables combining conventional sub-6GHz transmissions (we use sub-6GHz and RF communication interchangeably in this paper) with extremely high frequencies such as THz transmissions, where the former can compensate for the severe path-loss attenuation of THz transmission, and the latter can help overcome the RF spectrum congestion.

On the other hand, deep reinforcement learning (DRL) is becoming critical for online decision making in highly random mobility-oriented wireless environments. In the context of V2I communications, a plethora of research works focused on improving network quality of service (QoS) (e.g., transmission delay and link throughput) via DRL-based resource allocation [2, 3, 4]. These works considered sub-channel and power allocation to enhance V2I communications. In particular, the authors in [2, 3] adopted deep Q-network (DQN) and multi-agent DQN to simultaneously enhance the total throughput of V2I connections and the payload delivery rate of vehicle-to-vehicle (V2V) connections. Xu et al. [4] derived the contribution-based dual-clip proximal policy to optimize V2I and V2V links separately. However, their system model contains only a single base station (BS), so handovers (HOs) are not considered.

Recently, the authors in [5, 6, 7, 8, 9] formulated similar optimization tasks as multi-objective optimization problems (MOOPs) and proposed multi-objective reinforcement learning (MORL) solutions. Hu et al. [5] implemented Double-Loop Learning (DLL) to minimize the latency of real-time service transmissions and maximize the throughput of non-instant service transmissions. In [6, 7], the authors adopted the weighted Tchebycheff method and weighted-sum MORL, respectively, to maximize the ratio of the data rate to the power consumption. In [8], Guo et al. applied the Multi-Agent Proximal Policy Optimization (MAPPO) algorithm to address the joint handover and power allocation problem. In [9], Khan et al. utilized the Asynchronous Advantage Actor-Critic (A3C) algorithm to devise a vehicle-RSU association policy, aiming to enhance the mobile user experience by maximizing the sum data rate of multiple autonomous vehicles (AVs) while ensuring a minimum service rate for all AVs.

On another note, most of the existing research works in transportation focus on collision avoidance [10, 11], safe driving [12], and efficient fuel consumption [12, 13, 14]. For instance, in [10], the authors applied an RL framework for faster travel, in which the reward is proportional to the AV’s velocity and includes a penalty for vehicle collisions. The action space included acceleration, deceleration, lane changes (LC), and maintaining speed, whereas the state space was based on the AVs’ locations and their respective velocities. In [12], the authors applied DDQN to improve the AVs’ driving safety and fuel consumption. The state space included the AVs’ locations, fuel consumption, and velocities, whereas the actions included the speeds and LC of the AVs. In [13], the authors applied the Intelligent Driver Model (IDM) and Minimizing Overall Braking Induced by Lane Change (MOBIL) to control steering and lane changes. The proposed reward design encourages long traffic and high speed, and discourages unnecessary LC. The authors in [14] introduced a multi-objective actor-critic method that optimizes two objectives to improve the tradeoff between the energy consumption and travel efficiency of AVs.

To date, none of the existing research works [2, 4, 3, 8, 9, 5, 15, 6, 7, 10, 12, 13, 14] have considered the inter-dependency between the AV motion dynamics and the wireless data rates. To jointly improve the communication data rate and minimize road collisions of connected vehicles, it is critical to optimize the AVs’ network selection and driving policies simultaneously. Our contributions can be summarized as follows:

• We develop an MORL framework to design joint network selection and autonomous driving policies in a multi-band vehicular network (VNet). The objectives are to (i) maximize the traffic flow and minimize collisions by controlling the vehicle’s motion dynamics (i.e., speed and acceleration) from a transportation perspective, and (ii) maximize the data rates and minimize handoffs (HOs) by jointly controlling the vehicle’s motion dynamics and network selection from a telecommunication perspective. We consider a novel reward function that maximizes data rate and traffic flow, ensures traffic load balancing across the network, and penalizes HOs and unsafe driving behaviors.

• The considered problem is formulated as a multi-objective Markov decision process (MOMDP) with a two-dimensional action space and rewards consisting of telecommunication and autonomous driving utilities. We then propose single-policy MORL solutions with predefined preferences, which convert the MOOP into a single-objective problem, and apply DQN and double DQN solutions. The resulting optimal policy depends on the relative preferences of the objectives.

• Learning optimized policies across multiple preferences remains challenging. To address this, we develop a novel envelope MORL solution that covers the entire spectrum of preferences within a given domain. This approach enables the trained model to generate the best possible policy for any user-defined preference. Our algorithm hinges on two fundamental insights. First, we demonstrate that the optimality operator governing a generalized Bellman equation with preferences exhibits valid contraction properties. Second, by optimizing for the convex envelope of multi-objective Q-values, we ensure an efficient alignment between preferences and the resultant optimal policies. Leveraging hindsight experience replay, we recycle transitions to facilitate learning across various sampled preferences, while employing homotopy optimization to keep the learning process tractable.

• We develop a novel simulation testbed, RF-THz-Highway-Env, that emulates a multi-band wireless network-enabled VNet and is based on highway-env [16]. This test environment not only inherits the autonomous driving and highway lane-change features of [16], but also implements RF/THz channel propagation modeling, network selection, and HO control.

• Numerical results show that the proposed solution outperforms weighted sum-based MORL solutions with DQN by $12.7\%$, $18.9\%$, and $12.3\%$ in terms of average transportation reward, average communication reward, and average HO rate, respectively.

The rest of this work is organized as follows. Section II presents the system model, and Section III provides the RL, DRL, and MORL background together with the problem formulation. Section IV introduces the multi-objective reinforcement learning approach. The simulations are presented in Section V, and Section VI concludes this research work.

II System Model and Problem Formulation

We consider a multi-band downlink network comprising $n_{R}$ RF base stations (RBSs) and $n_{T}$ THz base stations (TBSs), together with a multi-lane road comprising $N_{L}$ lanes. $M$ AVs receive information from the BSs (deployed alongside the road) through V2I communications (as depicted in Figure 1). Each AV is permitted to associate with only one BS at a time, regardless of whether the BS is an RBS or a TBS. The on-board units (OBUs) on the AVs receive real-time information from the VNet, including the velocity, acceleration, and lane position of surrounding vehicles. The bandwidths available to each RBS and TBS are denoted by $W_{R}$ and $W_{T}$, respectively. Each RBS and TBS can support a maximum number of AVs, denoted by $Q_{R}$ and $Q_{T}$, respectively. All AVs are equipped with a single antenna.

Figure 1: The diagram illustrates the structure of the multi-band VNet model. The blue and red circles represent TBSs and RBSs, respectively. The solid and dashed lines represent desired signal links and interference links, respectively.

II-A Downlink V2I Data Transmission Model

The signal transmitted by the RBS is subject to path-loss and short-term channel fading. Subsequently, the signal-to-interference-plus-noise ratio (SINR) of the $j$ -th AV from $i$ -th RBS is given as [17]:

\mathrm{SINR}^{\mathrm{RF}}_{ij}=\frac{P_{R}^{\mathrm{tx}}\>G_{R}^{\mathrm{tx}}\>G_{R}^{\mathrm{rx}}\left(\frac{c}{4\pi f_{R}}\right)^{2}H_{i}}{r_{ij}^{\alpha}\left(\sigma^{2}+I_{R_{j}}\right)},

(1)

where $P_{R}^{\mathrm{tx}}$, $G_{R}^{\mathrm{tx}}$, $G_{R}^{\mathrm{rx}}$, $c$, $f_{R}$, $r_{ij}$, and $\alpha$ denote the transmit power of the RBSs, the transmit antenna gain, the receive antenna gain, the speed of light, the RF carrier frequency (in GHz), the distance between the $j$-th AV and the $i$-th RBS, and the path-loss exponent, respectively. In addition, $H_{i}$ is the exponentially distributed channel fading power observed at the $j$-th AV from the $i$-th RBS, $\sigma^{2}$ is the thermal noise power at the receiver, and $I_{R_{j}}$ is the cumulative interference at the $j$-th AV from the interfering RBSs, i.e., $I_{R_{j}}=\sum_{k\neq i}P_{R}^{\mathrm{tx}}\gamma_{R}r_{kj}^{-\alpha}H_{k}$, where $r_{kj}$ is the distance between the $k$-th interfering RBS and the $j$-th AV, $H_{k}$ is the fading power from the $k$-th interfering RBS to the $j$-th AV, and $\gamma_{R}=G_{R}^{\mathrm{tx}}\>G_{R}^{\mathrm{rx}}\left({c}/{4\pi f_{R}}\right)^{2}$.

In a Terahertz (THz) network, molecular absorption significantly impacts signal propagation, and line-of-sight (LOS) transmissions dominate non-line-of-sight (NLOS) transmissions. Consequently, the SINR of the $j$-th AV served by the $i$-th TBS can be modeled as follows [17]:

\mathrm{SINR}^{\mathrm{THz}}_{ij}=\frac{G_{T}^{\mathrm{tx}}G_{T}^{\mathrm{rx}}\left(\frac{c}{4\pi f_{T}}\right)^{2}P_{T}^{\mathrm{tx}}\>\mathrm{exp}(-K_{a}(f_{T})r_{ij})\,r_{ij}^{-2}}{N_{T_{j}}+I_{T_{j}}},

(2)

where $G_{T}^{\mathrm{tx}}$, $G_{T}^{\mathrm{rx}}$, $P_{T}^{\mathrm{tx}}$, $f_{T}$, $r_{ij}$, and $K_{a}(f_{T})$ represent the transmit antenna gain of the TBS, the receiving antenna gain, the transmit power of the TBS, the THz carrier frequency, the distance between the $j$-th AV and the $i$-th TBS, and the molecular absorption coefficient of the transmission medium, respectively (for the sake of brevity, the argument of $K_{a}(f_{T})$ is omitted in the remainder of this study). Note that $G_{T}^{\mathrm{rx}}(\theta)$ and $G_{T}^{\mathrm{tx}}(\theta)$ denote the antenna gains at the receiver and transmitter sides corresponding to the boresight direction angle $\theta\in[-\pi,\pi)$. The beamforming gain from the main and side lobes of the TBS transmitting antenna is defined as

G_{T}^{q}\left(\theta\right)=\begin{cases}G^{q}_{\mathrm{max}}&\mid\theta\mid\leq w_{q}\\ G^{q}_{\mathrm{min}}&\mid\theta\mid>w_{q}\end{cases},

(3)

where the superscript $q\in\{\mathrm{tx,rx}\}$ indicates the transmit or receive antenna, $w_{q}$ is the beamwidth of the main lobe, and $G^{q}_{\mathrm{max}}$ and $G^{q}_{\mathrm{min}}$ are the beamforming gains of the main and side lobes, respectively. We assume that AVs can align the receiving beam with the TBS transmit beam using beam alignment techniques. For the alignment between the user and the interfering TBSs, we define a random variable $\Theta\in\{G^{\mathrm{tx}}_{\mathrm{max}}G^{\mathrm{rx}}_{\mathrm{max}},G^{\mathrm{tx}}_{\mathrm{max}}G^{\mathrm{rx}}_{\mathrm{min}},G^{\mathrm{tx}}_{\mathrm{min}}G^{\mathrm{rx}}_{\mathrm{max}},G^{\mathrm{tx}}_{\mathrm{min}}G^{\mathrm{rx}}_{\mathrm{min}}\}$, whose values occur with probabilities $F_{\mathrm{tx}}F_{\mathrm{rx}}$, $F_{\mathrm{tx}}(1-F_{\mathrm{rx}})$, $(1-F_{\mathrm{tx}})F_{\mathrm{rx}}$, and $(1-F_{\mathrm{tx}})(1-F_{\mathrm{rx}})$, respectively, where $F_{\mathrm{tx}}=\frac{\theta_{\mathrm{tx}}}{2\pi}$ and $F_{\mathrm{rx}}=\frac{\theta_{\mathrm{rx}}}{2\pi}$, with $\theta_{\mathrm{tx}}$ and $\theta_{\mathrm{rx}}$ being the beamwidths of the transmitter and receiver antennas, respectively. Without loss of generality, we consider negligible side lobe gains. Thus, the cumulative interference $I_{T_{j}}$ at the $j$-th AV from the interfering TBSs is given as $I_{T_{j}}=\sum_{k\neq i}\gamma_{T}\>P_{T}^{\mathrm{tx}}\>F_{\mathrm{tx}}F_{\mathrm{rx}}\,r_{kj}^{-2}\,\mathrm{exp}(-K_{a}\>{r_{kj}}),$ where $\gamma_{T}=G_{T}^{\mathrm{tx}}G_{T}^{\mathrm{rx}}\left(\frac{c}{4\pi f_{T}}\right)^{2}$. The cumulative thermal and molecular absorption noise is thus given as:

\begin{multlined}N_{T_{j}}=N_{0}+P_{T}^{\mathrm{tx}}\gamma_{T}\>{r_{ij}^{-2}}\>(1-e^{-K_{a}\>{r_{ij}}})+\\ \sum_{k\neq i}\gamma_{T}F_{\mathrm{tx}}F_{\mathrm{rx}}\>P_{T}^{\mathrm{tx}}\>{r_{kj}^{-2}}(1-e^{-K_{a}\>{r_{kj}}}).\end{multlined}

(4)
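As a minimal numerical sketch of (1), (2), and (4), the following Python functions evaluate both SINR expressions for one serving link and a list of interferers. The function names, the argument layout, and the choice of Hz as the frequency unit inside `link_gain` are illustrative assumptions and are not taken from the paper's simulator.

```python
import math

C_LIGHT = 3.0e8  # speed of light in m/s


def link_gain(g_tx, g_rx, freq_hz):
    """gamma = G_tx * G_rx * (c / (4 * pi * f))^2, with f in Hz (assumed unit)."""
    return g_tx * g_rx * (C_LIGHT / (4.0 * math.pi * freq_hz)) ** 2


def sinr_rf(p_tx, gamma_r, alpha, r_serv, h_serv, interferers, noise_power):
    """Eq. (1); interferers is a list of (distance r_kj, fading power H_k)."""
    signal = p_tx * gamma_r * h_serv * r_serv ** (-alpha)
    interference = sum(p_tx * gamma_r * h_k * r_k ** (-alpha)
                       for r_k, h_k in interferers)
    return signal / (noise_power + interference)


def sinr_thz(p_tx, gamma_t, k_a, r_serv, interferer_dists, f_tx, f_rx, n0):
    """Eq. (2), using the interference term and the noise of Eq. (4)."""
    def received(r):
        # Spreading loss combined with molecular absorption loss.
        return p_tx * gamma_t * r ** (-2) * math.exp(-k_a * r)

    def absorbed(r):
        # Absorbed power that reappears as molecular absorption noise.
        return p_tx * gamma_t * r ** (-2) * (1.0 - math.exp(-k_a * r))

    align = f_tx * f_rx  # probability of main-lobe alignment with an interferer
    interference = sum(align * received(r) for r in interferer_dists)
    noise = n0 + absorbed(r_serv) + sum(align * absorbed(r) for r in interferer_dists)
    return received(r_serv) / (noise + interference)
```

Passing an empty interferer list reduces both functions to a signal-to-noise ratio, which is a convenient sanity check before adding interfering BSs.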

The traditional data rate relies on Shannon’s capacity, which can be attained as the block-length of channel codes approaches infinity. Nevertheless, to prevent prolonged transmission delays in URLLC, the block length must be limited. Consequently, Shannon’s capacity cannot be realized due to the presence of a non-zero decoding error probability [18]. From [19], the achievable rate in the short block-length regime over an AWGN channel can be approximated as:

R_{ij}=\frac{W_{j}}{\ln{2}}\left[\ln(1+\mathrm{SINR}_{ij})-\sqrt{\frac{V}{L_{B}}}f_{Q}^{-1}(\epsilon_{c})\right]

(5)

where $W_{j},L_{B},\epsilon_{c},f_{Q}^{-1}(\cdot),V$ are the transmission bandwidth of BS $j$, blocklength, decoding error probability, the inverse $Q$ function, and the channel dispersion, respectively. $V$ can be calculated as $1-\frac{1}{(1+\mathrm{SINR}_{ij})^{2}}$. Given that it takes time $D_{t}$ to transmit $L_{B}$ symbols, the time and frequency resources are related by $D_{t}W=L_{B}$. As the blocklength $L_{B}$ approaches infinity, the achievable rate in (5) approaches Shannon’s capacity.
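The behavior of (5) can be checked numerically with a short Python sketch. The helper names are hypothetical, and the inverse $Q$ function is obtained from the standard normal quantile function as $f_{Q}^{-1}(\epsilon)=\Phi^{-1}(1-\epsilon)$.

```python
import math
from statistics import NormalDist


def q_inv(eps):
    """Inverse Gaussian Q-function: Q^{-1}(eps) = Phi^{-1}(1 - eps)."""
    return NormalDist().inv_cdf(1.0 - eps)


def short_block_rate(bandwidth_hz, sinr, block_len, eps):
    """Achievable rate of Eq. (5) in bit/s (sinr is linear, not in dB)."""
    dispersion = 1.0 - 1.0 / (1.0 + sinr) ** 2           # channel dispersion V
    penalty = math.sqrt(dispersion / block_len) * q_inv(eps)
    return bandwidth_hz / math.log(2.0) * (math.log(1.0 + sinr) - penalty)


def shannon_rate(bandwidth_hz, sinr):
    """Infinite-blocklength reference: W * log2(1 + SINR)."""
    return bandwidth_hz * math.log2(1.0 + sinr)
```

For a decoding error probability below 0.5 the penalty term is positive, so the short-blocklength rate stays below the Shannon rate and the gap shrinks as `block_len` grows.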

Each AV maintains a list of the BSs in terms of the achievable data rate $R_{ij}$ and then informs these BSs. Consequently, each BS $i$ can calculate the number of possible AV associations at each time instance, denoted by $n_{i}$. Then, the AV collects the traffic load information from these BSs (i.e., the number of AVs $n_{i}$ associated with each BS). Based on the quota of each BS $i$, $Q_{i}\in\{Q_{R},Q_{T}\}$, each AV computes a weighted data rate metric that encourages traffic load balancing at each BS and discourages unnecessary HOs, i.e.,

\text{WR}_{ij}=\frac{R_{ij}}{\min\left(Q_{i},n_{i}\right)}(1-\mu)

(6)

where $\mu$ denotes the HO penalty to discourage unnecessary HOs that is defined as follows:

{\mu=\begin{dcases}0.1,&\text{if switch to a RBS}\\ 0.5,&\text{if switch to a TBS }\\ 0,&\text{keep previous BS}\end{dcases}}

(7)

As AVs traverse the corridor, they transition from one BS to another, which is referred to as an HO. We distinguish between two types of HOs: horizontal and vertical. A horizontal HO shifts the AV connection from one BS to another BS of the same type. In contrast, a vertical HO shifts the AV connection from one type of BS to a different type, such as from an RBS to a TBS. Frequent HOs can significantly degrade the data rate that AVs receive, due to the inherent latency and failure rates associated with HOs. In this paper, we introduce a penalty $\mu$ to discourage HOs. This penalty is higher for TBSs and lower for RBSs, reflecting the fact that THz transmission is limited to a relatively short distance, which makes it more vulnerable to unnecessary HOs.

Then, each AV prepares a sorted list of BSs offering the best weighted data rates $\text{WR}_{ij}$ and associates to those that can fulfill the data rate requirement of the AV given by $R_{\mathrm{th}}$ .
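The association rule built on (6) and (7) can be sketched in Python as follows. The dictionary layout of a candidate BS and the function names are hypothetical. The sketch also assumes that the load $n_{i}$ is at least one and that the threshold $R_{\mathrm{th}}$ is checked against the unweighted rate $R_{ij}$, which the text does not state explicitly.

```python
HO_PENALTY = {"RBS": 0.1, "TBS": 0.5}  # mu values of Eq. (7) when switching BS


def weighted_rate(rate, quota, load, bs_type, is_current_bs):
    """Eq. (6): WR = R / min(Q, n) * (1 - mu); assumes load >= 1."""
    mu = 0.0 if is_current_bs else HO_PENALTY[bs_type]
    return rate / min(quota, load) * (1.0 - mu)


def rank_base_stations(candidates, current_bs_id, rate_threshold):
    """Return (WR, bs_id) pairs sorted from best to worst weighted rate.

    candidates: list of dicts with keys "id", "type", "rate", "quota", "load".
    BSs whose rate is below rate_threshold are dropped (assumed reading of R_th).
    """
    scored = []
    for bs in candidates:
        if bs["rate"] < rate_threshold:
            continue
        wr = weighted_rate(bs["rate"], bs["quota"], bs["load"],
                           bs["type"], bs["id"] == current_bs_id)
        scored.append((wr, bs["id"]))
    scored.sort(reverse=True)
    return scored


example = [
    {"id": "r1", "type": "RBS", "rate": 100.0, "quota": 5, "load": 2},
    {"id": "t1", "type": "TBS", "rate": 180.0, "quota": 5, "load": 2},
]
ranking = rank_base_stations(example, current_bs_id="r1", rate_threshold=50.0)
```

In this illustrative example the TBS offers a higher raw rate, but the 0.5 HO penalty lowers its weighted rate to 45 against 50 for the current RBS, so the AV stays connected to `r1`.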

II-B Transportation Model

We categorize $M$ AVs into two groups: target vehicles, denoted as $M_{1}$ , and surrounding vehicles, denoted as $M_{2}$ .

II-B1 Kinematics Model for all AVs

Following [20, 16], we update the real-time physical locations of all AVs based on the kinematics model. Assuming that only the front wheels can be steered, the following relations apply to each AV $j$, $j\in M$:

\frac{\partial}{\partial t}(x_{j})=v_{j}\cos(\psi_{j}+\beta_{j}),\quad\beta_{j}=\arctan\left({\frac{\tan{\delta_{j}^{\mathrm{fa}}}}{2}}\right)

(8)

\frac{\partial}{\partial t}(y_{j})=v_{j}\sin(\psi_{j}+\beta_{j})

(9)

\frac{\partial}{\partial t}(v_{j})=a_{j},\quad\frac{\partial}{\partial t}(\psi_{j})=\frac{v_{j}}{l_{j}}\sin{\beta_{j}}

(10)

where $(x_{j},y_{j})$ is the location of AV $j$ . $\psi_{j},a_{j},\beta_{j},l_{j},\delta_{j}^{\mathrm{fa}}$ are AV $j$ ’s heading, command for acceleration, slip angle at the gravity center, half-length of AV $j$ , and front wheel angle, respectively.
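A minimal sketch of how (8)-(10) advance one AV in discrete time is given below. The forward-Euler discretization and the step size `dt` are assumptions made for illustration, since the text only specifies the continuous-time relations.

```python
import math


def kinematic_step(x, y, v, psi, accel, steer, half_length, dt):
    """One forward-Euler step of Eqs. (8)-(10) for a single AV."""
    beta = math.atan(math.tan(steer) / 2.0)        # slip angle at the gravity center
    x_new = x + v * math.cos(psi + beta) * dt      # Eq. (8)
    y_new = y + v * math.sin(psi + beta) * dt      # Eq. (9)
    v_new = v + accel * dt                         # Eq. (10), speed update
    psi_new = psi + (v / half_length) * math.sin(beta) * dt  # Eq. (10), heading update
    return x_new, y_new, v_new, psi_new
```

With zero steering angle and zero acceleration the AV moves straight along its heading at constant speed, which matches the relations above.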

II-B2 Acceleration and Lane Change of Target Vehicles

As in [16, 21], each target AV $j$, $j\in M_{1}$, follows a lane $L_{j}$ through its lateral position $y_{j}$ and heading $\psi_{L_{j}}$. Thus, target AVs follow a cascade controller of lateral position and heading, i.e.,

\frac{\partial}{\partial t}(\psi_{j})=K_{j}^{\psi}\left[\psi_{L_{j}}+\arcsin\left(\frac{\tilde{v}_{j,y}}{v_{j}}\right)-\psi_{j}\right]

(11)

where $\tilde{v}_{j,y}=K_{j}^{y}\left(y_{L_{j}}-y_{j}\right)$, $K_{j}^{\psi}$ and $K_{j}^{y}$ are the control gains, and $y_{L_{j}}$ is the lateral position of lane $L_{j}$. The acceleration of target AV $j$, for $j\in M_{1}$, is given as [16, 21]:

a_{j}=K_{0}^{v}(v_{r}-v_{j})

(12)

where $v_{r}$ is the desired AV speed and $K_{0}^{v}$ is the control gain.
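The two controllers of the target AVs can be sketched as plain functions. The function names are illustrative, the AV speed is assumed to be strictly positive, and the clipping of the arcsine argument to $[-1,1]$ is an added numerical safeguard that is not part of (11).

```python
import math


def heading_rate(k_psi, k_y, y_lane, y, psi_lane, psi, v):
    """Right-hand side of Eq. (11); assumes v > 0."""
    v_lat = k_y * (y_lane - y)               # lateral velocity command
    ratio = max(-1.0, min(1.0, v_lat / v))   # safeguard (assumption), keeps asin defined
    return k_psi * (psi_lane + math.asin(ratio) - psi)


def speed_control_accel(k_v, v_ref, v):
    """Eq. (12): proportional tracking of the desired speed v_r."""
    return k_v * (v_ref - v)
```

An AV that is centered on its target lane and aligned with the lane heading receives a zero heading rate, and an AV that already drives at the desired speed receives zero acceleration.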

II-B3 Acceleration and Lane Change of Surrounding Vehicles

Note that the surrounding AVs are not involved in training. Instead, they select their accelerations and decelerations using the Intelligent Driver Model (IDM) [16, 22] and their lane changes using the MOBIL model [23, 16], which allows them to track a specific target speed and follow a specific lane. IDM is based on the idea that surrounding AVs choose their acceleration or deceleration based on their current speed and the distance to the vehicle in front. The model combines a desire to drive at a certain free-flow speed with a reaction to the traffic situation, particularly to avoid collisions. Let $v_{j}$ denote the velocity of AV $j$, $d_{j}$ the actual gap to the leading AV, and $\Delta v_{j}$ the speed difference between the two AVs. With IDM, the acceleration $a_{j}$ of AV $j$, $j\in M_{2}$, and its desired minimum gap $\hat{d_{j}}$ to the vehicle ahead are modeled as [16, 22]:

a_{j}=a_{c}-a_{c}\left[\left(\frac{|v_{j}|}{v_{0}}\right)^{\delta_{a}}+\left(\frac{\hat{d_{j}}}{d_{j}}\right)^{2}\right],

(13)

where

\hat{d_{j}}=d_{0}+\max\left(0,Tv_{j}+\frac{v_{j}\Delta v_{j}}{2\sqrt{a_{c}b_{c}}}\right),

(14)

where $a_{c},b_{c},v_{0}$ are predefined parameters denoting the AV’s maximum acceleration, comfortable braking deceleration, and desired velocity, respectively. $\delta_{a}$ denotes the acceleration reduction factor, i.e., a higher $\delta_{a}$ reduces acceleration. $d_{0}$ and $T$ denote the minimum distance in stopped traffic and the safe time to approach the front vehicle, respectively. The desired gap has a steady-state (equilibrium) term and a dynamic term that implements the intelligent braking strategy.
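The IDM relations (13) and (14) translate directly into code. In the sketch below the function and argument names are illustrative, and $\Delta v_{j}$ is assumed to be the approach rate, i.e., the speed of AV $j$ minus the speed of the leading AV.

```python
import math


def idm_desired_gap(v, dv, d0, t_headway, a_c, b_c):
    """Eq. (14): desired minimum gap to the vehicle ahead."""
    dynamic = t_headway * v + v * dv / (2.0 * math.sqrt(a_c * b_c))
    return d0 + max(0.0, dynamic)


def idm_acceleration(v, gap, dv, v0, a_c, b_c, d0, t_headway, delta_a):
    """Eq. (13): IDM acceleration of a surrounding AV (requires gap > 0)."""
    d_hat = idm_desired_gap(v, dv, d0, t_headway, a_c, b_c)
    return a_c - a_c * ((abs(v) / v0) ** delta_a + (d_hat / gap) ** 2)
```

On a free road the gap term vanishes, so the acceleration tends to $a_{c}$ at standstill and to zero at the desired velocity $v_{0}$. When the actual gap falls below the desired gap the gap term exceeds one and the command becomes a deceleration.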

To maintain high transportation efficiency, redundant braking during lane changes of the surrounding AVs is discouraged. The MOBIL model focuses on minimizing the overall braking induced by a lane change in the AV’s lateral behavior. Under MOBIL, an AV changes lanes when the following conditions are met: (1) the lane change does not impose unsafe braking, i.e., $\tilde{a}_{j}\geq-b_{\textbf{safe}},$ where $b_{\textbf{safe}}$ is the maximum braking imposed on an AV when an AV cuts into the adjacent lane, and (2) the lane change offers a sufficient acceleration incentive once the braking imposed on the following vehicles is accounted for, i.e.,

\tilde{a_{j}}-a_{j}+p\left(\tilde{a_{o}}-a_{o}+\tilde{a_{n}}-a_{n}\right)\geq\Delta a_{\textbf{th}}

(15)

where $a_{j}$ and $\tilde{a}_{j}$ are the accelerations of the AV before and after the lane change. The subscripts $o$ and $n$ denote the AV’s old follower (before the lane change) and new follower (after the lane change), respectively. $p$ represents the politeness coefficient. $\Delta a_{\textbf{th}}$ is the acceleration gain required to execute the lane change. A lane-change decision adjusts the target lane $L_{j}$ of AV $j$ on its current lane segment. The actual motion planning and steering to track the newly targeted lane are then executed by the lateral controller in (11).
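The two MOBIL checks can be evaluated side by side with a small helper. The function name and the returned pair are illustrative assumptions, and the sketch deliberately reports each check separately so that the caller decides how to combine them.

```python
def mobil_criteria(a_self, a_self_new, a_old, a_old_new, a_new, a_new_new,
                   politeness, accel_threshold, b_safe):
    """Return (braking_ok, incentive_ok) for a candidate lane change.

    a_self / a_self_new : acceleration of the AV before / after the change
    a_old  / a_old_new  : acceleration of the old follower before / after
    a_new  / a_new_new  : acceleration of the new follower before / after
    """
    # Braking bound as written in the text: a~_j >= -b_safe.
    braking_ok = a_self_new >= -b_safe
    # Left-hand side of Eq. (15): own gain plus politeness-weighted gain of followers.
    gain = (a_self_new - a_self) + politeness * (
        (a_old_new - a_old) + (a_new_new - a_new))
    incentive_ok = gain >= accel_threshold
    return braking_ok, incentive_ok
```

Setting `politeness` to zero makes the incentive check purely selfish, whereas larger values weigh the acceleration changes of the old and new followers more heavily.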

III Preliminaries and MOMDP Formulation

This section first provides a primer on the formulation of the Markov Decision Process (MDP) for single-objective problems and the Multi-Objective Markov Decision Process (MOMDP) for multi-objective problems. We then formulate the considered problem as an MOMDP and discuss the design of the state-action space and the reward function.

III-A Mathematical Preliminaries

III-A1 MDPs

For a given task, the interaction between the agent and the environment can be defined as an MDP $\mathcal{M}$, represented by the tuple $\mathcal{M}=\langle\mathcal{S},\mathcal{A},r,\mathcal{P},\gamma\rangle$, where $\mathcal{S}$, $\mathcal{A}$, and $r$ are the set of states $s$, the set of actions $a$, and the reward $r(s,a)$, respectively, and $\gamma\in[0,1)$ is the discount factor. $r(s,a)$ represents the stochastic instantaneous reward that the agent receives when a specific action $a\in\mathcal{A}$ is taken in a specific state $s\in\mathcal{S}$. $\mathcal{P}(s_{t+1}|s_{t},a_{t})$ is the probability that taking action $a_{t}\in\mathcal{A}$ in state $s_{t}\in\mathcal{S}$ at time step $t$ leads to the next state $s_{t+1}\in\mathcal{S}$. In RL, the agent interacts with the environment by following a trajectory $\tau=\{(s_{t},a_{t})\}_{t=0}^{\infty}$ from time step $t=0$ to the end of the episode, and receives the total discounted reward ${r}_{\tau}=\sum_{t=0}^{\infty}\gamma^{t}r_{t}(s_{t},a_{t})$. The goal in RL is to find the policy $\pi$, i.e., the mapping between states and actions, that maximizes the discounted total reward. In each time step $t$, the agent selects an action based on its current state to maximize the discounted total reward ${r}_{\tau}$. Finally, we obtain a policy $\pi$ that belongs to the policy set $\Pi$. The state-action value function $Q_{\pi}(s,a)$ of the policy $\pi$ in state $s$ is given as:

Q_{\pi}(s,a)=\mathbb{E}_{\pi}\left[r_{t}(s_{t},a_{t})+\gamma Q_{\pi}(s_{t+1},a_{t+1})\mid s_{t}=s,a_{t}=a\right]

(16)

Thus, the optimal policy is given as $\pi^{*}=\arg\max_{\pi\in\Pi}Q_{\pi}(s,a).$

III-A2 MOMDPs

The goal of MORL is to obtain policies that balance $H$ conflicting or competing objectives, where the relative importance (preferences) of each objective may be known or unknown to the agent. Similar to RL, MORL can be formulated as an MOMDP, which extends the MDP by defining a new reward space, preference space, and preference function, i.e., $\mathcal{M}=\langle\mathcal{S},\mathcal{A},\mathbf{r},\mathcal{P},\Omega,\iota_{0}\rangle$, where $\mathbf{r}\in\mathbb{R}^{H}$ is a vector of reward functions corresponding to the $H$ objectives, e.g., $\mathbf{r}=[r^{1},r^{2},\dots,r^{H}]$, $\Omega$ is the preference space, $\boldsymbol{\omega}\in\Omega$ is the preference vector corresponding to the $H$ objectives with $\sum_{h=1}^{H}{{\omega}}^{h}=1$, and $\iota_{0}$ is the probability distribution over initial states. In MOMDPs, a policy $\pi:\mathcal{S}\to\mathcal{A}$ defines a mapping from states to actions with the goal of maximizing a vector of expected rewards. Given the distribution $\iota_{0}$ and a policy $\pi$, the expected discounted return is defined as:

\mathbf{Q}_{\pi}(s,a,\boldsymbol{\omega})=\mathbb{E}_{\pi}\left[\mathbf{r}(s_{t},a_{t})+\gamma\mathbf{Q}_{\pi}(s_{t+1},a_{t+1},\boldsymbol{\omega})\mid s_{t}=s,a_{t}=a\right]

(17)

where $\mathbf{r}(s_{t},a_{t})$ is the immediate vector-valued reward at time step $t$ for the $H$ objectives. Maximizing the expected reward involves solving the following MOOP: $\max\mathbf{Q}_{\pi}=\max_{\pi}[Q_{\pi}^{1},Q_{\pi}^{2},\dots,Q_{\pi}^{H}]$.

A policy $\pi$ strictly dominates another policy $\pi^{\prime}$ if $\pi$ achieves values at least as high as $\pi^{\prime}$ in all objectives and strictly higher in at least one objective:

\pi>\pi^{\prime}\iff\forall h,{V}^{h}_{\pi}\geq{V}^{h}_{\pi^{\prime}}\land\exists\>h,{V}^{h}_{\pi}>{V}^{h}_{\pi^{\prime}}

(18)

Furthermore, a policy $\pi$ weakly dominates another policy $\pi^{\prime}$ , if $\pi$ achieves values greater than or equal to $\pi^{\prime}$ in all objectives, i.e., $\pi\geq\pi^{\prime}$ , when ${V}^{h}_{\pi}\geq{V}^{h}_{\pi^{\prime}},\quad\forall h$ . A policy $\pi$ is considered Pareto-optimal (or non-dominated) if it is not strictly dominated by any other policies.

Considering all returns from the MOMDP, we have the Pareto frontier set $\mathcal{F}^{*}:=\left\{\mathbf{\hat{r}}\mid\nexists\mathbf{\hat{r}}^{\prime}\geq\mathbf{\hat{r}}\right\}$ [24], where $\mathbf{\hat{r}}=\sum_{t=0}^{\infty}\gamma^{t}\mathbf{r}(s_{t},a_{t})$. For all possible preferences in $\Omega$, we define a convex coverage set (CCS) of the Pareto frontier which contains all returns that provide the maximum cumulative reward, i.e.,

\mathrm{CCS}:=\left\{\mathbf{\hat{r}}\in\mathcal{F}^{*}\mid\exists{\boldsymbol{\omega}}\in\Omega,\forall\mathbf{\hat{r}}^{\prime}\in\mathcal{F}^{*}\text{ s.t.}\>\boldsymbol{\omega}^{T}\mathbf{\hat{r}}\geq\boldsymbol{\omega}^{T}\mathbf{\hat{r}}^{\prime}\right\}

(19)

where $(\cdot)^{T}$ denotes the transpose operator.
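To make the dominance relation (18) and the CCS definition (19) concrete, the following Python sketch filters a finite list of return vectors. The function names and the example returns are hypothetical, and the CCS is approximated by sweeping a finite grid of preference vectors rather than the full preference space $\Omega$.

```python
def dominates(a, b):
    """Strict Pareto dominance of Eq. (18): a >= b everywhere and > somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(returns):
    """Keep the return vectors that no other vector strictly dominates."""
    return [r for r in returns if not any(dominates(o, r) for o in returns)]


def convex_coverage_set(returns, preferences):
    """Approximate Eq. (19): returns that maximize w^T r for some sampled w."""
    front = pareto_front(returns)
    ccs = []
    for w in preferences:
        best = max(front, key=lambda r: sum(wi * ri for wi, ri in zip(w, r)))
        if best not in ccs:
            ccs.append(best)
    return ccs


# Illustrative two-objective returns (transportation, telecommunication).
example_returns = [(1.0, 0.0), (0.0, 1.0), (0.4, 0.4), (0.3, 0.3)]
preference_grid = [(k / 10.0, 1.0 - k / 10.0) for k in range(11)]
front = pareto_front(example_returns)
ccs = convex_coverage_set(example_returns, preference_grid)
```

In this example `(0.4, 0.4)` is Pareto-optimal but lies below the line joining the two extreme returns, so no linear preference selects it and it is excluded from the CCS. This illustrates why the CCS can be a strict subset of the Pareto frontier.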

III-B MOMDP Formulation

In this section, we formulate the considered problem as a MOMDP and discuss the corresponding design of state, action, and reward. The state transitions and rewards are a function of the AV environment and actions taken by the AV.

III-B1 State Space

The state space consists of kinematics-related features, represented by an $M_{1}\times F$ array whose rows contain the $F$ features $\{x_{j},y_{j},v_{j},\psi_{j},n_{R}^{j},n_{T}^{j}\}$ of each target AV. We consider $M_{1}$ target AVs and $M_{2}$ surrounding AVs. Each target AV is characterized by its (1) coordinates $(x_{j},y_{j})$, (2) forward velocity $v_{j}$, (3) heading $\psi_{j}$, and (4) $n_{R}^{j}$ and $n_{T}^{j}$, which are the numbers of RBSs and TBSs, respectively, that allow the AV to achieve the desired data rate within a predefined radius of its current position. Accordingly, the aggregated state space $\mathcal{S}$ at any time step $t$ is given by:

\mathcal{S}=\begin{bmatrix}x_{1}&y_{1}&v_{1}&\psi_{1}&n_{R}^{1}&n_{T}^{1}\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ x_{M_{1}}&y_{M_{1}}&v_{M_{1}}&\psi_{M_{1}}&n_{R}^{M_{1}}&n_{T}^{M_{1}}\end{bmatrix}

III-B2 Two Dimensional Action Space

The action space consists of self-driving action space $\mathcal{A}_{\mathrm{tran}}$ and telecommunication action space $\mathcal{A}_{\mathrm{tele}}$ , which include 5 and 3 discrete actions, respectively. For each time step $t$ , the AV must select driving-related action and telecommunication-related action from action space, as shown below:

\mathcal{A}=\begin{bmatrix}\{a_{\rm tele}^{1},a_{\rm tran}^{1}\}&\{a_{\rm tele}^{1},a_{\rm tran}^{2}\}&\dots&\{a_{\rm tele}^{1},a_{\rm tran}^{5}\}\\ \vdots&\vdots&\vdots&\vdots\\ \{a_{\rm tele}^{3},a_{\rm tran}^{1}\}&\{a_{\rm tele}^{3},a_{\rm tran}^{2}\}&\dots&\{a_{\rm tele}^{3},a_{\rm tran}^{5}\}\end{bmatrix}

Note that $\mathcal{A}_{\rm tran}=\{a_{\rm tran}^{1},\ldots,a_{\rm tran}^{5}\}$, where $a_{\rm tran}^{1},a_{\rm tran}^{2}$ and $a_{\rm tran}^{3}$ indicate that the AV changes its lane to the left, maintains the same lane, and changes its lane to the right, respectively. $a_{\rm tran}^{4}$ and $a_{\rm tran}^{5}$ indicate the acceleration and deceleration of the AV within the same lane. The acceleration and deceleration rates are dynamically determined by the model in Section II-B. Hence, AVs that select the same action do not necessarily perform identical accelerations or decelerations. The communication action space is represented as $\mathcal{A}_{\rm tele}=\{a_{\rm tele}^{1},a_{\rm tele}^{2},a_{\rm tele}^{3}\}$. $a_{\rm tele}^{1}$ indicates that the AV selects a BS by maximizing the weighted data rate metric (defined by equation (6) in Section II-A), which encourages traffic load balancing between BSs and discourages unnecessary HOs, especially for TBSs. In $a_{\rm tele}^{2},$ the AV selects the BS with the maximum $\text{WR}_{{ij}}$ computed with $\mu=0$, if $Q_{i}\geq n_{i}$. Otherwise, the AV recursively selects the next vacant best-performing BS in terms of $\text{WR}_{ij}$. In $a_{\rm tele}^{3}$, the AV connects to the BS with the maximum data rate $R_{ij}$.
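Since a DQN-style agent typically outputs a single discrete index, the two-dimensional action space can be flattened into 15 joint actions. The sketch below shows one possible encoding. The action labels and the row-major index convention are assumptions chosen for illustration.

```python
from itertools import product

# Hypothetical labels for a_tran^1..a_tran^5 and a_tele^1..a_tele^3.
TRAN_ACTIONS = ("lane_left", "keep_lane", "lane_right", "accelerate", "decelerate")
TELE_ACTIONS = ("max_weighted_rate", "max_weighted_rate_mu0", "max_rate")

# Row-major layout of the action matrix: one row per telecommunication action.
JOINT_ACTIONS = list(product(TELE_ACTIONS, TRAN_ACTIONS))


def encode_action(tele, tran):
    """Map a (tele, tran) pair to a flat index in [0, 14]."""
    return TELE_ACTIONS.index(tele) * len(TRAN_ACTIONS) + TRAN_ACTIONS.index(tran)


def decode_action(index):
    """Map a flat index back to its (tele, tran) pair."""
    return JOINT_ACTIONS[index]
```

With this layout, index 0 corresponds to $\{a_{\rm tele}^{1},a_{\rm tran}^{1}\}$ and index 14 to $\{a_{\rm tele}^{3},a_{\rm tran}^{5}\}$.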

III-B3 Rewards

The design of the associated reward is directly related to optimizing the driving policy and network selection, and is critical for accelerating the convergence of the model. Generally, the AV is given a positive reward when it receives a higher HO-aware data rate while guaranteeing safe driving. By taking any other actions that may lead to an increase in HOs, collisions, or traffic violations, the AV receives a penalty. We define the transportation reward as [16]:

r^{\mathrm{j,tran}}_{t}=c_{1}\left(\frac{v^{j}_{t}-v_{\mathrm{min}}}{v_{\mathrm{max}}-v_{\mathrm{min}}}\right)-c_{2}\cdot\delta_{2}+c_{3}\cdot\delta_{3}+c_{4}\cdot\delta_{4},

(20)

where $v_{t}^{j}$ , $v_{\min}$ and $v_{\max}$ are the current longitudinal velocity of AV $j$ at time $t$ and the minimum and maximum speed limits, respectively, and $\delta_{2},\delta_{3},\delta_{4}$ are the collision indicator, the right-lane indicator, and the on-road indicator, respectively. $c_{1}$ and $c_{2}$ are the weights that balance the AV transportation reward against its collision penalty. $c_{1}$ is the reward received when driving at full speed, linearly mapped to zero for lower speeds. $c_{3}$ is the reward for driving in the right-most lane, linearly mapped to zero for other lanes. $c_{4}$ is the on-road reward factor, which penalizes the AV for driving off the highway. It is important to note that negative rewards are not allowed, since they might encourage the agent to end an episode early by causing a collision rather than risk accumulating a negative return when no satisfactory trajectory is available.

For the telecommunication side, we define the reward for AV $j$ associated with BS $i^{*}$ at time step $t$ as:

r^{\mathrm{j,tele}}_{t}=c_{5}\text{WR}_{i^{*},j,t}\left(1-\min(1,\xi^{j}_{t})\right),

(21)

where $\text{WR}_{i^{*},j,t}$ is the achievable data rate computed by (6) and $\xi^{j}_{t}$ is the HO probability of AV $j$ , computed by dividing the number of HOs counted up to the current time $t$ by the duration of the previous time slots in the episode. Note that $c_{1},\dots,c_{5}$ are the weights that set the priority of each term. For instance, $c_{2}$ needs to be sufficiently large compared to the other coefficients to enforce collision avoidance, so that the highest penalty applies to vehicle collisions.
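The two instantaneous rewards in (20) and (21) can be sketched as plain functions. This is an illustration under stated assumptions rather than the authors' implementation: the function and argument names are hypothetical, the lane term is passed as a score in $[0,1]$, and the default coefficients are the values reported for the single-policy experiments in Section V-A.

```python
def transportation_reward(v, v_min, v_max, collided, right_lane_score, on_road,
                          c1=0.4, c2=1.0, c3=0.1, c4=0.2):
    """Eq. (20): speed term minus collision penalty plus lane and on-road terms."""
    speed_term = (v - v_min) / (v_max - v_min)  # 0 at v_min, 1 at v_max
    return (c1 * speed_term
            - c2 * float(collided)              # delta_2: collision indicator
            + c3 * float(right_lane_score)      # delta_3: 1 on the right-most lane
            + c4 * float(on_road))              # delta_4: on-road indicator


def telecom_reward(weighted_rate, num_handoffs, elapsed_time, c5=4.5e-7):
    """Eq. (21): weighted rate scaled down by the clipped handoff probability."""
    ho_prob = num_handoffs / elapsed_time if elapsed_time > 0 else 0.0
    return c5 * weighted_rate * (1.0 - min(1.0, ho_prob))
```

Because the handoff probability is clipped at 1, the telecommunication reward drops to zero once the AV has performed at least one HO per time slot on average, however high the data rate is.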

Based on the instantaneous rewards, we compute the accumulated reward, which is the sum of the discounted rewards over all target AVs on the highway in each training episode. The expected return is defined as follows:

\mathbf{Q}_{\pi}(s,a,\boldsymbol{\omega})=\mathbb{E}_{\pi}\left[\sum_{j=1}^{M_{1}}r^{\mathrm{j,tran}}_{t},\sum_{j=1}^{M_{1}}r^{\mathrm{j,tele}}_{t}\right]

(22)

Our MOMDP optimal strategy for maximizing the expected reward involves the simultaneous maximization of both transportation and telecommunications objectives, i.e., $\max_{\pi}{\mathbf{Q}_{\pi}(s,a,\boldsymbol{\omega})}$ .

IV Proposed Single-Policy and Multi-Policy MORL Algorithms

In contrast to conventional DRL, MORL requires the agent to optimize multiple objectives simultaneously. These objectives might have predefined preferences, or the preferences could be unknown. In this section, we first investigate single-policy solutions to the MORL problem with predefined preferences in Section IV-A. Subsequently, a multi-policy envelope solution for MORL is proposed in Section IV-B for cases where the preferences are unknown.

IV-A Single-Policy MORL Algorithms

Given a set of preferences in MORL problems, single policy algorithms aim to scalarize the reward value to determine the best policy, considering the relative priorities assigned to competing objectives. We explore two DRL methods: DQN [1] and DDQN [25] for MORL, each of which employs a neural network parameterized by $\boldsymbol{\theta}_{t}$ (in each time step $t$ ) to approximate the $Q$ -value function for a state-action pair, i.e., $Q(s_{t},a_{t};\boldsymbol{\theta}_{t})$ . After taking action $a_{t}$ in state $s_{t}$ and receiving instant reward $r_{t+1}$ , we can formulate a target $Q$ function ${\hat{Q}}(s_{t},a_{t})$ as in Eq. 16, which is used to optimize the neural network $\boldsymbol{\theta}_{t}$ using gradient descent, as given in [25],

\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}+\kappa\left(\hat{Q}(s_{t},a_{t})-Q(s_{t},a_{t};\boldsymbol{\theta}_{t})\right)\nabla_{\boldsymbol{\theta}_{t}}Q(s_{t},a_{t};\boldsymbol{\theta}_{t}),

(23)

where $\kappa$ is a positive scalar representing the learning rate. To learn a single policy for multiple tasks, we scalarize the reward vector by applying the predefined priority of each objective function [1], where a weighted reward function $r_{t}=\sum_{j=1}^{M_{1}}r^{\mathrm{j,tran}}_{t}+\sum_{j=1}^{M_{1}}r^{\mathrm{j,tele}}_{t}$ is defined to convert the multi-dimensional rewards into a scalar value.

IV-A1 Multi-Objective Deep Q-Network (MO-DQN)

The proposed MO-DQN method incorporates a target Q-network and experience replay to stabilize the learning process and ensure convergence, as discussed in the following:

•

Target Network: A second neural network, parameterized by $\boldsymbol{\theta}_{t}^{-}$ , is introduced to compute the target $Q$ value at each time step $t$ . It has the same architecture as the network parameterized by $\boldsymbol{\theta}_{t}$ , but its parameters are frozen. Specifically, $\boldsymbol{\theta}_{t}^{-}$ copies the parameters from $\boldsymbol{\theta}_{t}$ only every $N^{-}$ steps and remains fixed until the next scheduled update [26]. The target value for the MO-DQN is defined as:

{\hat{Q}}(s_{t},a_{t})=r_{t+1}+\gamma\max_{a}Q(s_{t+1},a;\boldsymbol{\theta}_{t}^{-})

(24)

•

Experience Replay: To address issues related to correlations between sequential observations and to improve data efficiency, MO-DQN utilizes the experience replay mechanism, which stores past transition tuples $(s_{z},a_{z},s_{z+1},r_{z})$ in a replay buffer $\mathcal{D}_{\mathcal{T}}$ with size $N_{\mathcal{T}}$ , i.e., $z\in\{1,\dots,N_{\mathcal{T}}\}$ . During the training phase, mini-batches of these transitions are randomly sampled from the buffer. This method not only reduces the variance of each update but also allows the neural network to benefit from learning across a diverse range of past experiences, thus avoiding local optima and overfitting [27, 28].
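The experience replay mechanism described above can be sketched with a fixed-capacity buffer and uniform sampling. The class and method names below are hypothetical, and the stored tuple carries an extra `done` flag for terminal transitions, which is an assumption added for convenience.

```python
import random
from collections import deque
from typing import Optional


class ReplayBuffer:
    """Fixed-capacity FIFO store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity: int, seed: Optional[int] = None):
        self._storage = deque(maxlen=capacity)  # oldest transitions are evicted first
        self._rng = random.Random(seed)

    def __len__(self) -> int:
        return len(self._storage)

    def add(self, state, action, reward, next_state, done: bool) -> None:
        self._storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size: int) -> list:
        """Uniform sampling without replacement breaks temporal correlation."""
        batch_size = min(batch_size, len(self._storage))
        return self._rng.sample(list(self._storage), batch_size)
```

Sampling uniformly from the whole buffer, rather than replaying the most recent steps, is what decorrelates consecutive updates.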

IV-A2 Multi-Objective Double DQN (MO-DDQN)

Unlike MO-DQN, where the same weights are used both to select and to evaluate actions in the target, MO-DDQN uses a separate set of parameters $\boldsymbol{\theta}_{t}^{\prime}$ to evaluate the value of the selected action, which yields a more reliable estimate by decoupling the selection and evaluation of actions. Given $\boldsymbol{\theta}_{t}$ and $\boldsymbol{\theta}_{t}^{\prime}$ corresponding to the evaluation and target $Q$ networks, respectively, the target value function in MO-DDQN [25] is updated as follows:

{\hat{Q}}(s_{t},a_{t})=r_{t}+\gamma Q\left(s_{t+1},\arg\max_{a}{Q}(s_{t+1},a;\boldsymbol{\theta}_{t});\boldsymbol{\theta}_{t}^{\prime}\right)

(25)

where the action selection is guided by the online weights $\boldsymbol{\theta}_{t}$ .
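To make the difference between the two targets concrete, the following sketch computes both for a single transition, given the next-state $Q$ values as plain lists. The function names and the `done` flag are illustrative assumptions; the DQN variant bootstraps from the largest target-network value, while the DDQN variant lets the online network pick the action and the target network score it.

```python
def dqn_target(reward, next_q_target, gamma, done=False):
    """DQN-style target: bootstrap from the largest target-network value."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def ddqn_target(reward, next_q_online, next_q_target, gamma, done=False):
    """DDQN-style target: online network selects, target network evaluates."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]
```

When the two networks disagree about the best next action, the DDQN target is never larger than the DQN target computed from the same target-network values, which is the mechanism that reduces overestimation.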

In MO-DQN and MO-DDQN, the neural network parameterized by $\boldsymbol{\theta}_{t}$ associated with evaluation function is updated by minimizing the mean square error loss $\mathcal{L}(\boldsymbol{\theta})$ between $Q$ and $\hat{Q}$ as follows [26]:

\mathcal{L}(\boldsymbol{\theta}_{t})=\mathbb{E}_{z\in\mathcal{D}_{\mathcal{T}}}\left(Q(s_{z},a_{z};{\boldsymbol{\theta}_{t}})-\hat{Q}(s_{z},a_{z})\right)^{2}

(26)

The proposed MO-DQN and MO-DDQN are illustrated in Fig. 2, and their training procedure is given in Algorithm 1.

Although single-policy methods are adequate when we possess prior knowledge of task preferences, the acquired policy is constrained in its adaptability to situations with varying preferences. For instance, collision avoidance may not remain a high priority in highway environments with reduced vehicle density. Also, in traffic jams or parking lots where AVs are still, the preference for telecommunication rewards becomes higher. In the next section, we seek to design the multi-policy algorithm that handles unknown preferences in multi-objective RL scenarios.

Algorithm 1: Multi-Objective Double Deep Q-Learning

Result: Learned action-value function $Q_{\boldsymbol{\theta}}$ and policy $\pi$

Data: Evaluation $Q$ -network $Q$ with weights ${\boldsymbol{\theta}}$ , target $Q$ -network $\hat{Q}$ with weights ${\boldsymbol{\theta}^{\prime}}$ (for MO-DDQN only), experience replay memory $\mathcal{D}_{\mathcal{T}}$ , mini-batch size $N_{\mathcal{T}}$ , horizon limit of each episode $T_{hl}$

Initialization:
  Experience replay memory $\mathcal{D}_{\mathcal{T}}\leftarrow\emptyset$
  Initialize $Q$ -network weights $\boldsymbol{\theta}$ randomly
  For MO-DDQN: initialize target network weights $\boldsymbol{\theta}^{\prime}\leftarrow\boldsymbol{\theta}$
  Initialize $Q(s,a)$ for all states $s$ and actions $a$ , including AVs, TBSs, and RBSs

while episode $<$ episode limit and runtime $<$ time limit do
  Initialize $t\leftarrow 0$ and state $s_{t}$ based on environment
  while $t\leq T_{hl}$ do
    RL agent selects $a_{t}$ randomly from $\mathcal{A}$ with probability $\epsilon$ , or selects $a_{t}=\arg\max_{a\in\mathcal{A}}{Q(s_{t},a;\boldsymbol{\theta})}$ with probability $1-\epsilon$
    Derive $a^{\mathrm{tran}}_{t}$ and $a^{\mathrm{tele}}_{t}$ from $a_{t}$
    Apply $a^{\mathrm{tran}}_{t}$ and $a^{\mathrm{tele}}_{t}$ , observe reward $r_{t}$ and next state $s_{t+1}$
    Store transition $(s_{t},a_{t},s_{t+1},r_{t})$ in $\mathcal{D}_{\mathcal{T}}$
    Experience Replay: Sample a mini-batch of transitions $(s_{z},a_{z},r_{z},s_{z+1})$ from $\mathcal{D}_{\mathcal{T}}$ , where $z\in\{1,\ldots,N_{\mathcal{T}}\}$
    Set target- $Q$ for each sampled transition:
    for each transition $z$ do
      if episode ends at step $z+1$ then
        $\hat{Q}(s_{z},a_{z})=r_{z}$
      else
        Use $\hat{Q}$ to compute $\hat{Q}(s_{z},a_{z})$ according to the MO-DQN or MO-DDQN update in (24) or (25)
      end if
    end for
    Perform a gradient descent step on (26) with respect to network parameters $\boldsymbol{\theta}$
    if MO-DDQN then
      Update target $\hat{Q}$ weights $\boldsymbol{\theta}^{\prime}\leftarrow\boldsymbol{\theta}$ every $N^{-}$ steps
    end if
    $t\leftarrow t+1$
  end while
  Update policy $\pi$ based on learned $Q$
end while
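The action-selection step of Algorithm 1 follows the usual $\epsilon$-greedy rule. A minimal sketch is given below; the function name and the optional `rng` argument are hypothetical, and the $Q$ values are assumed to be available as a list indexed by the joint action.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))  # explore uniformly over the action space
    return max(range(len(q_values)), key=q_values.__getitem__)  # exploit
```

Setting `epsilon` to zero recovers the purely greedy policy used at evaluation time.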

IV-B Multi-Policy Envelope MORL Algorithm

In contrast to single-policy methods, multi-policy MORL methods optimize different objectives simultaneously by maximizing a vector of rewards associated with these objectives. Our proposed MORL framework reduces reliance on predefined preferences and scalar reward combinations, enabling dynamic adjustment to associated tasks featuring distinct preferences. This approach is effective in identifying Pareto-optimal policies when preferences are unknown.

Our proposed MO-DDQN-envelope algorithm is designed to learn a spectrum of Pareto optimal policies simultaneously (as defined in Section III-A2) in a preference space ${\Omega}$ , as illustrated in Fig. 2. Different from the Envelope-MOQ model in [24], the proposed MO-DDQN envelope algorithm incorporates DDQN instead of using the original REINFORCE algorithm to improve convergence and sample training efficiency.

During each time step, an observation is captured from the RF-THz-Highway environment, from which the tuple $\{s_{t},s_{t+1},\boldsymbol{\omega}_{t}\}$ is computed. After the state information is acquired, the hindsight experience replay (HER) technique is employed to sample preference weights from the preference pool $\mathcal{D}_{\omega}$ . Then, homotopy optimization is applied to execute gradient descent on the loss in (32). Subsequently, the evaluation $Q$ network is cloned to the target network every $N^{-}$ steps. Notably, unlike prior single-policy MORL approaches that scalarize rewards before the experience replay, MO-DDQN-envelope scalarizes rewards after gradient descent. We elaborate on the Bellman operator update phase, the HER phase, and the homotopy optimization phase in what follows.

IV-B1 Bellman Operation with Optimal Filter

In the context described in Section III-A and following [24], the expected discounted return under a policy $\pi$ is defined as $\mathbf{Q}_{\pi}(s,a,\boldsymbol{\omega})=\mathbb{E}_{\pi}\left[\mathbf{r}(s_{t},a_{t})+\gamma\mathbf{Q}_{\pi}(s_{t+1},a_{t+1},\boldsymbol{\omega})\right]$ . Yang et al. [24] further introduce the concept of an optimal filter $\mathcal{H}$ , which is applied to $\mathbf{Q}_{\pi}(s,a,\boldsymbol{\omega})$ to obtain $(\mathcal{H}\mathbf{Q})_{\pi}(s,a,\boldsymbol{\omega})=\arg_{Q}\sup_{a\in\mathcal{A},\boldsymbol{\omega}^{\prime}\in\Omega}\boldsymbol{\omega}^{T}\mathbf{Q}_{\pi}(s,a,\boldsymbol{\omega}^{\prime})$ . The optimal filter $\mathcal{H}$ is instrumental in solving the convex envelope of PPF, which represents the current solution frontier; this process is key in optimizing the Q-function $\mathbf{Q}_{\pi}$ for a given state $s$ and preference weights $\boldsymbol{\omega}$ . Here, $\boldsymbol{\omega}$ is optimized through a process that balances the preference between the objectives, i.e., transportation and telecommunication. The $\arg_{Q}$ operator returns the multi-objective value vector attained by the pair $(a,\boldsymbol{\omega}^{\prime})$ that achieves the supremum over actions in $\mathcal{A}$ and preferences $\boldsymbol{\omega}^{\prime}$ in $\Omega$ . Consequently, we can streamline (17) so that the optimization over actions depends solely on $\mathcal{H}$ . The MO optimality operator can thus be defined as:

\mathbf{Q}(s,a,\boldsymbol{\omega})=\mathbb{E}_{s_{t+1}}\left[\mathbf{r}(s_{t},a_{t})+\gamma(\mathcal{H}\mathbf{Q})(s_{t+1},\boldsymbol{\omega})\right]

(27)

Based on this Bellman operator and the optimal filter $\mathcal{H}$ , we maintain the convex envelope $\sup_{\boldsymbol{\omega}^{\prime}}\boldsymbol{\omega}^{T}\mathbf{Q}_{\pi}(s,a,\boldsymbol{\omega}^{\prime})$ , which corresponds to the optimal MO rewards under new preference weights. These rewards may not have been optimized by the preference weights encountered earlier in training. In contrast, the single-policy scalarized updates discussed in Section IV-A fail to optimize the scalar utility for varying $\boldsymbol{\omega}$ , because they cannot leverage information from $\max_{a}\mathbf{Q}_{\pi}(s,a,\boldsymbol{\omega})$ .
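A small sketch may help to see what the optimal filter does in practice. It assumes that the search over $\Omega$ is restricted to a finite set of sampled preferences and that the network outputs are available as nested lists; the function names and the data layout are hypothetical and chosen only for illustration.

```python
def envelope_filter(omega, q_vectors):
    """Pick the vector Q(s, a, w') that maximizes omega^T Q over all (a, w') pairs.

    q_vectors[g][a] is the multi-objective Q-vector predicted for action a
    under the g-th sampled preference w'_g.
    """
    best_vector, best_utility = None, float("-inf")
    for per_preference in q_vectors:
        for q_vec in per_preference:
            utility = sum(w * q for w, q in zip(omega, q_vec))
            if utility > best_utility:
                best_vector, best_utility = q_vec, utility
    return best_vector, best_utility


def envelope_target(reward_vec, omega, next_q_vectors, gamma, done=False):
    """Vector target r + gamma * (H Q)(s', omega); reduces to r at terminal states."""
    if done:
        return list(reward_vec)
    best_vector, _ = envelope_filter(omega, next_q_vectors)
    return [r + gamma * q for r, q in zip(reward_vec, best_vector)]
```

The filter returns a vector rather than a scalar, so the target keeps one component per objective; the preference only decides which candidate vector is selected.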

IV-B2 Hindsight Experience Replay (HER)

HER is a method for training an RL agent to handle multiple preferences across multiple objectives [24, 29]. The RL agent follows a policy based on a randomly selected goal in each episode and uses the resulting trajectory to update other goals simultaneously.

In our enhanced MO-DDQN-envelope network, HER samples from two distinct replay pools, $\mathcal{D}_{\mathcal{T}}$ and $\mathcal{D}_{\omega}$ , which hold transitions and preference vectors, respectively. We draw a mini-batch of $N_{\mathcal{T}}$ transitions $(s_{z},a_{z},\mathbf{r}_{z},s_{z+1})\sim\mathcal{D}_{\mathcal{T}}$ , where $z\in[1,N_{\mathcal{T}}]$ . Concurrently, we sample preference vectors $\omega_{g}$ from $\mathcal{D}_{\omega}$ to form the set $\mathcal{W}\equiv\{\omega_{g}\sim\mathcal{D}_{\omega}\}$ , with $g\in[1,N_{\omega}]$ , where $N_{\omega}$ is the number of preference weights in $\mathcal{W}$ . The agent AV can therefore replay trajectories under any preference in "hindsight", since preferences only affect the AV's actions rather than the highway environment dynamics [24].
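The pairing of transitions with preferences can be sketched as follows. The preference distribution is an assumption here: vectors are drawn uniformly from the probability simplex by normalizing exponential draws, whereas the paper only states that they are sampled from $\mathcal{D}_{\omega}$. Function names are hypothetical.

```python
import random


def sample_preferences(num_prefs, num_objectives=2, rng=random):
    """Draw preference vectors uniformly from the probability simplex (assumed prior)."""
    prefs = []
    for _ in range(num_prefs):
        draws = [rng.expovariate(1.0) for _ in range(num_objectives)]
        total = sum(draws)
        prefs.append([d / total for d in draws])  # non-negative, sums to one
    return prefs


def hindsight_pairs(transitions, preferences):
    """Pair every sampled transition with every sampled preference (N_T x N_w items)."""
    return [(transition, omega) for omega in preferences for transition in transitions]
```

Because the environment dynamics do not depend on the preference, each stored transition remains valid under every sampled preference, which is why the cross product can be formed without collecting new data.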

IV-B3 Homotopy Optimization

Our goal is to learn a single model that adapts to the entire Pareto frontier space $\Omega$ . By sampling $N_{\mathcal{T}}$ transitions $(s_{z},a_{z},\mathbf{r}_{z},s_{z+1})$ and $N_{\omega}$ preference weights $\mathcal{W}=\{\omega_{g}\sim\mathcal{D}_{\omega}\}$ from the respective replay buffers $\mathcal{D}_{\mathcal{T}}$ and $\mathcal{D}_{\omega}$ , we define the MO-DDQN-envelope element-wise target function [24] as follows:

\hat{\mathbf{Q}}(s_{z},a_{z},\mathbf{r}_{z},s_{z+1},\omega_{g})=\mathbf{r}_{z}+\gamma\arg_{Q}\max_{a^{\prime}\in\mathcal{A},\boldsymbol{\omega}^{\prime}\in\mathcal{W}}[{\omega}_{g}]^{T}\mathbf{Q}(s_{z+1},a^{\prime},\boldsymbol{\omega}^{\prime})

(28)

for all $z\in[1,N_{\mathcal{T}}]$ and $g\in[1,N_{\omega}]$ . Finding the optimal preference weight $\boldsymbol{\omega}^{\prime}$ in $\Omega$ can be an NP-hard problem due to the size and complexity of $\Omega$ . Instead, finding the optimal preference $\boldsymbol{\omega}^{\prime}$ in the sampled set $\mathcal{W}$ is feasible. By replaying the sampled transitions $(s_{t},a_{t},\mathbf{r}_{t},s_{t+1})$ across the $N_{\mathcal{T}}$ transitions, we obtain the empirical estimate of the target function over the new state $s_{t+1}$ as:

\hat{\mathbf{Q}}(s_{t},a_{t},\boldsymbol{\omega}_{t};\boldsymbol{\theta}_{t}^{\prime})=\mathbb{E}_{s_{t+1}}[\mathbf{r}_{t}+\gamma\arg_{Q}\max_{a_{t},\boldsymbol{\omega}_{t}^{\prime}}\boldsymbol{\omega}^{T}\mathbf{Q}(s_{t+1},a_{t},\boldsymbol{\omega}_{t}^{\prime};\boldsymbol{\theta}_{t}^{\prime})]

(29)

where $\mathbf{Q}(\cdot)$ follows (27). To ensure correct training, the target value $\hat{\mathbf{Q}}$ should be as close as possible to the actual value $\mathbf{Q}$ . Accordingly, the loss function $\mathcal{L}^{A}(\boldsymbol{\theta}_{t})$ at each time step $t$ is defined as:

\mathcal{L}^{A}(\boldsymbol{\theta}_{t})=\mathbb{E}_{s_{t},a_{t},\omega_{t}}\Bigl[\|\hat{\mathbf{Q}}(s_{t},a_{t},\omega_{t};\boldsymbol{\theta}_{t}^{\prime})-\mathbf{Q}(s_{t},a_{t},\omega_{t};\boldsymbol{\theta}_{t})\|^{2}_{2}\Bigr]

(30)

Since $\mathcal{L}^{A}(\boldsymbol{\theta}_{t})$ contains many local maxima and minima, minimizing the mean square error (MSE) directly is difficult, which makes $Q_{\theta}$ hard to optimize. To smooth the landscape of the loss function $\mathcal{L}^{A}(\boldsymbol{\theta}_{t})$ , we introduce the auxiliary loss function $\mathcal{L}^{B}(\boldsymbol{\theta}_{t})$ as:

\mathcal{L}^{B}(\boldsymbol{\theta}_{t})=\mathbb{E}_{s_{t},a_{t},\omega_{t}}\Bigl[|\omega_{t}^{T}\hat{\mathbf{Q}}(s_{t},a_{t},\omega_{t};\boldsymbol{\theta}_{t}^{\prime})-\omega_{t}^{T}\mathbf{Q}(s_{t},a_{t},\omega_{t};\boldsymbol{\theta}_{t})|\Bigr]

(31)

$\mathcal{L}^{B}(\boldsymbol{\theta}_{t})$ promotes smooth policy adaptation, which improves training efficiency and produces fewer spikes. While $\mathcal{L}^{B}(\boldsymbol{\theta}_{t})$ is advantageous for speeding up agent training, it is not as good for accurate approximation as $\mathcal{L}^{A}(\boldsymbol{\theta}_{t})$ [24]. Both $\mathcal{L}^{A}(\boldsymbol{\theta}_{t})$ and $\mathcal{L}^{B}(\boldsymbol{\theta}_{t})$ are averaged over $\omega_{t}$ , which reflects the preference sampling in the proposed algorithm. However, the specific weight $\omega_{t}$ used in past training is not directly applied to the target state-action transitions. The proposed MO-DDQN-envelope can therefore re-evaluate past transitions in $\mathcal{D}_{\mathcal{T}}$ under new preferences to enhance learning efficiency and sample utilization.
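The contrast between the two losses can be illustrated numerically. The sketch below evaluates both on a small batch represented as nested lists; the function names are hypothetical and the expectation is replaced by a batch mean, which is an assumption about how the losses would be estimated in practice.

```python
def loss_a(targets, estimates):
    """L^A-style loss: mean squared L2 distance between vector Q-values."""
    total = 0.0
    for q_hat, q in zip(targets, estimates):
        total += sum((t - e) ** 2 for t, e in zip(q_hat, q))
    return total / len(targets)


def loss_b(targets, estimates, preferences):
    """L^B-style loss: mean absolute gap between scalarized utilities."""
    total = 0.0
    for q_hat, q, omega in zip(targets, estimates, preferences):
        u_hat = sum(w * t for w, t in zip(omega, q_hat))
        u = sum(w * e for w, e in zip(omega, q))
        total += abs(u_hat - u)
    return total / len(targets)
```

For a target of $[0,0]$ and an estimate of $[1,-1]$ under equal weights, the scalarized gap is zero even though the vectors differ, which is one way to see why the auxiliary loss alone is less suited to accurate approximation of the vector-valued $\mathbf{Q}$.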

Combining (30) and (31), we generate loss function

\mathcal{L}(\boldsymbol{\theta}_{t})=(1-\lambda_{t})\mathcal{L}^{A}(\boldsymbol{\theta}_{t})+\lambda_{t}\mathcal{L}^{B}(\boldsymbol{\theta}_{t})

(32)

In homotopy optimization, $\lambda_{t}$ gradually increases from 0 to 1 along the balance weight path $p_{\lambda}$ , which adjusts the balance between $\mathcal{L}^{A}(\boldsymbol{\theta}_{t})$ and $\mathcal{L}^{B}(\boldsymbol{\theta}_{t})$ . The loss function $\mathcal{L}(\boldsymbol{\theta}_{t})$ shifts smoothly from $\mathcal{L}^{A}(\boldsymbol{\theta}_{t})$ to $\mathcal{L}^{B}(\boldsymbol{\theta}_{t})$ , so that training first prioritizes an accurate approximation of $\mathbf{Q}$ and then benefits from the smoother utility landscape provided by the auxiliary loss. We first reduce the discrepancy between the target and estimated $\mathbf{Q}$ values, $\hat{\mathbf{Q}}(s_{t},a_{t},\omega_{t};\boldsymbol{\theta}_{t}^{\prime})-\mathbf{Q}(s_{t},a_{t},\omega_{t};\boldsymbol{\theta}_{t})$ , and then follow the gradient $\nabla_{\boldsymbol{\theta}_{t}}$ of the estimated $\mathbf{Q}$ value to adjust the parameters in the direction that reduces the MSE. Consequently, the parameters of MO-DDQN-envelope are updated as

\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}+\mathbb{E}_{s_{t},a_{t},s_{t+1}}[(\hat{\mathbf{Q}}(s_{t},a_{t},\omega_{t};\boldsymbol{\theta}_{t}^{\prime})-\mathbf{Q}(s_{t},a_{t},\omega_{t};\boldsymbol{\theta}_{t}))^{T}\nabla_{\boldsymbol{\theta}_{t}}\mathbf{Q}(s_{t},a_{t},\omega_{t};\boldsymbol{\theta}_{t})]

(33)

We reset the target $Q$ -network with the evaluation $Q$ -network every $N^{-}$ steps, i.e., $\theta^{\prime}\leftarrow\theta$ . The training algorithm of the proposed MO-DDQN-envelope is shown in Algorithm 2.
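The homotopy schedule can be sketched in a few lines. The paper does not specify the shape of the balance weight path $p_{\lambda}$ in this section, so the linear ramp below is purely an assumption, as are the function names.

```python
def homotopy_lambda(step, total_steps):
    """Linear balance-weight path from 0 to 1 (an assumed form of p_lambda)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    return min(1.0, max(0.0, step / total_steps))


def homotopy_loss(loss_a_value, loss_b_value, lam):
    """Convex combination that shifts weight from L^A to L^B as lam grows."""
    return (1.0 - lam) * loss_a_value + lam * loss_b_value
```

At the start of training the combined loss equals $\mathcal{L}^{A}$ , and once the path reaches 1 it equals $\mathcal{L}^{B}$ ; any monotone path between the two endpoints would fit the same description.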

Algorithm 2: Multi-Objective Envelope DDQN

Result: Learned action-value function $\mathbf{Q}_{\theta}$ and policy $\pi$

Data: Evaluation $Q$ -network $\mathbf{Q}_{\theta}$ , target $Q$ -network $\mathbf{Q}_{\theta^{\prime}}$ , preference sampling pool $\mathcal{D}_{\omega}$ , HER transition sampling pool $\mathcal{D}_{\mathcal{T}}$ , balance weight path $p_{\lambda}$

Initialization:
  HER replay buffer $\mathcal{D}_{\mathcal{T}}\leftarrow\emptyset$
  Initialize Q-network weights $\theta$ randomly
  Initialize target Q-network weights $\theta^{\prime}\leftarrow\theta$
  Initialize $Q(s,a)$ for all states $s$ and actions $a$ , including AVs, TBSs, RBSs

while episode $<$ episode limit and runtime $<$ time limit do
  Initialize $t\leftarrow 0$ and state $s_{t}$ based on environment
  while $t\leq T_{hl}$ do
    for target AV $j$ from 1 to $M_{1}$ do
      Select $a_{t}$ randomly from $\mathcal{A}$ with probability $\epsilon$ , or select $a_{t}=\arg\max_{a\in\mathcal{A}}{\boldsymbol{\omega}^{T}}\mathbf{Q}(s_{t},a,\boldsymbol{\omega};\boldsymbol{\theta}_{t})$ with probability $1-\epsilon$ ;
      Derive $a^{\mathrm{tran}}_{t}$ and $a^{\mathrm{tele}}_{t}$ from $a_{t}$ ;
      Apply $a^{\mathrm{tran}}_{t}$ and $a^{\mathrm{tele}}_{t}$ to target AV $j$ ;
      Observe vector reward $\mathbf{r}_{t}$ and next state $s_{t+1}$ ;
    end for
    if update neural network then
      Store $(s_{t},a_{t},\mathbf{r}_{t},s_{t+1})$ in $\mathcal{D}_{\mathcal{T}}$ ;
      Hindsight Experience Replay (HER): Sample transitions $\{(s_{z},a_{z},\mathbf{r}_{z},s_{z+1})\sim\mathcal{D}_{\mathcal{T}}\}$ ;
      Sample $N_{\omega}$ preferences $\mathcal{W}=\{\omega_{g}\sim\mathcal{D}_{\omega}\}$ ;
      Bellman Update: Compute $\hat{\mathbf{Q}}(s_{z},a_{z},\mathbf{r}_{z},s_{z+1},\omega_{g})$ for each sampled transition and preference:
        $\hat{\mathbf{Q}}(s_{z},a_{z},\mathbf{r}_{z},s_{z+1},\omega_{g})=\begin{cases}\mathbf{r}_{z},&\text{if }s_{z+1}\text{ is terminal}\\ \text{the target in (28)},&\text{otherwise}\end{cases}$ , $\forall z\in[1,N_{\mathcal{T}}]$ and $\forall g\in[1,N_{\omega}]$
      Homotopy Optimization: Update $\mathbf{Q}_{\theta}$ by minimizing the loss with gradient descent by (32);
      Gradually increase $\lambda$ following the path $p_{\lambda}$ ;
    end if
    Update target $\mathbf{Q}_{\theta^{\prime}}$ weights $\theta^{\prime}\leftarrow\theta$ by (33) every $N^{-}$ steps;
    $t\leftarrow t+1$
  end while
  Compute policy $\pi$ based on learned $\mathbf{Q}_{\theta}$ ;
end while

IV-C Complexity Analysis

We first analyze the main loop of Algorithm 2. The key components contributing to the time complexity include:

•

Action Selection: We consider the horizon limit for each episode, denoted as $T_{\text{hl}}$ . The process of selecting an action at each timestep incurs a time complexity of $\mathcal{O}(|\mathcal{A}|)$ for each target AV, where $|\mathcal{A}|$ represents the size of the action space. Thus, the time complexity is $\mathcal{O}(M_{1}\cdot T_{\text{hl}}\cdot|\mathcal{A}|)$ .
•

Hindsight Experience Replay (HER): Sampling $N_{\mathcal{T}}$ transitions from the HER replay buffer and $N_{\omega}$ preferences results in a time complexity of $\mathcal{O}(N_{\mathcal{T}}\cdot N_{\omega})$ .
•

Homotopy Optimization: This phase involves a Fully Connected Neural Network (FCNN) consisting of an input layer, an output layer, and $E$ fully connected hidden layers. Let $N_{e}$ denote the number of neurons in the $e$ -th fully connected layer. The time complexity for this phase is $\mathcal{O}\left(\sum_{e=1}^{E+1}(N_{e-1}\cdot N_{e})\right)$ , accounting for the computational cost for each layer’s connections.
•

Policy Adaptation: Utilizing a DDQN, the input layer’s neurons correspond to the dimensionality of the state space $\mathcal{S}$ , and the output layer’s neurons correspond to the action space $\mathcal{A}$ . Thus, the numbers of neurons in the input and output layers are $N_{0}=|\mathcal{S}|$ and $N_{E+1}=|\mathcal{A}|$ , respectively. The time complexity for this phase is $\mathcal{O}(|\mathcal{S}|\cdot|\mathcal{A}|)$ , which can be considered constant upon convergence.

Therefore, the overall time complexity for MO-DDQN-envelope can be expressed as $\mathcal{O}(M_{1}\cdot T_{\text{hl}}\cdot|\mathcal{A}|+N_{\mathcal{T}}\cdot N_{\omega}+\sum_{e=1}^{E+1}(N_{e-1}\cdot N_{e})+|\mathcal{S}|\cdot|\mathcal{A}|)$ .
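To give a feel for the relative size of the four terms, the sketch below evaluates them for concrete dimensions. The function names are hypothetical, and the mini-batch size and layer widths used in the usage note are placeholders rather than the experimental settings.

```python
def fcnn_mac_count(layer_sizes):
    """Sum of N_{e-1} * N_e over consecutive layers (one forward pass, biases ignored)."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def per_episode_cost(num_target_avs, horizon, num_actions, batch_size, num_prefs, layer_sizes):
    """Evaluate the four additive terms of the stated bound for concrete sizes."""
    state_dim, action_dim = layer_sizes[0], layer_sizes[-1]
    return (num_target_avs * horizon * num_actions   # action selection
            + batch_size * num_prefs                 # HER sampling
            + fcnn_mac_count(layer_sizes)            # homotopy optimization
            + state_dim * action_dim)                # policy adaptation
```

For example, with 5 target AVs, a horizon of 30 steps, 15 joint actions, 4 sampled preferences, an assumed mini-batch of 32, and an assumed toy network with layer sizes `[4, 8, 2]`, the terms are 2250, 128, 48, and 8. For realistic layer widths the network term grows quadratically with the width and typically dominates.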

V Simulation and Performance Evaluation

In this section, we demonstrate the performance of the proposed algorithms and highlight the complex dynamics between wireless connectivity, traffic flow, and AV speed. All experiments are executed on a PC running Windows 11 with an Intel i7-8770 CPU at 3.2 GHz, 16 GB of DDR5 memory, and an AMD RX580 GPU with 8 GB of GDDR5. Additionally, Google Colab is employed as the cloud platform for reproduction and verification.

V-A Simulation Environment

Our proposed simulation framework is composed of three main components:

•

Telecommunication and Transportation Environment: We introduce the MO-RF-THz-Highway-env framework (https://github.com/sunnyyzj/highway-env-1.7), an enhanced version of highway-env [16], designed to support both autonomous driving policy and 5G/6G network selection for multiple AVs.
•

Extended MO-Gymnasium: We extend MO-Gymnasium [30] (https://github.com/sunnyyzj/MO-Gymnasium) to provide an application programming interface (API) for communication between the DRL algorithms and MO-RF-THz-Highway-env.
•

MORL Algorithms Simulation: For single-policy MORL, the simulation uses a modified version of rl-agents [31] (https://github.com/sunnyyzj/rl-agents). For multi-policy MORL, the proposed MO-DDQN-envelope (https://github.com/sunnyyzj/morl-baselines) and the other multi-policy algorithms are extended from MORL-baselines [32].

The MO-RF-THz-Highway-env features five one-way lanes, each with a length of 1500 m and a width of 4 m, as depicted in Figure 1. In our experimental setup, the default configuration consists of 5 target AVs and 20 surrounding AVs, each having a length of 5 m. The longitudinal velocity of each AV ranges from 10 m/s to 35 m/s. At the beginning of each episode, these AVs are randomly placed across the five lanes. Along both sides of the highway, 5 RBSs and 10 to 50 TBSs are also randomly positioned to ensure a non-uniform distribution at the beginning of each episode. This random placement strategy aims to facilitate the examination of MORL training effectiveness across various VNets and traffic scenarios, as opposed to a singular, common scenario. The maximum duration for each episode is set to 30 time steps. An episode is considered collision-free if it meets two criteria: absence of collisions among all target AVs, and maintenance of a high weighted data rate, as defined in (6), throughout the episode. For single-policy MORL, the reward coefficients $c_{1},c_{2},c_{3},c_{4},c_{5}$ in (20) and (21) from Section III-B3 are set to $0.4$ , $1$ , $0.1$ , $0.2$ , and $4.5\times 10^{-7}$ , respectively.

We use a Multi-Layer Perceptron (MLP) to construct the $Q_{\theta}$ neural network in the training and evaluation phases. Our architecture comprises a three-layer FNN, with each layer hosting 128 neurons. After these three layers, we apply the ReLU activation function. The output layer of the target policy network $Q_{\theta}^{\prime}$ employs a sigmoid function to constrain the range of actions. Apart from the input and output layers, the MO evaluation network $Q_{\theta}$ and the target policy network $Q_{\theta}^{\prime}$ share identical architectures and use the same activation functions. Simulation parameters are detailed in Table I unless otherwise specified.

Table I: Values of system parameters in experiments

Parameter	Value
Value used in system model
RBSs frequency $(f_{R})$	3.5 GHz
TBSs frequency $(f_{T})$	1 THz
Maximum number of affordable AVs quota for a single RBS, TBS $(Q_{R}),(Q_{T})$	5, 10
Antenna gain for TBSs and RBSs $(G_{T}^{\mathrm{tx}})$ , $(G_{T}^{\mathrm{rx}})$	316.2
RBSs channel bandwidth $(W_{R})$	$4\times 10^{7}$
TBSs channel bandwidth $(W_{T})$	$5\times 10^{8}$
Transmit powers of RBSs and TBSs $(P_{R}^{\mathrm{tx}})$ , $(P_{T}^{\mathrm{tx}})$	1 W
Molecular absorption coefficient $(K_{a})$	0.05 $\mathrm{m}^{-1}$
Path loss exponent $(\alpha)$	4
Length for each AV $(l_{j})$	5 m
Target AV heading and lateral control gain $(K_{j}^{\psi})$ , $(K_{j}^{y})$	5, $\frac{5}{3}$
Maximum AV steering angle for AV $j$ $(\max{\beta_{j}})$	$\frac{\pi}{3}$
Surrounding AV $j$ desired maximum acceleration and deceleration $(a_{j})$	3 m/s$^{2}$, -5 m/s$^{2}$
Acceleration reduction factor $(\delta_{a})$	4
Number of lanes $(N_{L})$	5
Value used in MORL
Learning rate $(\alpha_{l})$	$3\times 10^{-4}$
Discount factor $(\gamma)$	0.995
Size of the hidden layers of the value NN	$[256,256,256,256]$
Epsilon decay parameter $(\epsilon)$	0.1
MO-DDQN-envelope HER transition pool size $(N_{\mathcal{T}})$	$2\times 10^{6}$
The number of weight vectors to sample for the envelope target $(N_{\omega})$	4
Frequency for cloning evaluation to target network $(N^{-})$	200
Episode horizon limit $T_{hl}$	30

V-B Baselines and Evaluation Metrics

To comprehensively evaluate the performance of MO-DDQN-envelope, we implement four baseline algorithms for comparison: MO-DQN, MO-DDQN (as discussed in Section IV-A), and two additional MORL algorithms, namely MO-dueling-DDQN and MO-PPO. MO-dueling-DDQN and MO-PPO are variations of MO-DDQN, employing single-policy multi-objective dueling DDQN and Proximal Policy Optimization (PPO) algorithms instead of DDQN, respectively. These algorithms are developed specifically for performance evaluation. It is noteworthy that dueling-DDQN and PPO are extensions of single-policy single-objective RL algorithms proposed in [33] and [34], respectively.

To evaluate the performance, we consider the following metrics. Assume episode $e$ ends at time step $T_{e}$ . We define (1) the total transportation reward: $R^{tran}_{e}=\mathbb{E}_{j\in M_{1}}[\sum_{t=1}^{T_{e}}r^{\mathrm{j,tran}}_{t}]$ , (2) the total communication reward: $R^{tele}_{e}=\mathbb{E}_{j\in M_{1}}[\sum_{t=1}^{T_{e}}r^{\mathrm{j,tele}}_{t}]$ , (3) the collision rate: $\delta_{e}=1-\frac{T_{e}}{T_{hl}}$ , and (4) the HO probability: $\xi_{e}=\mathbb{E}_{j\in M_{1}}[\xi^{j}_{T_{e}}]$ , where $r^{\mathrm{j,tran}}_{t}$ and $r^{\mathrm{j,tele}}_{t}$ are the instantaneous transportation and telecommunication rewards specified in equations (20) and (21), respectively, $T_{hl}$ represents the horizon limit of each episode, and $\xi^{j}_{T_{e}}$ denotes the HO probability of AV $j$ at the end of the episode (step $T_{e}$ ), as defined in equation (21).
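The four metrics can be computed from per-episode logs as sketched below. The function names, the dictionary keys, and the nested-list layout of the logged rewards are assumptions made for illustration; the expectation over target AVs is taken as a plain average.

```python
def collision_rate(episode_length, horizon_limit):
    """delta_e = 1 - T_e / T_hl: zero when the episode runs the full horizon."""
    return 1.0 - episode_length / horizon_limit


def episode_metrics(tran_rewards, tele_rewards, ho_probs, episode_length, horizon_limit):
    """Average per-AV reward sums and handoff probabilities for one episode.

    tran_rewards[j] and tele_rewards[j] list the per-step rewards of target AV j;
    ho_probs[j] is that AV's handoff probability at the final step.
    """
    num_avs = len(tran_rewards)
    return {
        "R_tran": sum(sum(r) for r in tran_rewards) / num_avs,
        "R_tele": sum(sum(r) for r in tele_rewards) / num_avs,
        "collision_rate": collision_rate(episode_length, horizon_limit),
        "ho_probability": sum(ho_probs) / num_avs,
    }
```

Note that the collision rate is defined through the episode length: an episode that is cut short halfway through the horizon yields a value of 0.5, and only an episode that reaches $T_{hl}$ yields zero.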

V-C Results and Discussions

V-C1 Training Performance

We examine the training performance of the proposed MO-DDQN-envelope algorithm and compare it with the other MORL baselines. Fig. 3 depicts the total transportation reward, total telecommunication reward, collision rate, and HO probability as a function of the desired velocity of the AV. MO-DDQN-envelope performs better than the benchmark algorithms (i.e., MO-DQN and MO-DDQN) when performance is evaluated over every 100 episodes. The learning curves show that MO-DDQN-envelope converges more slowly, but to higher total cumulative training rewards on both the transportation and telecommunication sides. Moreover, MO-DDQN-envelope balances the transportation and telecommunication objectives, and its collision rate and HO probability decrease more than those of the other benchmarks. To better understand this improvement, recall that MO-DDQN-envelope samples experience from the replay buffer, which contains mostly recent past preferences and only rarely new exploration preferences. Past preferences are expressed through the weight vector $\vec{\omega}=[\omega_{\mathbf{tran}},\omega_{\mathbf{tele}}]^{T}$ over the transportation and telecommunication objectives, which marginally improves the training for each objective individually regardless of preferences.

Regarding the convergence rate, thanks to homotopy optimization, training first focuses on accuracy on the two objectives and later gradually shifts toward faster convergence; as a result, the training rewards vary less after 3000 episodes of training than in the early stage. We also find that the collision rate is significantly reduced from 0.7 to around 0.2, which illustrates that the trained model improves safety on the highway.
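
The shift in training focus can be illustrated with a blended loss. The sketch below assumes the homotopy form used in the envelope algorithm of [24], $L=(1-\lambda)L^{A}+\lambda L^{B}$, where $L^{A}$ is the squared error on the vector-valued Q estimate and $L^{B}$ is the absolute error on its scalarized value. The linear ramp for $\lambda$, the function names, and the toy numbers are assumptions for illustration and are not taken from the authors' implementation.

```python
from typing import Sequence


def homotopy_weight(episode: int, ramp_episodes: int) -> float:
    """Assumed linear schedule: lambda grows from 0 to 1 over ramp_episodes."""
    return min(1.0, max(0.0, episode / ramp_episodes))


def blended_loss(
    q_pred: Sequence[float],    # predicted [Q_tran, Q_tele]
    q_target: Sequence[float],  # target    [y_tran, y_tele]
    omega: Sequence[float],     # preference [w_tran, w_tele]
    lam: float,                 # homotopy weight in [0, 1]
) -> float:
    # L_A: squared error between the multi-objective vectors.
    loss_a = sum((y - q) ** 2 for q, y in zip(q_pred, q_target))
    # L_B: absolute error between the preference-scalarized values.
    loss_b = abs(sum(w * (y - q) for w, q, y in zip(omega, q_pred, q_target)))
    return (1.0 - lam) * loss_a + lam * loss_b


# Early in training the vector loss dominates; later the scalarized one does.
for ep in (0, 1500, 3000):
    lam = homotopy_weight(ep, ramp_episodes=3000)
    print(ep, lam, blended_loss([1.0, 2.0], [2.0, 4.0], [0.5, 0.5], lam))
```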

V-C2 Impact of BSs Density

We evaluate the performance by averaging over 500 evaluation epochs on models obtained after 4500 episodes of training. In each evaluation step, we randomly distribute different numbers of TBSs alongside the highway in the simulation environment. As depicted in Fig. 4, MO-DDQN-envelope gains an evaluation advantage over the other benchmarks. As the number of BSs grows, the transportation rewards do not change significantly; however, the telecommunication rewards first increase due to better connectivity and later decrease due to more HOs. A growing number of TBSs also increases the average collision rate, due to a potential reduction in AV speed to maintain connectivity.

V-C3 Impact of the Number of AVs

As shown in Fig. 4, increasing the number of AVs leads to more crowded highway scenarios. As a result, more AVs cluster together and connect to the same BS, which causes network outages once the maximum quota at each BS is reached. In addition, frequent lane changes and speeding on the crowded highway cause more congestion and collisions, which reduces transportation performance.

V-C4 Impact of Desired AV Speeds

Fig. 4 shows that slow-moving AVs perform better in terms of both transportation and telecommunication. Increasing speeds lead to more collisions. Moreover, AVs at higher speeds switch BSs more frequently, incurring significant handover penalties according to (21), thus adversely affecting rewards in both domains.

V-C5 Pareto Front Analysis

We employ the CCS in (19) to assess the quality of the estimated Pareto fronts. A greater Pareto frontier value indicates that the estimated Pareto front is closer to the optimal one in terms of the transportation and telecommunication objectives. To compute the CCS, we select the performance of the single-policy MO-DQN and MO-DDQN as reference points. Recall that we need to maximize both the transportation and telecommunication rewards, so the best solutions are situated in the top-right corner, as depicted in Fig. 5. Specifically, Fig. 5 demonstrates that our proposed MO-DDQN-envelope algorithm outperforms the MORL baselines in approximating the Pareto fronts. However, in the high-density transportation scenario, MO-DDQN yields a Pareto front of similar quality to the other baselines.
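
To make the notions of Pareto front and CCS concrete for two maximized objectives, the following sketch filters a set of (transportation, telecommunication) return pairs. It is an illustration under stated assumptions rather than the evaluation code behind Fig. 5 or the exact definition in (19): the function names and toy points are hypothetical, and the CCS is approximated by sweeping a finite grid of preference weights instead of computing the convex hull exactly.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (transportation return, telecommunication return)


def pareto_front(points: List[Point]) -> List[Point]:
    """Points not dominated by any other point (both objectives maximized)."""
    unique = sorted(set(points))
    return [
        p for p in unique
        if not any(q != p and q[0] >= p[0] and q[1] >= p[1] for q in unique)
    ]


def convex_coverage_set(points: List[Point], num_weights: int = 101) -> List[Point]:
    """Pareto points that are best for at least one sampled linear preference."""
    front = pareto_front(points)
    chosen = set()
    for k in range(num_weights):
        w_tran = k / (num_weights - 1)   # weight on transportation
        w_tele = 1.0 - w_tran            # weight on telecommunication
        best = max(front, key=lambda v: w_tran * v[0] + w_tele * v[1])
        chosen.add(best)
    return sorted(chosen)


returns = [(1.0, 5.0), (3.0, 4.5), (2.0, 2.0), (4.0, 2.0), (5.0, 1.0)]
print("Pareto front:", pareto_front(returns))
print("CCS (approx.):", convex_coverage_set(returns))
```

In this toy set, (4.0, 2.0) is Pareto-optimal but lies below the line joining (3.0, 4.5) and (5.0, 1.0), so no linear preference selects it and it is excluded from the CCS.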

VI Conclusion

We introduce a novel MORL framework tailored for devising joint network selection and autonomous driving policies within a multi-band VNet. Our goals encompass enhancing traffic flow, minimizing collisions, maximizing data rates, and minimizing handoffs (HOs). We achieve this through controlling vehicle motion dynamics and network selection, employing a unique reward function that optimizes data rate, traffic flow, load balancing, and penalizes HOs and unsafe driving behaviors. The problem is formalized as a MOMDP, integrating telecommunication and autonomous driving utilities in its rewards. We propose single policy MORL solutions with predefined preferences, transforming the MOOP into a single-objective one, utilizing DQN and double DQN solutions to derive optimal policies dependent on relative preferences.

Addressing the challenge of learning optimized policies across varied preferences, we develop an envelope MORL solution. This approach enables effective navigation across preference spaces, generating tailored policies. Our algorithm leverages the contraction properties of the optimality operator governing a generalized Bellman equation and optimizes the convex envelope of multi-objective Q-values. Hindsight experience replay and homotopy optimization aid in manageable learning across diverse preferences. Additionally, we construct a novel simulation testbed, “RF-THz-Highway-Env,” based on “highway-env,” emulating a multi-band wireless network-enabled VNet. Numerical results demonstrate the superiority of our proposed solution over weighted sum-based MORL solutions with DQN, showcasing improvements of 12.7%, 18.9%, and 12.3% on average transportation reward, average communication reward, and average HO rate, respectively.

References

[1] Z. Yan and H. Tabassum, “Reinforcement learning for joint v2i network selection and autonomous driving policies,” in GLOBECOM 2022 - 2022 IEEE Global Commun. Conf., 2022, pp. 1241–1246.
[2] H. Ye, G. Y. Li, and B.-H. F. Juang, “Deep reinforcement learning based resource allocation for V2V communications,” IEEE Trans. on Vehicular Tech., vol. 68, no. 4, pp. 3163–3173, 2019.
[3] L. Liang, H. Ye, and G. Y. Li, “Spectrum sharing in vehicular networks based on multi-agent reinforcement learning,” IEEE Journal on Sel. Areas in Commun., vol. 37, no. 10, pp. 2282–2292, 2019.
[4] Y. Xu, K. Zhu, H. Xu, and J. Ji, “Deep reinforcement learning for multi-objective resource allocation in multi-platoon cooperative vehicular networks,” IEEE Trans. on Wireless Commun., 2023.
[5] X. Hu, Y. Zhang, X. Liao, Z. Liu, W. Wang, and F. M. Ghannouchi, “Dynamic beam hopping method based on multi-objective deep reinforcement learning for next generation satellite broadband systems,” IEEE Trans. on Broadcasting, vol. 66, no. 3, pp. 630–646, 2020.
[6] G. Yu, Y. Jiang, L. Xu, and G. Y. Li, “Multi-objective energy-efficient resource allocation for multi-rat heterogeneous networks,” IEEE Journal on Sel. Areas in Commun., vol. 33, no. 10, pp. 2118–2127, 2015.
[7] R. Devarajan, S. C. Jha, U. Phuyal, and V. K. Bhargava, “Energy-aware resource allocation for cooperative cellular network using multi-objective optimization approach,” IEEE Trans. on Wireless Commun., vol. 11, no. 5, pp. 1797–1807, 2012.
[8] D. Guo, L. Tang, X. Zhang, and Y.-C. Liang, “Joint optimization of handover control and power allocation based on multi-agent deep reinforcement learning,” IEEE Trans. on Vehicular Tech., vol. 69, no. 11, pp. 13 124–13 138, 2020.
[9] H. Khan, A. Elgabli, S. Samarakoon, M. Bennis, and C. S. Hong, “Reinforcement learning-based vehicle-cell association algorithm for highly mobile millimeter wave communication,” IEEE Trans. on Cognitive Commun. and Networking, vol. 5, no. 4, pp. 1073–1085, 2019.
[10] Z. Wu, K. Qiu, and H. Gao, “Driving policies of V2X autonomous vehicles based on reinforcement learning methods,” IET Intelligent Transport Systems, vol. 14, no. 5, pp. 331–337, 2020.
[11] Z. Yan, W. Jaafar, B. Selim, and H. Tabassum, “Multi-uav speed control with collision avoidance and handover-aware cell association: Drl with action branching,” in GLOBECOM 2023 - 2023 IEEE Global Commun. Conf., 2023, pp. 5067–5072.
[12] X. Liu, Y. Liu, Y. Chen, and L. Hanzo, “Enhancing the fuel-economy of v2i-assisted autonomous driving: A reinforcement learning approach,” IEEE Trans. on Vehicular Tech., vol. 69, no. 8, pp. 8329–8342, 2020.
[13] A. Alizadeh, M. Moghadam, Y. Bicer, N. K. Ure, U. Yavas, and C. Kurtulus, “Automated lane change decision making using deep reinforcement learning in dynamic and uncertain highway environment,” in 2019 IEEE intelligent transportation systems conference (ITSC). IEEE, 2019, pp. 1399–1404.
[14] X. He and C. Lv, “Towards energy-efficient autonomous driving: A multi-objective reinforcement learning approach,” IEEE/CAA Journal of Automatica Sinica, vol. 10, no. 5, pp. 1329–1331, 2023.
[15] W. Wei, R. Yang, H. Gu, W. Zhao, C. Chen, and S. Wan, “Multi-objective optimization for resource allocation in vehicular cloud computing networks,” IEEE Trans. on Intelligent Transportation Systems, vol. 23, no. 12, pp. 25 536–25 545, 2021.
[16] E. Leurent, “An environment for autonomous driving decision-making,” https://github.com/eleurent/highway-env, 2018.
[17] J. Sayehvand and H. Tabassum, “Interference and coverage analysis in coexisting rf and dense terahertz wireless networks,” IEEE Wireless Commun. Letters, vol. 9, no. 10, pp. 1738–1742, 2020.
[18] C. She, C. Sun, Z. Gu, Y. Li, C. Yang, H. V. Poor, and B. Vucetic, “A tutorial on ultrareliable and low-latency communications in 6g: Integrating domain knowledge into deep learning,” Proceedings of the IEEE, vol. 109, no. 3, pp. 204–246, 2021.
[19] Y. Polyanskiy, H. V. Poor, and S. Verdú, “Channel coding rate in the finite blocklength regime,” IEEE Trans. on Information Theory, vol. 56, no. 5, pp. 2307–2359, 2010.
[20] P. Polack, F. Altché, B. d’Andréa Novel, and A. de La Fortelle, “The kinematic bicycle model: A consistent model for planning feasible trajectories for autonomous vehicles?” in 2017 IEEE intelligent vehicles symposium (IV). IEEE, 2017, pp. 812–818.
[21] E. Leurent, “Safe and efficient reinforcement learning for behavioural planning in autonomous driving,” Ph.D. dissertation, Université de Lille, 2020.
[22] M. Treiber and A. Kesting, “Traffic flow dynamics,” Traffic Flow Dynamics: Data, Models and Simulation, Springer-Verlag Berlin Heidelberg, pp. 187–201, 2013.
[23] A. Kesting, M. Treiber, and D. Helbing, “General lane-changing model mobil for car-following models,” Transportation Research Record, vol. 1999, no. 1, pp. 86–94, 2007.
[24] R. Yang, X. Sun, and K. Narasimhan, “A generalized algorithm for multi-objective reinforcement learning and policy adaptation,” Advances in neural information processing systems, vol. 32, 2019.
[25] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” in Proceedings of the AAAI conference on artificial intelligence, vol. 30, no. 1, 2016.
[26] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[27] H.-M. Chen, S.-F. Wang, P. Wang, S. Lin, and C. Fang, “Deep Q-learning for intelligent band coordination in 5g heterogeneous network supporting v2x communication,” Wireless Commun. and Mobile Computing, 2022.
[28] Y. Hou, L. Liu, Q. Wei, X. Xu, and C. Chen, “A novel DDPG method with prioritized experience replay,” in IEEE Intl. Conf. on Systems, Man, and Cybernetics (SMC). IEEE, 2017, pp. 316–321.
[29] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, O. Pieter Abbeel, and W. Zaremba, “Hindsight experience replay,” Advances in neural information processing systems, vol. 30, 2017.
[30] L. N. Alegre, F. Felten, E.-G. Talbi, G. Danoy, A. Nowé, A. L. C. Bazzan, and B. C. da Silva, “MO-Gym: A library of multi-objective reinforcement learning environments,” in Proceedings of the 34th Benelux Conf. on Artificial Intelligence BNAIC/Benelearn 2022, 2022.
[31] E. Leurent, “rl-agents: Implementations of reinforcement learning algorithms,” https://github.com/eleurent/rl-agents, 2018.
[32] F. Felten, L. N. Alegre, A. Nowé, A. L. C. Bazzan, E. G. Talbi, G. Danoy, and B. C. d. Silva, “A toolkit for reliable benchmarking and research in multi-objective reinforcement learning,” in Proceedings of the 37th Conf. on Neural Information Processing Systems (NeurIPS 2023), 2023.
[33] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Dueling network architectures for deep reinforcement learning,” in International conference on machine learning. PMLR, 2016, pp. 1995–2003.
[34] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.