Generalized Multi-Objective Reinforcement Learning with Envelope Updates in URLLC-enabled Vehicular Networks

Zijiang Yan and Hina Tabassum

Z. Yan and H. Tabassum were with the Department of Electrical Engineering and Computer Science, York University, Toronto, ON, M3J 1P3 Canada (e-mail: {zjiyan,hinat}@yorku.ca). This research was supported by a Discovery Grant funded by the Natural Sciences and Engineering Research Council of Canada (NSERC). A preliminary version of this work has been presented at the IEEE Global Communications Conference (GLOBECOM), 2022 [1].
Abstract

We develop a novel multi-objective reinforcement learning (MORL) framework to jointly optimize wireless network selection and autonomous driving policies in a multi-band vehicular network operating on conventional sub-6GHz spectrum and Terahertz (THz) frequencies. The proposed framework is designed to (i) maximize the traffic flow and minimize collisions by controlling the vehicle’s motion dynamics (i.e., speed and acceleration), and (ii) enhance ultra-reliable low-latency communication (URLLC) while minimizing handoffs (HOs). We cast this problem as a multi-objective Markov Decision Process (MOMDP) and develop solutions for both predefined and unknown preferences of the conflicting objectives. Specifically, we first develop deep-Q-network and double deep-Q-network-based solutions that scalarize the transportation and telecommunication rewards using predefined preferences. We then develop a novel envelope MORL solution that learns policies addressing multiple objectives whose preferences are unknown to the agent. While this approach reduces reliance on scalar rewards, the variation of policy effectiveness across preferences is a challenge. To address this, we apply a generalized version of the Bellman equation and optimize the convex envelope of multi-objective Q-values to learn a unified parametric representation capable of generating optimal policies across all possible preference configurations. Following an initial learning phase, our agent can execute optimal policies under any specified preference or infer preferences from minimal data samples. Numerical results validate the efficacy of the envelope-based MORL solution and provide insights into the inter-dependency of vehicle motion dynamics, HOs, and the communication data rate. The proposed policies enable autonomous vehicles to adopt safe driving behaviors with improved connectivity.

Index Terms:
Autonomous driving, multi-objective reinforcement learning, multi-band network selection, resource allocation

I Introduction

Facilitating ultra-reliable and low-latency vehicle-to-infrastructure (V2I) communications is a fundamental prerequisite for the realization of autonomous and intelligent transportation systems. Different from throughput-oriented conventional communications, ensuring ultra-reliable low-latency communications (URLLC) is challenging as it depends on the signal-to-interference-plus-noise ratio (SINR), data rate, over-the-air/queuing latency, and decoding probability. Conventional radio frequency (RF) alone cannot efficiently meet the stringent URLLC requirements due to its limited coverage and narrow transmission bandwidths. In this context, 6G technology enables combining conventional sub-6GHz transmissions (we use sub-6GHz and RF communication interchangeably in this paper) with extremely high frequencies such as THz transmissions, where the former can compensate for the severe path-loss attenuation of THz transmission, and the latter can help overcome the RF spectrum congestion.

On the other hand, the use of deep reinforcement learning (DRL) is becoming critical for online decision making in highly random mobility-oriented wireless environments. In the context of V2I communications, a plethora of research works focused on improving network quality of service (QoS) (e.g., transmission delay and link throughput) via DRL-based resource allocation [2, 3, 4]. These works focused on sub-channel and power allocation to enhance V2I communications. In particular, the authors in [2, 3] adopted deep Q-network (DQN) and multi-agent DQN to simultaneously enhance the total throughput of V2I connections and the payload delivery rate of vehicle-to-vehicle (V2V) connections. Xu et al. [4] derived the contribution-based dual-clip proximal policy to optimize V2I and V2V links separately. However, their system model contains only a single BS, so handovers (HOs) are not considered.

Recently, the authors in [5, 6, 7, 8, 9] formulated similar optimization tasks as a multi-objective optimization problem (MOOP) and proposed multi-objective reinforcement learning (MORL) solutions. Hu et al. [5] implemented Double-Loop Learning (DLL) to minimize the latency of real-time service transmissions and maximize the throughput of non-instant service transmissions. In [6, 7], the authors adopted the weighted Tchebycheff method and weighted-sum MORL, respectively, to maximize the ratio between the data rate and the power consumption. In [8], Guo et al. applied the Multi-Agent Proximal Policy Optimization (MAPPO) algorithm to address the joint handover and power allocation problem. In [9], Khan et al. utilized the Asynchronous Advantage Actor-Critic (A3C) algorithm to devise a vehicle-RSU association policy, aiming to enhance the mobile user experience by maximizing the sum data rate of multiple AVs while ensuring a minimum service rate for all AVs.

On another note, most existing research works in transportation focus on collision avoidance [10, 11], safe driving [12], and efficient fuel consumption [12, 13, 14]. For instance, in [10], the authors applied an RL framework for faster travel, where the reward is proportional to the AV’s velocity along with a penalty for vehicle collision. The action space included acceleration, deceleration, lane changes (LC), and maintaining speed, whereas the state space was based on the AVs’ locations and their respective velocities. In [12], the authors applied DDQN to enhance AVs’ driving safety and fuel consumption. The state space included AVs’ locations, fuel consumption, and velocities, whereas the actions included the speeds and LC of AVs. In [13], the authors applied the Intelligent Driver Model (IDM) and Minimizing Overall Braking Induced by Lane Change (MOBIL) to control steering and lane change. The proposed reward design encourages long traffic and high speed, and discourages unnecessary LC. The authors in [14] introduced a multi-objective (MO) actor-critic method to improve the tradeoff between two objectives, namely the energy consumption and the travel efficiency of AVs.

To date, none of the existing research works [2, 4, 3, 8, 9, 5, 15, 6, 7, 10, 12, 13, 14] have considered the inter-dependency between AV motion dynamics and wireless data rates. To jointly improve the communication data rate and minimize road collisions of connected vehicles, it is critical to optimize the AVs’ network selection and driving policies simultaneously. Our contributions can be summarized as follows:

  • We develop an MORL framework to design joint network selection and autonomous driving policies in a multi-band vehicular network (VNet). The objectives are to (i) maximize the traffic flow and minimize collisions by controlling the vehicle’s motion dynamics (i.e., speed and acceleration) from a transportation perspective, and (ii) maximize the data rates and minimize handoffs (HOs) by jointly controlling the vehicle’s motion dynamics and network selection from a telecommunication perspective. We consider a novel reward function that maximizes data rate and traffic flow, ensures traffic load balancing across the network, and penalizes HOs and unsafe driving behaviors.

  • The considered problem is formulated as a multi-objective Markov decision process (MOMDP) with a two-dimensional action space and rewards consisting of telecommunication and autonomous driving utilities. We then propose single-policy MORL solutions with predefined preferences, which convert the MOOP into a single-objective problem, and apply DQN and double DQN solutions. The resulting optimal policy depends on the relative preferences of the objectives.

  • Learning optimized policies across multiple preferences remains challenging. To address this, we develop a novel envelope MORL solution that covers the entire spectrum of preferences within a given domain, so that the trained model can generate the best possible policy for any user-defined preference. Our algorithm hinges on two fundamental insights. First, we demonstrate that the optimality operator governing a generalized Bellman equation with preferences exhibits valid contraction properties. Second, by optimizing the convex envelope of multi-objective Q-values, we ensure an efficient alignment between preferences and the resulting optimal policies. Leveraging hindsight experience replay, we reuse transitions to facilitate learning across various sampled preferences, while employing homotopy optimization to keep the learning process manageable.

  • We develop a novel simulation testbed, RF-THz-Highway-Env, that emulates a multi-band wireless network-enabled VNet and is based on highway-env [16]. This test environment not only inherits the autonomous driving and lane-change features on the highway from [16], but also implements RF/THz channel propagation modeling, network selection, and HO control.

  • Numerical results show that the proposed solution outperforms weighted sum-based MORL solutions with DQN by 12.7%, 18.9%, and 12.3% on average transportation reward, average communication reward, and average HO rate, respectively.

The rest of this paper is organized as follows. Section II presents the system model, and Section III provides the RL, DRL, and MORL background along with the problem formulation. Section IV introduces the proposed MORL approach. The simulations are presented in Section V, and Section VI concludes this work.

II System Model and Problem Formulation

A multi-band downlink network comprising $n_R$ RF base stations (RBSs) and $n_T$ THz base stations (TBSs) is considered. A multi-vehicle network is also considered, with a multi-lane road comprising $N_L$ lanes. $M$ AVs receive information from the BSs (deployed alongside the road) through V2I communications (as depicted in Figure 1). Each AV is permitted to associate with only one BS at a time, regardless of whether the BS is an RBS or a TBS. The on-board units (OBUs) on the AVs receive real-time information from the VNet, including the velocity, acceleration, and lane position of surrounding vehicles. Each RBS and TBS has an available bandwidth, represented by $W_R$ and $W_T$, respectively. Each RBS and TBS can support a maximum number of AVs, denoted by $Q_R$ and $Q_T$, respectively. All AVs are equipped with a single antenna.

Figure 1: The diagram illustrates the structure of the multi-band VNet model. The blue and red circles represent TBSs and RBSs, respectively. The solid and dashed lines represent desired signal links and interference links, respectively.

II-A Downlink V2I Data Transmission Model

The signal transmitted by the RBS is subject to path-loss and short-term channel fading. Subsequently, the signal-to-interference-plus-noise ratio (SINR) of the $j$-th AV from the $i$-th RBS is given as [17]:

$$\mathrm{SINR}^{\mathrm{RF}}_{ij}=\frac{P_{R}^{\mathrm{tx}}\,G_{R}^{\mathrm{tx}}\,G_{R}^{\mathrm{rx}}\left(\frac{c}{4\pi f_{R}}\right)^{2}H_{i}}{r_{ij}^{\alpha}\left(\sigma^{2}+I_{R_{j}}\right)}, \tag{1}$$

where $P_{R}^{\mathrm{tx}}$, $G_{R}^{\mathrm{tx}}$, $G_{R}^{\mathrm{rx}}$, $c$, $f_{R}$, $r_{ij}$, and $\alpha$ denote the transmit power of the RBSs, the gain of the transmitting antenna, the gain of the receiving antenna, the speed of light, the RF carrier frequency (in GHz), the distance between the $j$-th AV and the $i$-th RBS, and the path-loss exponent, respectively. In addition, $H_{i}$ is the exponentially distributed channel fading power observed at the $j$-th AV from the $i$-th RBS, $\sigma^{2}$ is the power of the thermal noise at the receiver, and $I_{R_{j}}$ is the cumulative interference at the $j$-th AV from the interfering RBSs, i.e., $I_{R_{j}}=\sum_{k\neq i}P_{R}^{\mathrm{tx}}\gamma_{R}\,r_{kj}^{-\alpha}H_{k}$, where $r_{kj}$ is the distance between the $k$-th interfering RBS and the $j$-th AV, $H_{k}$ is the fading power from the $k$-th interfering RBS to the $j$-th AV, and $\gamma_{R}=G_{R}^{\mathrm{tx}}\,G_{R}^{\mathrm{rx}}\left(c/4\pi f_{R}\right)^{2}$.

In a Terahertz (THz) network, where molecular absorption significantly impacts signal propagation, line-of-sight (LOS) transmissions dominate non-line-of-sight (NLOS) transmissions. Consequently, the SINR of the $j$-th AV served by the $i$-th TBS can be modeled as follows [17]:

$$\mathrm{SINR}^{\mathrm{THz}}_{ij}=\frac{G_{T}^{\mathrm{tx}}G_{T}^{\mathrm{rx}}\left(\frac{c}{4\pi f_{T}}\right)^{2}P_{T}^{\mathrm{tx}}\,\exp\left(-K_{a}(f_{T})\,r_{ij}\right)r_{ij}^{-2}}{N_{T_{j}}+I_{T_{j}}}, \tag{2}$$

where $G_{T}^{\mathrm{tx}}$, $G_{T}^{\mathrm{rx}}$, $P_{T}^{\mathrm{tx}}$, $f_{T}$, $r_{ij}$, and $K_{a}(f_{T})$ represent the transmit antenna gain of the TBS, the receiving antenna gain of the TBS, the transmit power of the TBS, the THz carrier frequency, the distance between the $j$-th AV and the $i$-th TBS, and the molecular absorption coefficient of the transmission medium, respectively. (For the sake of brevity, the argument of $K_{a}(f_{T})$ will henceforth be omitted in this study.) It is important to note that $G_{T}^{\mathrm{rx}}(\theta)$ and $G_{T}^{\mathrm{tx}}(\theta)$ denote the antenna gains at the receiver and transmitter sides corresponding to the boresight direction angle $\theta\in[-\pi,\pi)$.
The beamforming gain from the main and side lobes of the TBS transmitting antenna is subsequently defined as,

$$G_{T}^{q}(\theta)=\begin{cases}G^{q}_{\mathrm{max}}, & |\theta|\leq w_{q}\\ G^{q}_{\mathrm{min}}, & |\theta|>w_{q}\end{cases}, \tag{3}$$

where the superscript $q\in\{\mathrm{tx},\mathrm{rx}\}$ indicates the transmit/receive antenna, $w_{q}$ is the beamwidth of the main lobe, and $G^{q}_{\mathrm{max}}$ and $G^{q}_{\mathrm{min}}$ are the beamforming gains of the main and side lobes, respectively. We assume that AVs can align the receiving beam with the TBS transmit beam using beam alignment techniques. For the alignment between the user and interfering TBSs, we define a random variable $\Theta\in\{G^{\mathrm{tx}}_{\mathrm{max}}G^{\mathrm{rx}}_{\mathrm{max}},\,G^{\mathrm{tx}}_{\mathrm{max}}G^{\mathrm{rx}}_{\mathrm{min}},\,G^{\mathrm{tx}}_{\mathrm{min}}G^{\mathrm{rx}}_{\mathrm{max}},\,G^{\mathrm{tx}}_{\mathrm{min}}G^{\mathrm{rx}}_{\mathrm{min}}\}$, and the respective probability for each case is $F_{\mathrm{tx}}F_{\mathrm{rx}}$, $F_{\mathrm{tx}}(1-F_{\mathrm{rx}})$, $(1-F_{\mathrm{tx}})F_{\mathrm{rx}}$, and $(1-F_{\mathrm{tx}})(1-F_{\mathrm{rx}})$, where $F_{\mathrm{tx}}=\frac{\theta_{\mathrm{tx}}}{2\pi}$ and $F_{\mathrm{rx}}=\frac{\theta_{\mathrm{rx}}}{2\pi}$, with $\theta_{\mathrm{tx}}$ and $\theta_{\mathrm{rx}}$ being the beamwidths of the transmitter and receiver antennas, respectively. Without loss of generality, we consider negligible side lobe gains. Thus, the cumulative interference $I_{T_{j}}$ at the $j$-th AV from the interfering TBSs is given as $I_{T_{j}}=\sum_{k\neq i}\gamma_{T}\,P_{T}^{\mathrm{tx}}\,F_{\mathrm{tx}}F_{\mathrm{rx}}\,r_{kj}^{-2}\exp(-K_{a}\,r_{kj})$, where $\gamma_{T}=G_{T}^{\mathrm{tx}}G_{T}^{\mathrm{rx}}\left(\frac{c}{4\pi f_{T}}\right)^{2}$. The cumulative thermal and molecular absorption noise is thus given as:

$$N_{T_{j}}=N_{0}+P_{T}^{\mathrm{tx}}\gamma_{T}\,r_{ij}^{-2}\left(1-e^{-K_{a}r_{ij}}\right)+\sum_{k\neq i}\gamma_{T}F_{\mathrm{tx}}F_{\mathrm{rx}}\,P_{T}^{\mathrm{tx}}\,r_{kj}^{-2}\left(1-e^{-K_{a}r_{kj}}\right). \tag{4}$$
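To make the two link models concrete, the following is a minimal Python sketch of Eqs. (1), (2), and (4) together with the interference sums $I_{R_j}$ and $I_{T_j}$. It is an illustration under stated assumptions rather than the simulator's actual code: the function names (`friis_factor`, `sinr_rf`, `sinr_thz`) and argument layout are hypothetical, all quantities are in linear scale (not dB), and carrier frequencies are assumed to be passed in Hz so that $c/(4\pi f)$ is dimensionally consistent with $c$ in m/s.

```python
import math

C = 3.0e8  # speed of light in m/s (assumed value for this sketch)


def friis_factor(g_tx, g_rx, freq_hz):
    """gamma = G_tx * G_rx * (c / (4*pi*f))^2, matching gamma_R and gamma_T."""
    return g_tx * g_rx * (C / (4.0 * math.pi * freq_hz)) ** 2


def sinr_rf(p_tx, gamma_r, alpha, r_serving, h_serving, interferers, noise_power):
    """RF SINR of Eq. (1).

    interferers: list of (distance r_kj, fading power H_k) pairs for k != i.
    """
    # I_Rj = sum_{k != i} P_tx * gamma_R * r_kj^(-alpha) * H_k
    interference = sum(p_tx * gamma_r * r ** (-alpha) * h for r, h in interferers)
    signal = p_tx * gamma_r * h_serving * r_serving ** (-alpha)
    return signal / (noise_power + interference)


def sinr_thz(p_tx, gamma_t, k_a, r_serving, interferer_distances, f_tx, f_rx, n0):
    """THz SINR of Eq. (2), with the noise term of Eq. (4).

    f_tx, f_rx: main-lobe alignment probabilities F_tx and F_rx.
    Side-lobe gains are neglected, as assumed in the text.
    """
    # I_Tj = sum_{k != i} gamma_T * P_tx * F_tx * F_rx * r_kj^-2 * exp(-K_a * r_kj)
    interference = sum(
        gamma_t * p_tx * f_tx * f_rx * r ** -2 * math.exp(-k_a * r)
        for r in interferer_distances
    )
    # Eq. (4): thermal noise plus molecular absorption noise from the
    # serving link and from the interfering links.
    noise = (
        n0
        + p_tx * gamma_t * r_serving ** -2 * (1.0 - math.exp(-k_a * r_serving))
        + sum(
            gamma_t * f_tx * f_rx * p_tx * r ** -2 * (1.0 - math.exp(-k_a * r))
            for r in interferer_distances
        )
    )
    signal = gamma_t * p_tx * math.exp(-k_a * r_serving) * r_serving ** -2
    return signal / (noise + interference)


if __name__ == "__main__":
    # Purely illustrative numbers, not the parameters used in the paper.
    g_r = friis_factor(1.0, 1.0, 2.1e9)
    g_t = friis_factor(316.0, 316.0, 1.0e12)
    print(sinr_rf(1.0, g_r, 3.0, 50.0, 1.0, [(120.0, 0.8)], 1e-12))
    print(sinr_thz(1.0, g_t, 0.05, 20.0, [60.0], 0.03, 0.03, 1e-12))
```

Note that in the THz model the absorbed part of the serving signal reappears as noise in Eq. (4), so increasing $K_a$ lowers the SINR through both the numerator and the denominator.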

The traditional data rate relies on Shannon’s capacity, which can be attained as the block-length of channel codes approaches infinity. Nevertheless, to prevent prolonged transmission delays in URLLC, the block length must be limited. Consequently, Shannon’s capacity cannot be realized due to the presence of a non-zero decoding error probability [18]. From [19], the achievable rate in the short block-length regime over an AWGN channel can be approximated as:

$$R_{ij}=\frac{W_{j}}{\ln 2}\left[\ln\left(1+\mathrm{SINR}_{ij}\right)-\sqrt{\frac{V}{L_{B}}}\,f_{Q}^{-1}(\epsilon_{c})\right], \tag{5}$$

where $W_j$, $L_B$, $\epsilon_c$, $f_Q^{-1}(\cdot)$, and $V$ are the transmission bandwidth of BS $j$, the blocklength, the decoding error probability, the inverse $Q$-function, and the channel dispersion, respectively. $V$ can be calculated as $V = 1 - \frac{1}{(1+\mathrm{SINR}_{ij})^{2}}$. Given a time duration $D_t$ to transmit $L_B$ symbols, the time and frequency resources are related by $D_t W = L_B$. As the blocklength $L_B$ approaches infinity, the achievable rate in (5) approaches Shannon's capacity.
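To illustrate how the blocklength penalty in (5) behaves, the following minimal Python sketch evaluates the short-blocklength rate and compares it with Shannon's capacity. The function names, the numerical inputs, and the clipping of negative rates to zero are assumptions made for illustration only and are not part of the system model.

```python
import math
from statistics import NormalDist


def inverse_q(eps: float) -> float:
    """Inverse Gaussian Q-function: Q^{-1}(eps) = Phi^{-1}(1 - eps)."""
    return NormalDist().inv_cdf(1.0 - eps)


def short_block_rate(w_hz: float, sinr: float, block_len: float, eps: float) -> float:
    """Achievable rate of Eq. (5) in bit/s for a linear-scale SINR."""
    v = 1.0 - 1.0 / (1.0 + sinr) ** 2                     # channel dispersion V
    penalty = math.sqrt(v / block_len) * inverse_q(eps)   # finite-blocklength loss
    rate = w_hz / math.log(2) * (math.log(1.0 + sinr) - penalty)
    return max(0.0, rate)  # assumption: a negative value is read as "no reliable rate"


def shannon_rate(w_hz: float, sinr: float) -> float:
    """Shannon capacity in bit/s, the limit of Eq. (5) as L_B grows."""
    return w_hz * math.log2(1.0 + sinr)


if __name__ == "__main__":
    for block_len in (100, 1_000, 1_000_000):
        r = short_block_rate(1e6, 10.0, block_len, 1e-5)
        print(block_len, round(r), round(shannon_rate(1e6, 10.0)))
```

The gap to Shannon's capacity shrinks as $L_B$ grows, since the penalty term scales with $1/\sqrt{L_B}$.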

Each AV maintains a list of the BSs in terms of the achievable data rate $R_{ij}$ and then informs these BSs. Consequently, each BS can calculate the number of possible AV associations at each time instance, denoted by $n_i$. Then, the AV collects the traffic load information from these BSs (i.e., the number of AVs $n_i$ associated with each BS). Based on the quota of each BS $i$, $Q_i \in [Q_R, Q_T]$, each AV computes a weighted data rate metric that encourages traffic load balancing at each BS and discourages unnecessary HOs, i.e.,

$$\mathrm{WR}_{ij} = \frac{R_{ij}}{\min\left(Q_i, n_i\right)}(1-\mu) \tag{6}$$

where $\mu$ denotes the HO penalty that discourages unnecessary HOs, defined as follows:

$$\mu = \begin{cases} 0.1, & \text{if switching to an RBS} \\ 0.5, & \text{if switching to a TBS} \\ 0, & \text{if keeping the previous BS} \end{cases} \tag{7}$$

As AVs traverse the corridor, they transition from one BS to another, which is referred to as a handoff (HO). We distinguish between two types of HOs: horizontal and vertical. A horizontal HO occurs when the AV connection shifts from one BS to another BS of the same type. In contrast, a vertical HO occurs when the AV connection transitions between BSs of different types, such as moving from an RBS to a TBS. Frequent HOs can significantly degrade the data rate that AVs receive, due to the inherent latency and failure rates associated with HOs. We therefore introduce a penalty, denoted by $\mu$, to discourage HOs. This penalty is higher for TBSs and lower for RBSs, reflecting the fact that THz transmission is limited to a relatively short distance, which makes TBS connections more prone to unnecessary HOs.

Then, each AV prepares a list of BSs sorted by the weighted data rate $\mathrm{WR}_{ij}$ and associates with those that can fulfill the AV's data rate requirement $R_{\mathrm{th}}$.
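The association procedure can be sketched in a few lines of Python. This is an illustrative sketch only: the `Candidate` container, the function names, the guard against an empty BS (`max(1, ...)`), and the choice to compare the weighted rate (rather than the raw rate) against $R_{\mathrm{th}}$ are assumptions that the text does not specify.

```python
from dataclasses import dataclass

# HO penalty of Eq. (7), indexed by the type of the BS being switched to.
HO_PENALTY = {"RBS": 0.1, "TBS": 0.5}


@dataclass
class Candidate:
    bs_id: int
    bs_type: str   # "RBS" or "TBS"
    rate: float    # achievable rate R_ij
    quota: int     # quota Q_i
    load: int      # number of associated AVs n_i


def weighted_rate(c: Candidate, current_bs_id: int) -> float:
    """Weighted data rate of Eq. (6); no penalty if the AV keeps its BS."""
    mu = 0.0 if c.bs_id == current_bs_id else HO_PENALTY[c.bs_type]
    share = max(1, min(c.quota, c.load))  # assumption: avoid division by zero
    return c.rate / share * (1.0 - mu)


def rank_candidates(cands, current_bs_id, r_th):
    """Return (WR, bs_id) pairs meeting r_th, best weighted rate first."""
    scored = [(weighted_rate(c, current_bs_id), c.bs_id) for c in cands]
    return sorted((s for s in scored if s[0] >= r_th), reverse=True)


if __name__ == "__main__":
    cands = [
        Candidate(1, "RBS", 100.0, 5, 2),  # currently serving BS
        Candidate(2, "TBS", 400.0, 5, 2),
        Candidate(3, "RBS", 60.0, 5, 3),
    ]
    print(rank_candidates(cands, current_bs_id=1, r_th=20.0))
```

In this toy example the TBS still ranks first despite its larger HO penalty, because its raw rate is four times that of the serving RBS.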

II-B Transportation Model

We categorize the $M$ AVs into two groups: target vehicles, denoted by $M_1$, and surrounding vehicles, denoted by $M_2$.

II-B1 Kinematics Model for all AVs

Following [20, 16], we update the real-time physical location of all AVs based on the kinematics model. Assuming that only the front wheels can be steered, the following relations apply to each AV $j$, $j \in M$:

$$\frac{\partial}{\partial t}(x_j) = v_j\cos(\psi_j+\beta_j), \quad \beta_j = \arctan\left(\frac{\tan\delta_j^{\mathrm{fa}}}{2}\right) \tag{8}$$
$$\frac{\partial}{\partial t}(y_j) = v_j\sin(\psi_j+\beta_j) \tag{9}$$
$$\frac{\partial}{\partial t}(v_j) = a_j, \quad \frac{\partial}{\partial t}(\psi_j) = \frac{v_j}{l_j}\sin\beta_j \tag{10}$$

where $(x_j, y_j)$ is the location of AV $j$, and $\psi_j$, $a_j$, $\beta_j$, $l_j$, and $\delta_j^{\mathrm{fa}}$ are AV $j$'s heading, acceleration command, slip angle at the center of gravity, half-length, and front wheel angle, respectively.
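As an illustration, (8)-(10) can be integrated numerically. The sketch below uses a forward-Euler step; the discretization, the step size `dt`, and all names are assumptions introduced here, since the text does not state how the model is integrated.

```python
import math
from dataclasses import dataclass


@dataclass
class VehicleState:
    x: float    # longitudinal position x_j
    y: float    # lateral position y_j
    v: float    # forward speed v_j
    psi: float  # heading psi_j (rad)


def kinematic_step(s: VehicleState, accel: float, steer: float,
                   half_length: float, dt: float) -> VehicleState:
    """One forward-Euler step of Eqs. (8)-(10)."""
    beta = math.atan(math.tan(steer) / 2.0)  # slip angle at the center of gravity
    return VehicleState(
        x=s.x + s.v * math.cos(s.psi + beta) * dt,
        y=s.y + s.v * math.sin(s.psi + beta) * dt,
        v=s.v + accel * dt,
        psi=s.psi + s.v / half_length * math.sin(beta) * dt,
    )


if __name__ == "__main__":
    state = VehicleState(x=0.0, y=0.0, v=10.0, psi=0.0)
    for _ in range(10):  # one second of gentle left steering
        state = kinematic_step(state, accel=1.0, steer=0.05, half_length=2.5, dt=0.1)
    print(state)
```

With zero steering the vehicle moves along a straight line and only its speed changes, which is a quick sanity check on the implementation.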

II-B2 Acceleration and Lane Change of Target Vehicles

As in [16, 21], each target AV $j$, $j \in M_1$, follows a lane $L_j$ through its lateral position $y_j$ and heading $\psi_{L_j}$. Thus, target AVs follow a cascade controller of lateral position and heading, i.e.,

$$\frac{\partial}{\partial t}(\psi_j) = K_j^{\psi}\left[\psi_{L_j} + \arcsin\left(\frac{\tilde{v}_{i,y}}{v_j}\right) - \psi_j\right] \tag{11}$$

where $\tilde{v}_{i,y} = K_j^{y}\left(y_{L_j} - y_j\right)$, $K_j^{\psi}$ and $K_j^{y}$ are the control gains, and $y_{L_j}$ is the lateral position of lane $L_j$. The acceleration of target AV $j$, $j \in M_1$, is given as [16, 21]:

$$a_j = K_0^{v}(v_r - v_j) \tag{12}$$

where $v_r$ is the desired AV speed and $K_0^{v}$ is the control gain.
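A minimal Python sketch of the cascade lateral controller in (11) and the proportional speed controller in (12) is given below. The function names and the gain values are hypothetical, and clipping the argument of $\arcsin$ to $[-1, 1]$ is an added safeguard that is not part of the equations.

```python
import math


def heading_rate(psi: float, y: float, v: float, lane_y: float,
                 lane_psi: float, k_y: float, k_psi: float) -> float:
    """Heading rate of Eq. (11) for a vehicle with non-zero speed v."""
    v_lat = k_y * (lane_y - y)               # lateral velocity command
    ratio = max(-1.0, min(1.0, v_lat / v))   # assumption: keep asin well defined
    psi_ref = lane_psi + math.asin(ratio)    # heading reference
    return k_psi * (psi_ref - psi)


def speed_control(v: float, v_ref: float, k_v: float) -> float:
    """Acceleration command of Eq. (12)."""
    return k_v * (v_ref - v)


if __name__ == "__main__":
    # Vehicle 2 m to the right of the lane center, aligned with the lane.
    print(heading_rate(psi=0.0, y=0.0, v=10.0, lane_y=2.0,
                       lane_psi=0.0, k_y=0.2, k_psi=0.5))
    print(speed_control(v=20.0, v_ref=25.0, k_v=0.5))
```

The outer loop converts the lateral offset into a lateral velocity command, and the inner loop turns that command into a heading reference that the vehicle tracks.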

II-B3 Acceleration and Lane Change of Surrounding Vehicles

Note that surrounding AVs are not involved in training. Instead, they select their accelerations and decelerations using the Intelligent Driver Model (IDM) [16, 22] and change lanes according to the MOBIL model [23, 16], which allows them to track a specific target speed and follow a specific lane. IDM is based on the idea that surrounding AVs choose their acceleration or deceleration based on their current speed and the distance to the vehicle in front. The model combines a desire to drive at a certain free-flow speed with a reaction to the traffic situation, particularly to avoid collisions. For AV $j$, let $v_j$ denote its own velocity, $d_j$ the actual gap distance to the leading AV, and $\Delta v_j$ the speed difference between the two AVs. With IDM, the acceleration $a_j$ and the desired minimum gap $\hat{d}_j$ to the vehicle ahead of AV $j$, $j \in M_2$, are modeled as [16, 22]:

$$a_j = a_c - a_c\left[\left(\frac{|v_j|}{v_0}\right)^{\delta_a} + \left(\frac{\hat{d}_j}{d_j}\right)^{2}\right], \tag{13}$$

where

$$\hat{d}_j = d_0 + \max\left(0,\ T v_j + \frac{v_j \Delta v_j}{2\sqrt{a_c b_c}}\right), \tag{14}$$

where $a_c$, $b_c$, and $v_0$ are predefined parameters denoting the AV's maximum acceleration, comfortable braking deceleration, and desired velocity, respectively. $\delta_a$ denotes the acceleration reduction factor, i.e., a higher $\delta_a$ reduces acceleration. $d_0$ and $T$ denote the minimum distance in stopped traffic and the safe time gap to the front vehicle, respectively. The desired gap has a steady-state (equilibrium) term and a dynamic term that implements the intelligent braking strategy.
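The IDM relations in (13) and (14) translate directly into code. In the sketch below, the parameter values in the example and the sign convention for the speed difference (follower speed minus leader speed, so that a positive value means the gap is closing) are assumptions chosen for illustration.

```python
import math


def desired_gap(v: float, dv: float, d0: float, t_gap: float,
                a_c: float, b_c: float) -> float:
    """Desired minimum gap of Eq. (14)."""
    return d0 + max(0.0, t_gap * v + v * dv / (2.0 * math.sqrt(a_c * b_c)))


def idm_acceleration(v: float, dv: float, gap: float, v0: float, a_c: float,
                     b_c: float, d0: float, t_gap: float, delta_a: float) -> float:
    """IDM acceleration of Eq. (13) for a positive actual gap."""
    d_hat = desired_gap(v, dv, d0, t_gap, a_c, b_c)
    return a_c - a_c * ((abs(v) / v0) ** delta_a + (d_hat / gap) ** 2)


if __name__ == "__main__":
    params = dict(v0=30.0, a_c=3.0, b_c=5.0, d0=5.0, t_gap=1.5, delta_a=4.0)
    print(idm_acceleration(v=20.0, dv=0.0, gap=200.0, **params))  # open road
    print(idm_acceleration(v=20.0, dv=5.0, gap=10.0, **params))   # closing in fast
```

On an open road the free-flow term dominates and the vehicle accelerates toward $v_0$; when the actual gap falls below the desired gap, the interaction term dominates and the vehicle brakes.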

To maintain high transportation efficiency, redundant braking by surrounding AVs during lane changes is discouraged. The MOBIL model governs the lateral behavior of an AV by minimizing the overall braking induced by a lane change. Under MOBIL, an AV changes lanes when both of the following conditions are met: (1) the lane change is safe, i.e., $\tilde{a}_j \geq -b_{\mathrm{safe}}$, where $b_{\mathrm{safe}}$ is the maximum braking imposed on an AV when an AV cuts into the adjacent lane, and (2) the lane change offers a sufficient acceleration incentive after accounting for the braking it imposes on the following vehicles, i.e.,

$$\tilde{a}_j - a_j + p\left(\tilde{a}_o - a_o + \tilde{a}_n - a_n\right) \geq \Delta a_{\mathrm{th}} \tag{15}$$

where $a_j$ and $\tilde{a}_j$ are the accelerations of the AV before and after the lane change, respectively. The subscripts $o$ and $n$ denote the AV's old follower (before the lane change) and new follower (after the lane change), respectively. $p$ represents the politeness coefficient, and $\Delta a_{\mathrm{th}}$ is the acceleration gain threshold required to execute the lane change. A lane change decision adjusts the target lane $L_j$ of AV $j$ on its current lane segment. The actual motion planning and steering to track the newly targeted lane are then executed by the lateral controller in (11).
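The lane-change rule can be sketched as a single Boolean check. The sketch assumes that both the safety check and the incentive check in (15) must pass, as in the standard MOBIL formulation, and it applies the safety check to $\tilde{a}_j$ as written in the text. The function name and the numerical values are hypothetical; in practice the accelerations would come from a car-following model such as IDM.

```python
def mobil_lane_change(a_j: float, a_j_new: float,
                      a_o: float, a_o_new: float,
                      a_n: float, a_n_new: float,
                      p: float, b_safe: float, da_th: float) -> bool:
    """Return True if the lane change is both safe and worthwhile.

    a_*      : acceleration before the lane change
    a_*_new  : acceleration after the lane change
    j, o, n  : the AV itself, its old follower, and its new follower
    """
    safe = a_j_new >= -b_safe                                  # safety check
    gain = (a_j_new - a_j) + p * ((a_o_new - a_o) + (a_n_new - a_n))
    return safe and gain >= da_th                              # incentive, Eq. (15)


if __name__ == "__main__":
    print(mobil_lane_change(a_j=0.0, a_j_new=1.0, a_o=0.0, a_o_new=0.2,
                            a_n=0.0, a_n_new=-0.5, p=0.5, b_safe=2.0, da_th=0.2))
```

A politeness coefficient of $p = 0$ makes the decision purely selfish, whereas larger values weigh the acceleration changes of the two followers more heavily.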

III Preliminaries and MOMDP Formulation

This section first provides a primer on the Markov Decision Process (MDP) for single-objective problems and the Multi-Objective Markov Decision Process (MOMDP) for multi-objective problems. We then formulate the considered problem as an MOMDP and discuss the design of the state-action space and the reward function.

III-A Mathematical Preliminaries

III-A1 MDPs

Given a pool of tasks, the interaction between tasks and agents can be defined as an MDP $\mathcal{M}$, represented by the tuple $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, r, \mathcal{P}, \gamma \rangle$, where $\mathcal{S}$, $\mathcal{A}$, and $r$ are the set of states $s$, the set of actions $a$, and the reward $r(s,a)$, respectively, and $\gamma \in [0,1)$ is the discount factor. $r(s,a)$ represents the stochastic instantaneous reward that the agent receives when action $a \in \mathcal{A}$ is taken in state $s \in \mathcal{S}$. $\mathcal{P}(s_{t+1} \mid s_t, a_t)$ denotes the probability of transitioning to the next state $s_{t+1} \in \mathcal{S}$ when the agent takes action $a_t \in \mathcal{A}$ in state $s_t \in \mathcal{S}$ at time step $t$.
In RL, the agent interacts with the environment by following a trajectory $\tau = \{(s_t, a_t)\}_{t=0}^{\infty}$ from time step $t=0$ to the end of the episode, and receives a total discounted reward $r_\tau = \sum_{t=0}^{\infty} \gamma^{t} r_t(s_t, a_t)$. The goal in RL is to find the policy $\pi$, mapping states to actions, that maximizes the discounted total reward. In each time step $t$, the agent selects an action based on its current state to maximize the discounted total reward $r_\tau$. The resulting policy $\pi$ belongs to the policy set $\Pi$. The state-action value function $Q_\pi(s,a)$ of the policy $\pi$ can be given as:

$$Q_\pi(s,a) = \mathbb{E}_\pi\left[r_t(s_t,a_t) + \gamma Q_\pi(s_{t+1},a_{t+1}) \mid s_t = s,\ a_t = a\right] \tag{16}$$

Thus, the optimal policy is given as $\pi^{*} = \arg\max_{\pi \in \Pi} Q_\pi(s,a)$.
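To make the recursion in (16) concrete, the following Python sketch computes a discounted return and evaluates $Q_\pi$ for a fixed policy on a toy MDP. The restriction to deterministic transitions and rewards, the two-state example, and all names are simplifying assumptions for illustration; the formulation above allows stochastic dynamics.

```python
def discounted_return(rewards, gamma):
    """Total discounted reward: sum over t of gamma^t * r_t."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))


def evaluate_q(next_state, reward, policy, gamma, sweeps=500):
    """Iterate Q(s,a) = r(s,a) + gamma * Q(s', pi(s')) to a fixed point.

    next_state[s][a] : deterministic successor state s'
    reward[s][a]     : deterministic reward r(s,a)
    policy[s]        : action chosen by the policy in state s
    """
    n_s, n_a = len(reward), len(reward[0])
    q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(sweeps):
        q = [[reward[s][a]
              + gamma * q[next_state[s][a]][policy[next_state[s][a]]]
              for a in range(n_a)]
             for s in range(n_s)]
    return q


if __name__ == "__main__":
    next_state = [[0, 1], [1, 0]]
    reward = [[0.0, 1.0], [2.0, 0.0]]
    policy = [1, 0]  # move to state 1, then stay there
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
    print(evaluate_q(next_state, reward, policy, gamma=0.5))
```

Because $\gamma < 1$, the backup is a contraction, so repeated sweeps converge to the unique fixed point of the recursion.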

III-A2 MOMDPs

The goal of MORL is to obtain policies that balance $H$ conflicting or competing objectives, where the relative importance (preferences) of each objective may be known or unknown to the agent. Similar to RL, MORL can be formulated as an MOMDP, which extends the MDP by defining a new reward space, preference space, and preference function, i.e., $\mathcal{M} = \langle \mathcal{S}, \mathcal{A}, \mathbf{r}, \mathcal{P}, \Omega, \iota_0 \rangle$, where $\mathbf{r} \in \mathbb{R}^{H}$ is a vector of reward functions corresponding to the $H$ objectives, e.g., $\mathbf{r} = [r^{1}, r^{2}, \dots, r^{H}]$, $\Omega$ is the preference space, $\boldsymbol{\omega} \in \Omega$ is the preference vector corresponding to the $H$ objectives with $\sum_{h=1}^{H} \omega^{h} = 1$, and $\iota_0$ is the probability distribution over initial states. In MOMDPs, a policy $\pi: \mathcal{S} \to \mathcal{A}$ defines a mapping from states to actions with the goal of maximizing a vector of expected rewards. Given the distribution $\iota_0$ and a policy $\pi$, the expected discounted return is defined as:

$$\mathbf{Q}_\pi(s,a,\boldsymbol{\omega}) = \mathbb{E}_\pi\left[\mathbf{r}(s_t,a_t) + \gamma \mathbf{Q}_\pi(s_{t+1},a_{t+1},\boldsymbol{\omega}) \mid s_t = s,\ a_t = a\right] \tag{17}$$

where $\mathbf{r}(s_t,a_t)$ is the immediate vector-valued reward at time step $t$ for the $H$ objectives. Maximizing the expected reward involves solving the following multi-objective optimization (MOO) problem: $\max_\pi \mathbf{Q}_\pi = \max_\pi [Q_\pi^{1}, Q_\pi^{2}, \dots, Q_\pi^{H}]$.

A policy $\pi$ strictly dominates another policy $\pi'$ if $\pi$ achieves values at least as high as $\pi'$ in all objectives and strictly higher in at least one objective:

$$\pi > \pi' \iff \forall h,\ V_\pi^{h} \geq V_{\pi'}^{h} \ \land\ \exists\, h,\ V_\pi^{h} > V_{\pi'}^{h} \tag{18}$$

Furthermore, a policy $\pi$ weakly dominates another policy $\pi'$ if $\pi$ achieves values greater than or equal to those of $\pi'$ in all objectives, i.e., $\pi \geq \pi'$ when $V_\pi^{h} \geq V_{\pi'}^{h},\ \forall h$. A policy $\pi$ is considered Pareto-optimal (or non-dominated) if it is not strictly dominated by any other policy.

Considering all returns from the MOMDP, we have the Pareto frontier set $\mathcal{F}^{*} := \left\{\hat{\mathbf{r}} \mid \nexists\, \hat{\mathbf{r}}' \geq \hat{\mathbf{r}}\right\}$ [24], where $\hat{\mathbf{r}} = \sum_{t=0}^{\infty} \gamma^{t}\, \mathbf{r}(s_t,a_t)$. For all possible preferences in $\Omega$, we define a convex coverage set (CCS) of the Pareto frontier, which contains all returns that maximize the preference-weighted cumulative reward for at least one preference, i.e.,

$$\mathrm{CCS} := \left\{\hat{\mathbf{r}} \in \mathcal{F}^{*} \mid \exists\, \boldsymbol{\omega} \in \Omega \ \text{s.t.}\ \boldsymbol{\omega}^{T}\hat{\mathbf{r}} \geq \boldsymbol{\omega}^{T}\hat{\mathbf{r}}',\ \forall\, \hat{\mathbf{r}}' \in \mathcal{F}^{*}\right\} \tag{19}$$

where $(\cdot)^{T}$ denotes the transpose operator.
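The difference between the Pareto frontier and the CCS can be illustrated on a small finite set of two-objective returns. The sketch below is only an approximation under stated assumptions: the candidate returns are made up, and $\Omega$ is sampled on a uniform grid of preferences rather than treated as a continuous set.

```python
def dominates(u, v):
    """Strict Pareto dominance of Eq. (18): u is no worse everywhere, better somewhere."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_front(returns):
    """Keep the returns that no other return strictly dominates."""
    return [r for r in returns if not any(dominates(o, r) for o in returns)]


def scalarize(w, r):
    """Preference-weighted return w^T r."""
    return sum(wi * ri for wi, ri in zip(w, r))


def convex_coverage_set(returns, n_grid=101):
    """Approximate Eq. (19) for two objectives by sweeping w = (k, 1 - k)."""
    front = pareto_front(returns)
    ccs = set()
    for k in range(n_grid):
        w = (k / (n_grid - 1), 1.0 - k / (n_grid - 1))
        best = max(scalarize(w, r) for r in front)
        ccs.update(r for r in front if abs(scalarize(w, r) - best) < 1e-12)
    return sorted(ccs)


if __name__ == "__main__":
    returns = [(4.0, 0.0), (0.0, 4.0), (1.5, 1.5), (1.0, 1.0)]
    print(pareto_front(returns))         # (1.0, 1.0) is dominated
    print(convex_coverage_set(returns))  # (1.5, 1.5) is Pareto-optimal but not in the CCS
```

The return $(1.5, 1.5)$ is non-dominated, yet no linear preference ever selects it, because it lies below the line joining $(4, 0)$ and $(0, 4)$. This is why the CCS is in general a subset of the Pareto frontier.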

III-B MOMDP Formulation

In this section, we formulate the considered problem as an MOMDP and discuss the corresponding design of the state, action, and reward. The state transitions and rewards are functions of the AV environment and the actions taken by the AV.

III-B1 State Space

The state space consists of kinematics-related features and is represented as an $M_1 \times F$ array that describes the $F$ features $\{x_j, y_j, v_j, \psi_j, n_R^{j}, n_T^{j}\}$ of each target AV. We consider $M_1$ target AVs and $M_2$ surrounding AVs. Each target AV is characterized by its (1) coordinates $(x_j, y_j)$, (2) forward velocity $v_j$, (3) heading $\psi_j$, and (4) $n_R^{j}$ and $n_T^{j}$, which are the numbers of RBSs and TBSs, respectively, within a predefined radius of its current position that allow the AV to achieve the desired data rate. Accordingly, the aggregated state space $\mathcal{S}$ at any time step $t$ is given by:

$$\mathcal{S}=\begin{bmatrix}x_{1}&y_{1}&v_{1}&\psi_{1}&n_{R}^{1}&n_{T}^{1}\\ \vdots&\vdots&\vdots&\vdots&\vdots&\vdots\\ x_{M_{1}}&y_{M_{1}}&v_{M_{1}}&\psi_{M_{1}}&n_{R}^{M_{1}}&n_{T}^{M_{1}}\end{bmatrix}$$
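As a minimal sketch of how the $M_{1}\times F$ state array could be assembled, the following Python snippet stacks one feature row per target AV. The `TargetAV` container and the `build_state` helper are hypothetical names introduced here for illustration only; they are not part of the simulator used in this work, and the numeric values are placeholders.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TargetAV:
    """Hypothetical container for the F = 6 per-vehicle features."""
    x: float    # x-coordinate
    y: float    # y-coordinate
    v: float    # forward velocity
    psi: float  # heading
    n_R: int    # RBSs within the radius that meet the desired data rate
    n_T: int    # TBSs within the radius that meet the desired data rate


def build_state(avs: List[TargetAV]) -> List[List[float]]:
    """Return the M1 x F state array, one row per target AV."""
    return [[av.x, av.y, av.v, av.psi, float(av.n_R), float(av.n_T)]
            for av in avs]


# Placeholder values for two target AVs (M1 = 2).
state = build_state([
    TargetAV(x=10.0, y=4.0, v=25.0, psi=0.0, n_R=2, n_T=5),
    TargetAV(x=42.5, y=8.0, v=30.0, psi=0.1, n_R=1, n_T=3),
])
```

Each row follows the column order of $\mathcal{S}$, so the array has $M_{1}$ rows and six columns.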

III-B2 Two Dimensional Action Space

The action space consists of the self-driving action space $\mathcal{A}_{\mathrm{tran}}$ and the telecommunication action space $\mathcal{A}_{\mathrm{tele}}$, which include 5 and 3 discrete actions, respectively. At each time step $t$, the AV must select one driving-related action and one telecommunication-related action from the joint action space, as shown below:

$$\mathcal{A}=\begin{bmatrix}\{a_{\rm tele}^{1},a_{\rm tran}^{1}\}&\{a_{\rm tele}^{1},a_{\rm tran}^{2}\}&\dots&\{a_{\rm tele}^{1},a_{\rm tran}^{5}\}\\ \vdots&\vdots&\vdots&\vdots\\ \{a_{\rm tele}^{3},a_{\rm tran}^{1}\}&\{a_{\rm tele}^{3},a_{\rm tran}^{2}\}&\dots&\{a_{\rm tele}^{3},a_{\rm tran}^{5}\}\end{bmatrix}$$

Note that $\mathcal{A}_{\rm tran}=\{a_{\rm tran}^{1},\ldots,a_{\rm tran}^{5}\}$, where $a_{\rm tran}^{1}$, $a_{\rm tran}^{2}$, and $a_{\rm tran}^{3}$ indicate that the AV changes its lane to the left, maintains the same lane, and changes its lane to the right, respectively, while $a_{\rm tran}^{4}$ and $a_{\rm tran}^{5}$ indicate acceleration and deceleration of the AV within the same lane. It is important to note that the acceleration and deceleration rates are dynamically determined by the model in Section II-B. Consequently, AVs that select the same action do not necessarily apply identical accelerations or decelerations.
The communication action space is represented as $\mathcal{A}_{\rm tele}=\{a_{\rm tele}^{1},a_{\rm tele}^{2},a_{\rm tele}^{3}\}$. In $a_{\rm tele}^{1}$, the AV selects a BS by maximizing the weighted data rate metric (defined by equation (6) in Section II-A), which encourages traffic load balancing between BSs and discourages unnecessary HOs, especially for TBSs. In $a_{\rm tele}^{2}$, the AV selects the BS with the maximum $\text{WR}_{ij}$ obtained by substituting $\mu=0$, if $Q_{i}\geq n_{i}$. Otherwise, the AV recursively selects the next vacant best-performing BS in terms of $\text{WR}_{ij}$. In $a_{\rm tele}^{3}$, the AV chooses to connect to the BS with the maximum data rate $R_{ij}$.
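The $3\times 5$ joint action matrix can be enumerated and indexed as sketched below. The row-major flat encoding and the names `JOINT_ACTIONS` and `decode_action` are assumptions made for illustration; the source does not specify how the joint action is encoded for the Q-network output.

```python
from itertools import product

TELE_ACTIONS = (1, 2, 3)        # superscripts of a_tele^1 .. a_tele^3
TRAN_ACTIONS = (1, 2, 3, 4, 5)  # superscripts of a_tran^1 .. a_tran^5

# Row-major enumeration of the 3 x 5 joint action matrix (assumed ordering).
JOINT_ACTIONS = list(product(TELE_ACTIONS, TRAN_ACTIONS))


def decode_action(flat_index: int):
    """Map a flat index in [0, 14] to the pair (tele index, tran index)."""
    if not 0 <= flat_index < len(JOINT_ACTIONS):
        raise ValueError("flat_index must lie in [0, 14]")
    tele, tran = divmod(flat_index, len(TRAN_ACTIONS))
    return TELE_ACTIONS[tele], TRAN_ACTIONS[tran]
```

With this encoding, a network with 15 outputs can score every joint action, and the selected output index is decoded into its telecommunication and driving components before being applied.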

III-B3 Rewards

The design of the associated reward is directly related to optimizing the driving policy and network selection, and is critical for accelerating the convergence of the model. Generally, the AV is given a positive reward when it receives a higher HO-aware data rate while guaranteeing safe driving. By taking any other actions that may lead to an increase in HOs, collisions, or traffic violations, the AV receives a penalty. We define the transportation reward as [16]:

$$r^{j,\mathrm{tran}}_{t}=c_{1}\left(\frac{v^{j}_{t}-v_{\mathrm{min}}}{v_{\mathrm{max}}-v_{\mathrm{min}}}\right)-c_{2}\cdot\delta_{2}+c_{3}\cdot\delta_{3}+c_{4}\cdot\delta_{4}, \tag{20}$$

where $v_{t}^{j}$, $v_{\min}$, and $v_{\max}$ are the current longitudinal velocity of AV $j$ at time $t$ and the minimum and maximum speed limits, respectively, and $\delta_{2}$, $\delta_{3}$, and $\delta_{4}$ are the collision indicator, the right-lane indicator, and the on-road indicator, respectively. $c_{1}$ and $c_{2}$ are the weights that balance the AV transportation reward against its collision penalty. $c_{1}$ is the reward received when driving at full speed, linearly mapped to zero for lower speeds. $c_{3}$ is the reward for driving on the right-most lane, linearly mapped to zero for other lanes. $c_{4}$ is the on-road reward factor, which penalizes the AV for driving off the highway. It is important to note that negative rewards are not allowed, since they might encourage the agent to end an episode early by causing a collision rather than risk receiving a negative return when no satisfactory trajectory is available.
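The transportation reward in (20) can be evaluated per AV and per time step as in the following sketch. The function name `transportation_reward` and the default weights are hypothetical placeholders chosen for illustration; they are not the coefficient values used in the experiments.

```python
def transportation_reward(v, v_min, v_max, collided, right_lane, on_road,
                          c1=0.4, c2=1.0, c3=0.1, c4=0.5):
    """Evaluate Eq. (20) for one AV at one time step.

    collided, right_lane, and on_road play the roles of delta_2, delta_3,
    and delta_4. The default weights c1..c4 are illustrative placeholders.
    """
    if v_max <= v_min:
        raise ValueError("v_max must exceed v_min")
    speed_term = (v - v_min) / (v_max - v_min)  # 1 at full speed, 0 at v_min
    return c1 * speed_term - c2 * collided + c3 * right_lane + c4 * on_road
```

For example, an AV driving at `v_max` on the right-most lane without a collision collects the speed, lane, and on-road terms in full, whereas a collision subtracts `c2` from the total.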

For the telecommunication side, we define the reward for AV $j$ associated with BS $i^{*}$ at time step $t$ as:

$$r^{j,\mathrm{tele}}_{t}=c_{5}\,\text{WR}_{i^{*},j,t}\left(1-\min(1,\xi^{j}_{t})\right), \tag{21}$$

where $\text{WR}_{i^{*},j,t}$ is the achievable data rate computed by (6) and $\xi^{j}_{t}$ is the HO probability of AV $j$, computed by dividing the number of HOs counted up to the current time $t$ by the duration of the previous time slots in the episode.

Footnote 3: Note that $c_{1},\dots,c_{5}$ are the weights that set the priority of each term. For instance, $c_{2}$ needs to be sufficiently large compared to the other coefficients to ensure collision avoidance; the highest penalty applies to vehicle collision.
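A minimal sketch of the telecommunication reward in (21) is given below. The helper names, the default `c5=1.0`, and the handling of zero elapsed time are assumptions introduced for illustration; the source only states that the HO probability is the HO count divided by the elapsed duration.

```python
def handoff_probability(num_handoffs, elapsed_time):
    """HOs counted so far divided by the elapsed time in the episode."""
    if elapsed_time <= 0:
        return 0.0  # assumption: no HO penalty before any time has elapsed
    return num_handoffs / elapsed_time


def telecom_reward(weighted_rate, num_handoffs, elapsed_time, c5=1.0):
    """Evaluate Eq. (21): the rate is discounted by the clipped HO probability."""
    xi = handoff_probability(num_handoffs, elapsed_time)
    return c5 * weighted_rate * (1.0 - min(1.0, xi))
```

Clipping with `min(1.0, xi)` keeps the multiplier in $[0,1]$, so frequent HOs can reduce the reward to zero but never make it negative.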

Based on the instantaneous reward, we compute the accumulated reward, which is the sum of the discounted rewards over all target AVs on the highway in each training episode. The expected return is defined as follows:

$$\mathbf{Q}_{\pi}(s,a,\boldsymbol{\omega})=\mathbb{E}_{\pi}\left[\sum_{j=1}^{M_{1}}r^{j,\mathrm{tran}}_{t},\;\sum_{j=1}^{M_{1}}r^{j,\mathrm{tele}}_{t}\right] \tag{22}$$

Our MOMDP optimal strategy for maximizing the expected reward involves the simultaneous maximization of both the transportation and telecommunication objectives, i.e., $\max_{\pi}\mathbf{Q}_{\pi}(s,a,\boldsymbol{\omega})$.

IV Proposed Single-Policy and Multi-Policy MORL Algorithms

In contrast to conventional DRL, MORL requires the agent to optimize multiple objectives simultaneously. These objectives might have predefined preferences, or the preferences could be unknown. In this section, we first investigate single-policy solutions to the MORL problem with predefined preferences in Section IV-A. Subsequently, a multi-policy envelope solution for MORL is proposed in Section IV-B for cases where the preferences are unknown.

IV-A Single-Policy MORL Algorithms

Given a set of preferences in MORL problems, single-policy algorithms aim to scalarize the reward value to determine the best policy, considering the relative priorities assigned to competing objectives. We explore two DRL methods for MORL, DQN [1] and DDQN [25], each of which employs a neural network parameterized by $\boldsymbol{\theta}_{t}$ (at each time step $t$) to approximate the $Q$-value function for a state-action pair, i.e., $Q(s_{t},a_{t};\boldsymbol{\theta}_{t})$. After taking action $a_{t}$ in state $s_{t}$ and receiving the instant reward $r_{t+1}$, we can formulate a target $Q$ function $\hat{Q}(s_{t},a_{t})$ as in Eq. 16, which is used to optimize the neural network parameters $\boldsymbol{\theta}_{t}$ using gradient descent, as given in [25],

$$\boldsymbol{\theta}_{t+1}=\boldsymbol{\theta}_{t}+\kappa\left(\hat{Q}(s_{t},a_{t})-Q(s_{t},a_{t};\boldsymbol{\theta}_{t})\right)\nabla_{\boldsymbol{\theta}_{t}}Q(s_{t},a_{t};\boldsymbol{\theta}_{t}), \tag{23}$$

where $\kappa$ is a positive scalar representing the learning rate. To learn a single policy for multiple tasks, we scalarize the reward vector by applying the predefined priority of each objective function [1], where a weighted reward function $r_{t}=\sum_{j=1}^{M_{1}}r^{j,\mathrm{tran}}_{t}+\sum_{j=1}^{M_{1}}r^{j,\mathrm{tele}}_{t}$ is defined to facilitate the conversion of multi-dimensional rewards into a scalar value.
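The scalarization step can be sketched as follows. The explicit preference vector `omega` is an assumption added for illustration: in the text the priorities are carried by the coefficients $c_{1},\dots,c_{5}$ inside each reward term, and the default `omega=(1.0, 1.0)` reproduces $r_{t}$ exactly as written above. The function names are hypothetical.

```python
def reward_vector(tran_rewards, tele_rewards):
    """Sum per-AV rewards into the two-objective vector (tran, tele)."""
    return (sum(tran_rewards), sum(tele_rewards))


def scalarize(vec, omega=(1.0, 1.0)):
    """Weighted sum of the objectives; omega is an assumed preference vector."""
    if len(vec) != len(omega):
        raise ValueError("vec and omega must have the same length")
    return sum(w * r for w, r in zip(omega, vec))


# Placeholder per-AV rewards for M1 = 2 target AVs.
vec = reward_vector([0.5, 1.0], [2.0, 3.0])
r_t = scalarize(vec)
```

Keeping the vector and the scalarization separate makes it straightforward to reuse the same per-objective rewards when the preferences are not fixed in advance.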

IV-A1 Multi-Objective Deep Q-Network (MO-DQN)

The proposed MO-DQN method incorporates a target Q-network and experience replay to stabilize the learning process and ensure convergence, as discussed in the following:

  • Target Network: A second neural network with parameters $\boldsymbol{\theta}_{t}^{-}$ is introduced to compute the target $Q$ value at each time step $t$. It has the same architecture as the network parameterized by $\boldsymbol{\theta}_{t}$, but its parameters are frozen. Specifically, $\boldsymbol{\theta}_{t}^{-}$ copies the parameters from $\boldsymbol{\theta}_{t}$ only every $N^{-}$ steps and remains fixed until the next scheduled update [26]. The target value for the MO-DQN is defined as:

    $$\hat{Q}(s_{t},a_{t})=r_{t+1}+\gamma\max_{a}Q(s_{t+1},a;\boldsymbol{\theta}_{t}^{-}) \tag{24}$$
  • Experience Replay: To address issues related to correlations between sequential observations and to improve data efficiency, MO-DQN utilizes the experience replay mechanism, which stores past transition tuples $(s_{z},a_{z},s_{z+1},r_{z})$ in a replay buffer $\mathcal{D}_{\mathcal{T}}$ of size $N_{\mathcal{T}}$, i.e., $z\in\{1,\dots,N_{\mathcal{T}}\}$. During the training phase, mini-batches of these transitions are randomly sampled from the buffer. This method not only reduces the variance of each update but also allows the neural network to benefit from learning across a diverse range of past experiences, thus avoiding local optima and overfitting [27, 28].
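The experience replay mechanism described above can be sketched with a bounded buffer and uniform random sampling. The class name `ReplayBuffer` and the first-in-first-out eviction of the oldest transitions are assumptions made for illustration; the source does not state how a full buffer is handled.

```python
import random
from collections import deque


class ReplayBuffer:
    """Bounded store of transitions (s, a, s_next, r) with uniform sampling."""

    def __init__(self, capacity):
        # Assumption: once full, the oldest transition is discarded first.
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, s_next, r):
        self.buffer.append((s, a, s_next, r))

    def sample(self, batch_size, rng=random):
        # Never request more transitions than are currently stored.
        batch_size = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

Sampling uniformly at random breaks the temporal correlation between consecutive transitions, which is the property the text relies on.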

IV-A2 Multi-Objective Double DQN (MO-DDQN)

Unlike MO-DQN, where a single set of weights is used both to select and to evaluate actions when forming the target, MO-DDQN utilizes a separate set of parameters $\boldsymbol{\theta}_{t}^{\prime}$ to evaluate the value of the policy, ensuring a more reliable estimate by decoupling the selection and evaluation of actions. Given $\boldsymbol{\theta}_{t}$ and $\boldsymbol{\theta}_{t}^{\prime}$ corresponding to the evaluation and target $Q$ networks, respectively, the target value function in MO-DDQN [25] is updated as follows:

$$\hat{Q}(s_{t},a_{t})=r_{t}+\gamma Q\left(s_{t+1},\arg\max_{a}Q(s_{t+1},a;\boldsymbol{\theta}_{t});\boldsymbol{\theta}_{t}^{\prime}\right) \tag{25}$$

where the action selection is guided by the online weights $\boldsymbol{\theta}_{t}$.
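The difference between the two targets can be made concrete with a small numerical sketch. The functions below take the next-state Q-values as plain lists; the names `dqn_target` and `ddqn_target`, the `done` flag, and the example numbers are illustrative assumptions. The DQN target uses the standard maximum over the target-network values, and the DDQN target follows (25).

```python
def dqn_target(r, q_next_target, gamma, done=False):
    """Standard DQN target: the target network both selects and evaluates."""
    if done:
        return r
    return r + gamma * max(q_next_target)


def ddqn_target(r, q_next_online, q_next_target, gamma, done=False):
    """Double DQN target: online network selects, target network evaluates."""
    if done:
        return r
    best = max(range(len(q_next_online)), key=q_next_online.__getitem__)
    return r + gamma * q_next_target[best]


# Illustrative next-state Q-values over three actions.
q_online = [1.0, 3.0, 2.0]
q_target = [2.5, 1.5, 4.0]
y_dqn = dqn_target(1.0, q_target, gamma=0.9)
y_ddqn = ddqn_target(1.0, q_online, q_target, gamma=0.9)
```

Here the online network prefers the second action, so the DDQN target evaluates that action with the target network instead of taking the largest target-network value, which yields a smaller target in this example.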

In MO-DQN and MO-DDQN, the neural network parameterized by $\boldsymbol{\theta}_{t}$ associated with the evaluation function is updated by minimizing the mean square error loss $\mathcal{L}(\boldsymbol{\theta})$ between $Q$ and $\hat{Q}$ as follows [26]:

$$\mathcal{L}(\boldsymbol{\theta}_{t})=\mathbb{E}_{z\in\mathcal{D}_{\mathcal{T}}}\left(Q(s_{z},a_{z};\boldsymbol{\theta}_{t})-\hat{Q}(s_{z},a_{z})\right)^{2} \tag{26}$$
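To see how the loss in (26) relates to the parameter update in (23), consider a single sampled transition and treat the target $\hat{Q}$ as a constant with respect to $\boldsymbol{\theta}_{t}$. The factor $\tfrac{1}{2}$ and the per-sample loss $\ell$ below are assumptions introduced for convenience (the factor can be absorbed into the learning rate $\kappa$); neither appears in (26).

```latex
\begin{align*}
\ell(\boldsymbol{\theta}_{t})
  &= \tfrac{1}{2}\left(Q(s_{t},a_{t};\boldsymbol{\theta}_{t})-\hat{Q}(s_{t},a_{t})\right)^{2},\\
\nabla_{\boldsymbol{\theta}_{t}}\ell(\boldsymbol{\theta}_{t})
  &= -\left(\hat{Q}(s_{t},a_{t})-Q(s_{t},a_{t};\boldsymbol{\theta}_{t})\right)
     \nabla_{\boldsymbol{\theta}_{t}}Q(s_{t},a_{t};\boldsymbol{\theta}_{t}),\\
\boldsymbol{\theta}_{t+1}
  &= \boldsymbol{\theta}_{t}-\kappa\,\nabla_{\boldsymbol{\theta}_{t}}\ell(\boldsymbol{\theta}_{t})
   = \boldsymbol{\theta}_{t}+\kappa\left(\hat{Q}(s_{t},a_{t})-Q(s_{t},a_{t};\boldsymbol{\theta}_{t})\right)
     \nabla_{\boldsymbol{\theta}_{t}}Q(s_{t},a_{t};\boldsymbol{\theta}_{t}).
\end{align*}
```

The last line coincides with (23), so a gradient descent step on the squared error of one transition is the stochastic counterpart of minimizing the expectation in (26) over the replay buffer.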

The proposed MO-DQN and MO-DDQN are illustrated in Fig. 2. The training procedure of the proposed MO-DQN and MO-DDQN is given in Algorithm 1.

Although single-policy methods are adequate when we possess prior knowledge of task preferences, the acquired policy is constrained in its adaptability to situations with varying preferences. For instance, collision avoidance may not remain a high priority in highway environments with reduced vehicle density. Also, in traffic jams or parking lots where AVs are still, the preference for telecommunication rewards becomes higher. In the next section, we seek to design the multi-policy algorithm that handles unknown preferences in multi-objective RL scenarios.

Result: Learned action-value function Q𝜽subscript𝑄𝜽Q_{\boldsymbol{\theta}}italic_Q start_POSTSUBSCRIPT bold_italic_θ end_POSTSUBSCRIPT and Policy π𝜋\piitalic_π
Data: Evaluation Q𝑄Qitalic_Q-network Q𝑄Qitalic_Q with weights 𝜽𝜽{\boldsymbol{\theta}}bold_italic_θ, Target Q𝑄Qitalic_Q-network Q^^𝑄\hat{Q}over^ start_ARG italic_Q end_ARG with weights 𝜽superscript𝜽{\boldsymbol{\theta}^{\prime}}bold_italic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT (for MO-DDQN only), Experience replay memory 𝒟𝒯subscript𝒟𝒯\mathcal{D}_{\mathcal{T}}caligraphic_D start_POSTSUBSCRIPT caligraphic_T end_POSTSUBSCRIPT, Mini-batch size N𝒯subscript𝑁𝒯N_{\mathcal{T}}italic_N start_POSTSUBSCRIPT caligraphic_T end_POSTSUBSCRIPT, Horizon limit of each episode Thlsubscript𝑇𝑙T_{hl}italic_T start_POSTSUBSCRIPT italic_h italic_l end_POSTSUBSCRIPT.
Initialization:
Experience replay memory 𝒟𝒯subscript𝒟𝒯\mathcal{D}_{\mathcal{T}}\leftarrow\emptysetcaligraphic_D start_POSTSUBSCRIPT caligraphic_T end_POSTSUBSCRIPT ← ∅,
Initialize Q𝑄Qitalic_Q-network weights 𝜽𝜽\boldsymbol{\theta}bold_italic_θ randomly,
For MO-DDQN: Initialize target network weights 𝜽𝜽superscript𝜽𝜽\boldsymbol{\theta}^{\prime}\leftarrow\boldsymbol{\theta}bold_italic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ← bold_italic_θ,
Initialize Q(s,a)𝑄𝑠𝑎Q(s,a)italic_Q ( italic_s , italic_a ) for all states s𝑠sitalic_s and actions a𝑎aitalic_a, including AVs, TBSs, and RBSs.
while episode <<< episode limit and runtime <<< time limit do
       Initialize t0𝑡0t\leftarrow 0italic_t ← 0 and state stsubscript𝑠𝑡s_{t}italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT based on environment
       while tThl𝑡subscript𝑇𝑙t\leq T_{hl}italic_t ≤ italic_T start_POSTSUBSCRIPT italic_h italic_l end_POSTSUBSCRIPT  do
            
            RL agent select atsubscript𝑎𝑡a_{t}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT from 𝒜𝒜\mathcal{A}caligraphic_A with probability ϵitalic-ϵ\epsilonitalic_ϵ or select atsubscript𝑎𝑡a_{t}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT from maxa𝒜Q(st,at;𝜽)subscript𝑎𝒜𝑄subscript𝑠𝑡subscript𝑎𝑡𝜽\max_{a\in\mathcal{A}}{Q(s_{t},a_{t};\boldsymbol{\theta})}roman_max start_POSTSUBSCRIPT italic_a ∈ caligraphic_A end_POSTSUBSCRIPT italic_Q ( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ; bold_italic_θ ) with probability of 1ϵ1italic-ϵ1-\epsilon1 - italic_ϵ.
             Derive attransubscriptsuperscript𝑎tran𝑡a^{\mathrm{tran}}_{t}italic_a start_POSTSUPERSCRIPT roman_tran end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and attelesubscriptsuperscript𝑎tele𝑡a^{\mathrm{tele}}_{t}italic_a start_POSTSUPERSCRIPT roman_tele end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT from atsubscript𝑎𝑡a_{t}italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT
             Apply attransubscriptsuperscript𝑎tran𝑡a^{\text{tran}}_{t}italic_a start_POSTSUPERSCRIPT tran end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and attelesubscriptsuperscript𝑎tele𝑡a^{\text{tele}}_{t}italic_a start_POSTSUPERSCRIPT tele end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT,
             observe reward rtsubscript𝑟𝑡r_{t}italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT and next state st+1subscript𝑠𝑡1s_{t+1}italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT.
             Store transition (st,at,st+1,rt)subscript𝑠𝑡subscript𝑎𝑡subscript𝑠𝑡1subscript𝑟𝑡(s_{t},a_{t},s_{t+1},r_{t})( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT , italic_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ) in 𝒟𝒯subscript𝒟𝒯\mathcal{D}_{\mathcal{T}}caligraphic_D start_POSTSUBSCRIPT caligraphic_T end_POSTSUBSCRIPT.
             Experience Replay: Sample a mini-batch of transitions (sz,az,rz,sz+1)subscript𝑠𝑧subscript𝑎𝑧subscript𝑟𝑧subscript𝑠𝑧1(s_{z},a_{z},r_{z},s_{z+1})( italic_s start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , italic_r start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_z + 1 end_POSTSUBSCRIPT ) from 𝒟𝒯subscript𝒟𝒯\mathcal{D}_{\mathcal{T}}caligraphic_D start_POSTSUBSCRIPT caligraphic_T end_POSTSUBSCRIPT,
             where z{1,,N𝒯}𝑧1subscript𝑁𝒯z\in\{1,\ldots,N_{\mathcal{T}}\}italic_z ∈ { 1 , … , italic_N start_POSTSUBSCRIPT caligraphic_T end_POSTSUBSCRIPT }.
             Set target-Q𝑄Qitalic_Q for each sampled transition:
             for each transition z𝑧zitalic_z do
                   if episode ends at step z+1𝑧1z+1italic_z + 1 then
                         Q^(sz,az)=rz^𝑄subscript𝑠𝑧subscript𝑎𝑧subscript𝑟𝑧\hat{Q}(s_{z},a_{z})=r_{z}over^ start_ARG italic_Q end_ARG ( italic_s start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ) = italic_r start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT
                   else
                         Use Q^^𝑄\hat{Q}over^ start_ARG italic_Q end_ARG to compute Q^(sz,az)^𝑄subscript𝑠𝑧subscript𝑎𝑧\hat{Q}(s_{z},a_{z})over^ start_ARG italic_Q end_ARG ( italic_s start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT ) according to MO-DQN or MO-DDQN update by (24), (25).
                   end if
                  
             end for
             Perform a gradient descent step on (26) with respect to network parameters $\boldsymbol{\theta}$
             if MO-DDQN then
                   Update target $\hat{Q}$ weights $\boldsymbol{\theta}' \leftarrow \boldsymbol{\theta}$ every $N^{-}$ steps;
             end if
             $t \leftarrow t+1$
       end while
       Update policy $\pi$ based on learned $Q$.
end while
Algorithm 1 Multi-Objective Double Deep Q-Learning
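
The target computation in the inner loop of Algorithm 1 can be sketched as follows. Since (24) and (25) are not restated in this section, the sketch assumes the standard DQN and double-DQN targets applied to an already scalarized reward. The function name `td_target` and its arguments are illustrative and are not taken from the paper.

```python
def td_target(reward, done, q_eval_next, q_target_next, gamma=0.99, double=True):
    """Scalar TD target for one sampled transition z.

    reward        -- scalarized reward r_z
    done          -- True if the episode ends at step z+1
    q_eval_next   -- list of Q(s_{z+1}, a; theta) for every action a
    q_target_next -- list of Q(s_{z+1}, a; theta') for every action a
                     (pass the same list twice if no target network is used)
    """
    if done:
        # Terminal transition: the target is the reward alone, no bootstrapping.
        return reward
    if double:
        # Assumed double-DQN form: the evaluation network picks the action,
        # the target network scores it.
        best_action = max(range(len(q_eval_next)), key=q_eval_next.__getitem__)
        return reward + gamma * q_target_next[best_action]
    # Assumed DQN form: one set of Q-values both picks and scores the action.
    return reward + gamma * max(q_target_next)
```

Decoupling action selection from action evaluation is the usual motivation for the double variant, since taking the maximum over a single noisy estimate tends to overestimate the target.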

IV-B Multi-Policy Envelope MORL Algorithm

Figure 2: Comparison of MO-DQN, MO-DDQN, and the proposed MO-DDQN-envelope framework

In contrast to single-policy methods, multi-policy MORL methods optimize different objectives simultaneously by maximizing a vector of rewards associated with these objectives. Our proposed MORL framework reduces reliance on predefined preferences and scalar reward combinations, enabling dynamic adjustment to associated tasks featuring distinct preferences. This approach is effective in identifying Pareto-optimal policies when preferences are unknown.

Our proposed MO-DDQN-envelope algorithm is designed to learn a spectrum of Pareto-optimal policies simultaneously (as defined in Section III-A2) in a preference space $\Omega$, as illustrated in Fig. 2. Different from the Envelope-MOQ model in [24], the proposed MO-DDQN-envelope algorithm incorporates DDQN instead of the original REINFORCE algorithm to improve convergence and sample efficiency during training.

During each time step, observation information is captured in the RF-THz-Highway environment. From this observation, the tuple $\{s_t, s_{t+1}, \boldsymbol{\omega}_t\}$ is computed. Once the state information has been acquired, the hindsight experience replay (HER) technique is employed to sample transitions from the replay pool $\mathcal{D}_{\mathcal{T}}$ and preference weights from the preference pool $\mathcal{D}_{\omega}$. Then, homotopy optimization is applied to execute gradient descent, as indicated in (32). Subsequently, the $Q$-network is cloned from the evaluation network to the target network every $N^{-}$ steps. Notably, unlike prior single-policy MORL approaches that scalarize rewards before the experience replay, MO-DDQN-envelope scalarizes rewards after gradient descent. We elaborate on the Bellman operator update phase, the HER phase, and the homotopy optimization phase in what follows.

IV-B1 Bellman Operation with Optimal Filter

In the context described in Section III-A and following [24], the expected discounted return under a policy $\pi$ is defined as $\mathbf{Q}_{\pi}(s,a,\boldsymbol{\omega}) = \mathbb{E}_{\pi}\left[\mathbf{r}(s_t,a_t) + \gamma \mathbf{Q}_{\pi}(s_{t+1},a_{t+1},\boldsymbol{\omega})\right]$. Yang et al. [24] further introduce the concept of an optimal filter $\mathcal{H}$, which is applied to $\mathbf{Q}_{\pi}(s,a,\boldsymbol{\omega})$ to obtain $(\mathcal{H}\mathbf{Q})_{\pi}(s,a,\boldsymbol{\omega}) = \arg_{Q}\sup_{a\in\mathcal{A},\boldsymbol{\omega}'\in\Omega}\mathbf{Q}_{\pi}(s,a,\boldsymbol{\omega}')$. The optimal filter $\mathcal{H}$ is instrumental in solving the convex envelope of the PPF, which represents the current solution frontier, and this process is key to optimizing the Q-function $\mathbf{Q}_{\pi}$ for a given state $s$ and preference weights $\boldsymbol{\omega}$. Here, $\boldsymbol{\omega}$ is optimized through a process that balances the preference between objectives, i.e., transportation and telecommunication. The $\arg_{Q}$ operator returns the multi-objective value that attains the supremum, ensuring that $(a,\boldsymbol{\omega}')$ achieves the supremum across actions in the space $\mathcal{A}$ and preference weights $\boldsymbol{\omega}'$ within the space $\Omega$. Consequently, we can streamline (17) to focus the optimization on actions solely dependent on $\mathcal{H}$. The MO optimality operator can thus be defined as:

$$\mathbf{Q}(s,a,\boldsymbol{\omega}) = \mathbb{E}_{s_{t+1}}\left[\mathbf{r}(s_t,a_t) + \gamma\,(\mathcal{H}\mathbf{Q})(s_{t+1},\boldsymbol{\omega})\right] \tag{27}$$

Based on the specific Bellman operator and the optimal filter $\mathcal{H}$, we maintain the convex envelope $\sup_{\boldsymbol{\omega}'}\boldsymbol{\omega}^{T}\mathbf{Q}_{\pi}(s,a,\boldsymbol{\omega}')$, corresponding to new preference weights for optimal MO rewards. These rewards may not be optimized by other past preference weights during training. By contrast, the single-policy scalarized updates of the benchmark approaches discussed in Section IV-A fail to optimize the scalar utility for acquiring the optimal solution with varying $\boldsymbol{\omega}$, because they cannot leverage information from $\max_{a}\mathbf{Q}_{\pi}(s,a,\boldsymbol{\omega})$.
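
As a minimal sketch of how the optimal filter $\mathcal{H}$ acts for one fixed state, the following Python snippet scans a finite table of vector Q-values and returns the vector that attains the largest scalarized value $\boldsymbol{\omega}^{T}\mathbf{Q}$. The table layout, the integer action and preference indices, and all numbers are illustrative assumptions. In the paper the Q-values come from a neural network, and the supremum over $\Omega$ is replaced here by a maximum over a finite set.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def envelope_filter(q_table, omega):
    """Return (Q vector, (action, pref_index)) maximizing omega^T Q.

    q_table -- dict {(action, pref_index): [Q_tran, Q_tele]} for one fixed state
    omega   -- preference vector used to scalarize each candidate
    """
    best_key = max(q_table, key=lambda key: dot(omega, q_table[key]))
    return q_table[best_key], best_key


# Hypothetical vector Q-values for 2 actions and 2 candidate preferences.
q_table = {
    (0, 0): [1.0, 0.2],
    (0, 1): [0.8, 0.5],
    (1, 0): [0.3, 1.0],
    (1, 1): [0.1, 0.9],
}

# A transportation-leaning and a telecommunication-leaning preference
# select different points on the envelope.
q_tran, key_tran = envelope_filter(q_table, [0.9, 0.1])
q_tele, key_tele = envelope_filter(q_table, [0.1, 0.9])
```

With these numbers, the transportation-leaning preference selects the entry at `(0, 0)` and the telecommunication-leaning preference selects the entry at `(1, 0)`, so a single table serves both preferences.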

IV-B2 Hindsight Experience Replay (HER)

HER is a method for training an RL agent to handle multiple preferences that serve multiple objectives [24, 29]. The RL agent follows a policy based on a randomly selected goal in each episode and uses the previous trajectory to update other goals simultaneously.

In our enhanced MO-DDQN-envelope network, leveraging HER, we sample from two distinct replay pools, $\mathcal{D}_{\mathcal{T}}$ and $\mathcal{D}_{\omega}$, which provide transition mini-batches and preference vectors, respectively. We draw $N_{\mathcal{T}}$ mini-batch transitions $(s_z,a_z,\mathbf{r}_z,s_{z+1})$ from the replay buffer pool $\mathcal{D}_{\mathcal{T}}$, i.e., $(s_z,a_z,\mathbf{r}_z,s_{z+1})\sim\mathcal{D}_{\mathcal{T}}$, where $z\in[1,N_{\mathcal{T}}]$. Concurrently, we sample preference vectors $\omega_g$ from $\mathcal{D}_{\omega}$ to form the replay buffer $\mathcal{W}\equiv\{\omega_g\sim\mathcal{D}_{\omega}\}$ with $g\in[1,N_{\omega}]$, where $N_{\omega}$ denotes the number of preference weights in $\mathcal{W}$. Therefore, the agent AV can replay the trajectories with any preferences using "hindsight", since preferences only impact the agent AV's actions rather than the highway environment dynamics [24].
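
The two-pool sampling described above can be sketched as follows. The helper names, the tuple layout of a transition, and the way a preference is drawn (normalized exponential draws, which yields a uniform sample on the simplex) are assumptions made for illustration, since this section does not specify the distribution behind $\mathcal{D}_{\omega}$.

```python
import random


def sample_preference(n_objectives, rng):
    """Draw one preference vector with non-negative entries summing to 1.

    Assumption: D_omega is modeled by normalized exponential draws.
    """
    raw = [rng.expovariate(1.0) for _ in range(n_objectives)]
    total = sum(raw)
    return [x / total for x in raw]


def sample_her_batch(replay_pool, n_transitions, n_preferences, n_objectives, rng):
    """Pair every sampled transition with every sampled preference."""
    # (s_z, a_z, r_z, s_{z+1}) ~ D_T, drawn without replacement
    transitions = rng.sample(replay_pool, n_transitions)
    # W = {omega_g ~ D_omega}
    preferences = [sample_preference(n_objectives, rng) for _ in range(n_preferences)]
    # Each stored transition is re-used under all N_omega preferences.
    return [(tr, w) for tr in transitions for w in preferences]


# Hypothetical pool of 10 transitions with 2-dimensional vector rewards.
rng = random.Random(0)
pool = [(i, 0, [0.0, 0.0], i + 1) for i in range(10)]
pairs = sample_her_batch(pool, n_transitions=4, n_preferences=3, n_objectives=2, rng=rng)
```

Because the stored reward is a vector, the same transition remains valid under any preference, which is why each of the 4 transitions can be paired with all 3 preferences to give 12 training pairs.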

IV-B3 Homotopy Optimization

Our goal is to obtain a single model that adapts to the entire Pareto frontier space $\Omega$. By sampling $N_{\mathcal{T}}$ transitions $(s_z,a_z,\mathbf{r}_z,s_{z+1})$ and $N_{\omega}$ preference weights $\mathcal{W}=\{\omega_g\sim\mathcal{D}_{\omega}\}$ from the respective replay buffers $\mathcal{D}_{\mathcal{T}}$ and $\mathcal{D}_{\omega}$, we define the MO-DDQN-envelope element-wise target function [24] as follows:

$$\hat{\mathbf{Q}}(s_z,a_z,\mathbf{r}_z,s_{z+1},\omega_g) = \mathbf{r}_z + \gamma \max_{a'\in\mathcal{A},\,\boldsymbol{\omega}'\in\mathcal{W}} [\omega_g]^{T}\mathbf{Q}(s_{z+1},a',\boldsymbol{\omega}') \tag{28}$$

for all $z\in[1,N_{\mathcal{T}}]$ and $g\in[1,N_{\omega}]$. Finding the optimal preference weight $\boldsymbol{\omega}'$ in $\Omega$ can be an NP-hard problem due to the size and complexity of $\Omega$. Instead, finding the optimal preference $\boldsymbol{\omega}'$ in the sampled set $\mathcal{W}$ is feasible. By replaying the sampled transitions $(s_t,a_t,\mathbf{r}_t,s_{t+1})$ across the $N_{\mathcal{T}}$ transitions, we acquire the empirical estimate of the target function over the new state $s_{t+1}$ as:

$$\hat{\mathbf{Q}}(s_t,a_t,\boldsymbol{\omega}_t;\boldsymbol{\theta}_t') = \mathbb{E}_{s_{t+1}}\left[\mathbf{r}_t + \gamma \arg_{Q}\max_{a_t,\boldsymbol{\omega}_t'} \boldsymbol{\omega}^{T}\mathbf{Q}(s_{t+1},a_t,\boldsymbol{\omega}_t';\boldsymbol{\theta}_t')\right] \tag{29}$$

where $\mathbf{Q}(\cdot)$ follows (27). For training to be correct, the target value $\hat{\mathbf{Q}}$ should be as close as possible to the actual value $\mathbf{Q}$. Accordingly, the loss function $\mathcal{L}^{A}(\boldsymbol{\theta}_t)$ at each time step $t$ is defined as:

$$\mathcal{L}^{A}(\boldsymbol{\theta}_t) = \mathbb{E}_{s_t,a_t,\omega_t}\Big[\big\|\hat{\mathbf{Q}}(s_t,a_t,\omega_t;\boldsymbol{\theta}_t') - \mathbf{Q}(s_t,a_t,\omega_t;\boldsymbol{\theta}_t)\big\|_2^2\Big] \tag{30}$$

Since $\mathcal{L}^{A}(\boldsymbol{\theta}_t)$ contains many local maxima and minima, it is difficult to minimize the mean squared error (MSE) and therefore hard to optimize $Q_{\theta}$. To smooth the landscape of the loss function $\mathcal{L}^{A}(\boldsymbol{\theta}_t)$, we introduce the auxiliary loss function $\mathcal{L}^{B}(\boldsymbol{\theta}_t)$ as:

$$\mathcal{L}^{B}(\boldsymbol{\theta}_t) = \mathbb{E}_{s_t,a_t,\omega_t}\Big[\big|\omega_t^{T}\hat{\mathbf{Q}}(s_t,a_t,\omega_t;\boldsymbol{\theta}_t') - \omega_t^{T}\mathbf{Q}(s_t,a_t,\omega_t;\boldsymbol{\theta}_t)\big|\Big] \tag{31}$$

$\mathcal{L}^{B}(\boldsymbol{\theta}_t)$ contributes smooth policy adaptation, which enhances training efficiency with fewer spikes. $\mathcal{L}^{B}(\boldsymbol{\theta}_t)$ is advantageous for boosting agent training, but it is not as good as $\mathcal{L}^{A}(\boldsymbol{\theta}_t)$ for accurate approximation [24]. Both $\mathcal{L}^{A}(\boldsymbol{\theta}_t)$ and $\mathcal{L}^{B}(\boldsymbol{\theta}_t)$ are averaged over $\omega_t$, which highlights the preference-sampling feature of the proposed algorithm. However, the specific weight $\omega_t$ used in past training is not directly applied to the target state-action transitions. The proposed MO-DDQN-envelope can reevaluate past transitions in $\mathcal{D}_{\mathcal{T}}$ with later, new preferences to enhance learning efficiency and sample utilization.
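
To make the difference between (30) and (31) concrete, the following sketch evaluates both losses on a tiny batch of vector Q-values. The function names and the sample numbers are hypothetical, and the expectations are replaced by plain batch averages.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def loss_a(targets, estimates):
    """Mean squared L2 distance between target and estimated Q vectors, cf. (30)."""
    total = 0.0
    for q_hat, q in zip(targets, estimates):
        total += sum((t - e) ** 2 for t, e in zip(q_hat, q))
    return total / len(targets)


def loss_b(targets, estimates, preferences):
    """Mean absolute gap between scalarized targets and estimates, cf. (31)."""
    total = 0.0
    for q_hat, q, omega in zip(targets, estimates, preferences):
        total += abs(dot(omega, q_hat) - dot(omega, q))
    return total / len(targets)


# Hypothetical single-sample batch with equal preference on both objectives.
targets = [[1.0, 0.0]]
estimates = [[0.0, 1.0]]
preferences = [[0.5, 0.5]]

vector_loss = loss_a(targets, estimates)               # 2.0
scalar_loss = loss_b(targets, estimates, preferences)  # 0.0
```

In this example the scalarized loss vanishes although the two Q vectors differ, which illustrates why $\mathcal{L}^{B}$ on its own is less suited to accurate approximation than $\mathcal{L}^{A}$.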

Combining (30) and (31), we obtain the loss function

$$\mathcal{L}(\boldsymbol{\theta}_t) = (1-\lambda_t)\,\mathcal{L}^{A}(\boldsymbol{\theta}_t) + \lambda_t\,\mathcal{L}^{B}(\boldsymbol{\theta}_t) \tag{32}$$

In homotopy optimization, $\lambda_t$ gradually increases from 0 to 1 along the balance weight path $p_{\lambda}$, which adjusts the balance between $\mathcal{L}^{A}(\boldsymbol{\theta}_t)$ and $\mathcal{L}^{B}(\boldsymbol{\theta}_t)$. The loss function $\mathcal{L}(\boldsymbol{\theta}_t)$ smoothly shifts from $\mathcal{L}^{A}(\boldsymbol{\theta}_t)$ to $\mathcal{L}^{B}(\boldsymbol{\theta}_t)$, so that training first achieves accurate $\mathbf{Q}$ optimization and then benefits from the smoothing utility provided by the auxiliary loss. We first reduce the discrepancy between the target and estimated $\mathbf{Q}$ values, $\hat{\mathbf{Q}}(s_t,a_t,\omega_t;\boldsymbol{\theta}_t) - \mathbf{Q}(s_t,a_t,\omega_t;\boldsymbol{\theta}_t)$, and then take the gradient $\nabla_{\boldsymbol{\theta}_t}$ of the estimated $\mathbf{Q}$ value to adjust the update direction and reduce the MSE. Consequently, the parameters of MO-DDQN-envelope are updated as

$$\boldsymbol{\theta}_{t+1} = \boldsymbol{\theta}_t + \mathbb{E}_{s_t,a_t,s_{t+1}}\Big[\big(\hat{\mathbf{Q}}(s_t,a_t,\omega_t;\boldsymbol{\theta}_t) - \mathbf{Q}(s_t,a_t,\omega_t;\boldsymbol{\theta}_t)\big)^{T}\,\nabla_{\boldsymbol{\theta}_t}\mathbf{Q}(s_t,a_t,\omega_t;\boldsymbol{\theta}_t)\Big] \tag{33}$$

We reset our target $Q$-network with the evaluation $Q$-network every $N^{-}$ steps, i.e., $\theta' \leftarrow \theta$. The training algorithm of the proposed MO-DDQN-envelope is shown in Algorithm 2.
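
The homotopy weighting in (32) and the periodic target reset can be sketched with a few helper functions. The linear form of the path $p_{\lambda}$, the function names, and the step counts are assumptions made for illustration, since the section only states that $\lambda_t$ increases gradually from 0 to 1.

```python
def homotopy_lambda(step, total_steps):
    """Assumed linear path p_lambda: lambda_t rises from 0 to 1, then stays at 1."""
    return min(1.0, step / total_steps)


def combined_loss(loss_a, loss_b, lam):
    """Homotopy loss (1 - lambda) * L^A + lambda * L^B, cf. (32)."""
    return (1.0 - lam) * loss_a + lam * loss_b


def should_sync_target(step, sync_every):
    """True when the target weights theta' should be overwritten by theta."""
    return step > 0 and step % sync_every == 0


# Hypothetical schedule: 100 training steps, target sync every 25 steps.
sync_steps = [t for t in range(101) if should_sync_target(t, 25)]
mid_loss = combined_loss(4.0, 2.0, homotopy_lambda(25, 100))
```

Early in training the combined loss is dominated by $\mathcal{L}^{A}$, and the weight moves toward $\mathcal{L}^{B}$ as the step count grows. Other monotone paths (for example an exponential ramp) would fit the same description.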

Result: Learned action-value function $\mathbf{Q}_{\theta}$ and policy $\pi$
Data: Evaluation $Q$-network $\mathbf{Q}_{\theta}$, target $Q$-network $\mathbf{Q}_{\theta'}$, preference sampling pool $\mathcal{D}_{\omega}$, HER transition sampling pool $\mathcal{D}_{\mathcal{T}}$, balance weight path $p_{\lambda}$
Initialization:
HER replay buffer $\mathcal{D}_{\mathcal{T}} \leftarrow \emptyset$,
Initialize Q-network weights $\theta$ randomly,
Initialize target Q-network weights $\theta' \leftarrow \theta$,
Initialize $Q(s,a)$ for all states $s$ and actions $a$, including AVs, TBSs, RBSs.
while episode $<$ episode limit and runtime $<$ time limit do
       Initialize $t \leftarrow 0$ and state $s_t$ based on environment
       while $t \leq T_{hl}$ do
             for target AV $j$ from 1 to $M_1$ do
                   Select $a_t$ at random from $\mathcal{A}$ with probability $\epsilon$, or select $a_t = \arg\max_{a\in\mathcal{A}} \boldsymbol{\omega}^{T}\mathbf{Q}(s_t,a,\boldsymbol{\omega};\boldsymbol{\theta}_t)$ with probability $1-\epsilon$.
                   Derive $a^{\mathrm{tran}}_t$ and $a^{\mathrm{tele}}_t$ from $a_t$;
                   Apply $a^{\mathrm{tran}}_t$ and $a^{\mathrm{tele}}_t$ to target AV $j$;
                   Observe vector reward $\mathbf{r}_t$ and next state $s_{t+1}$;
             end for
            if update neural network then
                   Store (st,at,𝐫t,st+1)subscript𝑠𝑡subscript𝑎𝑡subscript𝐫𝑡subscript𝑠𝑡1(s_{t},a_{t},\mathbf{r}_{t},s_{t+1})( italic_s start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , bold_r start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_t + 1 end_POSTSUBSCRIPT ) in 𝒟𝒯subscript𝒟𝒯\mathcal{D}_{\mathcal{T}}caligraphic_D start_POSTSUBSCRIPT caligraphic_T end_POSTSUBSCRIPT;
                   Hindsight Experience Replay (HER):
                   {(sz,az,𝐫z,sz+1)𝒟𝒯}similar-tosubscript𝑠𝑧subscript𝑎𝑧subscript𝐫𝑧subscript𝑠𝑧1subscript𝒟𝒯\{(s_{z},a_{z},\mathbf{r}_{z},s_{z+1})\sim\mathcal{D}_{\mathcal{T}}\}{ ( italic_s start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , bold_r start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_z + 1 end_POSTSUBSCRIPT ) ∼ caligraphic_D start_POSTSUBSCRIPT caligraphic_T end_POSTSUBSCRIPT };
                   Sample Nωsubscript𝑁𝜔N_{\omega}italic_N start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT preferences 𝒲={ωg𝒟ω}𝒲similar-tosubscript𝜔𝑔subscript𝒟𝜔\mathcal{W}=\{\omega_{g}\sim\mathcal{D}_{\omega}\}caligraphic_W = { italic_ω start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT ∼ caligraphic_D start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT };
                  Bellman Update:
                   Compute 𝐐^(sz,az,𝐫z,sz+1,ωg)^𝐐subscript𝑠𝑧subscript𝑎𝑧subscript𝐫𝑧subscript𝑠𝑧1subscript𝜔𝑔\hat{\mathbf{Q}}(s_{z},a_{z},\mathbf{r}_{z},s_{z+1},\omega_{g})over^ start_ARG bold_Q end_ARG ( italic_s start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , italic_a start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , bold_r start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT italic_z + 1 end_POSTSUBSCRIPT , italic_ω start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT ) for each sampled transition and preference:
{𝐫z,if sz+1 is terminal(28),otherwisecasessubscript𝐫𝑧if subscript𝑠𝑧1 is terminal28otherwise\begin{cases}\mathbf{r}_{z},&\text{if }s_{z+1}\text{ is terminal}\\ (\ref{eq:mo_envelope_ddqn_target}),&\text{otherwise}\end{cases}{ start_ROW start_CELL bold_r start_POSTSUBSCRIPT italic_z end_POSTSUBSCRIPT , end_CELL start_CELL if italic_s start_POSTSUBSCRIPT italic_z + 1 end_POSTSUBSCRIPT is terminal end_CELL end_ROW start_ROW start_CELL ( ) , end_CELL start_CELL otherwise end_CELL end_ROW
z[1,N𝒯]for-all𝑧1subscript𝑁𝒯\forall z\in[1,N_{\mathcal{T}}]∀ italic_z ∈ [ 1 , italic_N start_POSTSUBSCRIPT caligraphic_T end_POSTSUBSCRIPT ] and g[1,Nω]for-all𝑔1subscript𝑁𝜔\forall g\in[1,N_{\omega}]∀ italic_g ∈ [ 1 , italic_N start_POSTSUBSCRIPT italic_ω end_POSTSUBSCRIPT ] Homotopy Optimization:
                   Update 𝐐θsubscript𝐐𝜃\mathbf{Q}_{\theta}bold_Q start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT by minimizing the loss with
                   gradient descent by (32);
                   Gradually increase λ𝜆\lambdaitalic_λ following the path pλsubscript𝑝𝜆p_{\lambda}italic_p start_POSTSUBSCRIPT italic_λ end_POSTSUBSCRIPT;
             end if
            Update target 𝐐θsubscript𝐐superscript𝜃\mathbf{Q}_{\theta^{\prime}}bold_Q start_POSTSUBSCRIPT italic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT end_POSTSUBSCRIPT weights θθsuperscript𝜃𝜃\theta^{\prime}\leftarrow\thetaitalic_θ start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT ← italic_θ by (33) every Nsuperscript𝑁N^{-}italic_N start_POSTSUPERSCRIPT - end_POSTSUPERSCRIPT steps;
       end while
      tt+1𝑡𝑡1t\leftarrow t+1italic_t ← italic_t + 1
       Compute policy π𝜋\piitalic_π based on learned 𝐐θsubscript𝐐𝜃\mathbf{Q}_{\theta}bold_Q start_POSTSUBSCRIPT italic_θ end_POSTSUBSCRIPT;
end while
Algorithm 2 Multi-Objective Envelope DDQN
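
As an illustration of the action-selection step in Algorithm 2, the following minimal Python sketch applies $\epsilon$-greedy selection to scalarized multi-objective Q-values. It is a sketch under stated assumptions, not the implementation used in the experiments: the function name `select_action`, the array layout of `q_values`, and all numerical values are hypothetical, and the Q-network is replaced by a fixed array.

```python
import numpy as np

def select_action(q_values, omega, epsilon, rng):
    """Epsilon-greedy selection over scalarized multi-objective Q-values.

    q_values: array of shape (num_actions, num_objectives); a stand-in for
              Q(s_t, a, omega; theta_t) evaluated for every action a.
    omega:    preference vector of shape (num_objectives,).
    """
    num_actions = q_values.shape[0]
    if rng.random() < epsilon:
        # Explore: draw an action uniformly from the action space.
        return int(rng.integers(num_actions))
    # Exploit: argmax over actions of omega^T Q(s_t, a, omega).
    return int(np.argmax(q_values @ omega))

rng = np.random.default_rng(0)
# Hypothetical Q-values: 3 actions, 2 objectives (transportation, telecommunication).
q = np.array([[1.0, 0.2],
              [0.4, 0.9],
              [0.7, 0.7]])
greedy_tran = select_action(q, np.array([1.0, 0.0]), epsilon=0.0, rng=rng)
greedy_tele = select_action(q, np.array([0.0, 1.0]), epsilon=0.0, rng=rng)
greedy_mix = select_action(q, np.array([0.5, 0.5]), epsilon=0.0, rng=rng)
explore = select_action(q, np.array([0.5, 0.5]), epsilon=1.0, rng=rng)
```

With $\epsilon = 0$ the greedy action changes with the preference vector, which is the behaviour the preference-conditioned Q-function is meant to provide.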

IV-C Complexity Analysis

We first analyze the main loop of Algorithm 2. The key components contributing to the time complexity include:

  • Action Selection: We consider the horizon limit for each episode, denoted as $T_{\text{hl}}$. The process of selecting an action at each timestep incurs a time complexity of $\mathcal{O}(|\mathcal{A}|)$ for each target AV, where $|\mathcal{A}|$ represents the size of the action space. Thus, the time complexity is $\mathcal{O}(M_1 \cdot T_{\text{hl}} \cdot |\mathcal{A}|)$.

  • Hindsight Experience Replay (HER): Sampling $N_{\mathcal{T}}$ transitions from the HER replay buffer and $N_{\omega}$ preferences results in a time complexity of $\mathcal{O}(N_{\mathcal{T}} \cdot N_{\omega})$.

  • Homotopy Optimization: This phase involves a Fully Connected Neural Network (FCNN) consisting of an input layer, an output layer, and $E$ fully connected hidden layers. Let $N_e$ denote the number of neurons in the $e$-th fully connected layer. The time complexity for this phase is $\mathcal{O}\left(\sum_{e=1}^{E+1} (N_{e-1} \cdot N_e)\right)$, accounting for the computational cost for each layer’s connections.

  • Policy Adaptation: Utilizing a DDQN, the input layer’s neurons correspond to the dimensionality of the state space $\mathcal{S}$, and the output layer’s neurons correspond to the action space $\mathcal{A}$. Thus, the numbers of neurons in the input and output layers are $N_0 = |\mathcal{S}|$ and $N_{E+1} = |\mathcal{A}|$, respectively. The time complexity for this phase is $\mathcal{O}(|\mathcal{S}| \cdot |\mathcal{A}|)$, which can be considered constant upon convergence.

Therefore, the overall time complexity for MO-DDQN-envelope can be expressed as $\mathcal{O}\left(M_1 \cdot T_{\text{hl}} \cdot |\mathcal{A}| + N_{\mathcal{T}} \cdot N_{\omega} + \sum_{e=1}^{E+1} (N_{e-1} \cdot N_e) + |\mathcal{S}| \cdot |\mathcal{A}|\right)$.
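
To make the individual terms concrete, the short Python sketch below evaluates the four terms of the overall expression for one configuration. The values of $M_1$, $T_{\text{hl}}$, $N_{\mathcal{T}}$, $N_{\omega}$ and the hidden-layer sizes follow Section V-A and Table I; the state dimension (25) and the action-space size (15) are hypothetical placeholders, and the function names are illustrative only.

```python
def fcnn_cost(layer_sizes):
    """Sum of N_{e-1} * N_e over consecutive layers, e = 1, ..., E+1."""
    return sum(n_in * n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

def overall_cost(m1, t_hl, num_actions, n_transitions, n_prefs, layer_sizes):
    """Add up the four terms of the overall complexity expression."""
    state_dim = layer_sizes[0]                   # N_0 = |S|
    action_selection = m1 * t_hl * num_actions   # M_1 * T_hl * |A|
    her_sampling = n_transitions * n_prefs       # N_T * N_omega
    network = fcnn_cost(layer_sizes)             # sum_e N_{e-1} * N_e
    policy = state_dim * num_actions             # |S| * |A|
    return action_selection + her_sampling + network + policy

# Hypothetical |S| = 25 and |A| = 15; hidden sizes as listed in Table I.
layers = [25, 256, 256, 256, 256, 15]
total = overall_cost(m1=5, t_hl=30, num_actions=15,
                     n_transitions=2_000_000, n_prefs=4, layer_sizes=layers)
print(fcnn_cost(layers), total)
```

Under these placeholder values the HER sampling term dominates, which suggests that $N_{\mathcal{T}}$ and $N_{\omega}$ are the parameters to watch when scaling the method.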

V Simulation and Performance Evaluation

In this section, we demonstrate the performance of the proposed algorithms and highlight the complex dynamics between wireless connectivity, traffic flow, and AV speed. All experiments are executed on a PC running Windows 11 with an Intel i7-8770 CPU at 3.2 GHz, 16 GB of DDR5 memory, and an AMD RX580 GPU with 8 GB of GDDR5 memory. Additionally, Google Colab is the cloud platform employed for reproduction and verification.

V-A Simulation Environment

Our proposed simulation framework is composed of three main components:

  • Telecommunication and Transportation Environment: We introduce the MO-RF-THz-Highway-env framework (https://github.com/sunnyyzj/highway-env-1.7), an enhanced version of highway-env [16], designed to support both autonomous driving policy and 5G/6G network selection for multiple AVs.

  • Extended MO-Gymnasium: We extend MO-Gymnasium (https://github.com/sunnyyzj/MO-Gymnasium) [30] to provide an application programming interface (API) to communicate between DRL algorithms and MO-RF-THz-Highway-env.

  • MORL Algorithms Simulation: For single-policy MORL, the simulation uses a modified version of rl-agents (https://github.com/sunnyyzj/rl-agents) from [31]. For multi-policy MORL, the simulations of the proposed MO-DDQN-envelope (https://github.com/sunnyyzj/morl-baselines) and the other multi-policy algorithms are extended from MORL-baselines [32].

The MO-RF-THz-Highway-env features five one-way lanes, each with a length of 1500 m and a width of 4 m, as depicted in Figure 1. In our experimental setup, the default configuration consists of 5 target AVs and 20 surrounding AVs, each having a length of 5 m. The longitudinal velocity of each AV ranges from 10 m/s to 35 m/s. At the beginning of each episode, these AVs are randomly placed across the five lanes. Along both sides of the highway, 5 RBSs and 10 to 50 TBSs are also randomly positioned at the beginning of each episode to ensure a non-uniform distribution. This random placement strategy aims to facilitate the examination of MORL training effectiveness across various VNets and traffic scenarios, as opposed to a single, common scenario. The maximum duration of each episode is set to 30 time steps. An episode is considered collision-free if it meets two criteria: absence of collisions among all target AVs, and maintenance of a high weighted data rate, as defined in (6), throughout the episode. For single-policy MORL, we set the reward coefficients $c_1, c_2, c_3, c_4, c_5$ in (20) and (21) from Section III-B3 to $0.4$, $1$, $0.1$, $0.2$, and $4.5 \times 10^{-7}$, respectively.

We utilize a Multi-Layer Perceptron (MLP) to construct the $Q_{\theta}$ neural network in the training and evaluation phases. Our architecture comprises three FNN layers, each hosting 128 neurons. The ReLU activation function is applied after these three layers. The output layer of the target policy network $Q_{\theta}^{\prime}$ employs a sigmoid function to constrain the range of actions. Apart from the input and output layers, the MO evaluation network $Q_{\theta}$ and the target policy network $Q_{\theta}^{\prime}$ share identical architectures and utilize the same activation functions. Simulation parameters are detailed in Table I unless otherwise specified.

Table I: Values of system parameters in experiments
Parameter | Value
Values used in system model |
RBSs frequency ($f_R$) | 3.5 GHz
TBSs frequency ($f_T$) | 1 THz
Maximum number of affordable AVs quota for a single RBS, TBS ($Q_R$), ($Q_T$) | 5, 10
Antenna gain for TBSs and RBSs ($G_T^{\mathrm{tx}}$), ($G_T^{\mathrm{rx}}$) | 316.2
RBSs channel bandwidth ($W_R$) | $4 \times 10^{7}$
TBSs channel bandwidth ($W_T$) | $5 \times 10^{8}$
Transmit powers of RBSs and TBSs ($P_R^{\mathrm{tx}}$), ($P_T^{\mathrm{tx}}$) | 1 W
Molecular absorption coefficient ($K_a$) | 0.05 m$^{-1}$
Path loss exponent ($\alpha$) | 4
Length for each AV ($l_j$) | 5 m
Target AV heading and lateral control gain ($K_j^{\psi}$), ($K_j^{y}$) | 5, $\frac{5}{3}$
Maximum AV steering angle for AV $j$ ($\max \beta_j$) | $\frac{\pi}{3}$
Surrounding AV $j$ desired maximum acceleration and deceleration ($a_j$) | 3 m/s$^2$, -5 m/s$^2$
Acceleration reduction factor ($\delta_a$) | 4
Number of lanes ($N_L$) | 5
Values used in MORL |
Learning rate ($\alpha_l$) | $3 \times 10^{-4}$
Discount factor ($\gamma$) | 0.995
Size of the hidden layers of the value NN | $[256, 256, 256, 256]$
Epsilon decay parameter ($\epsilon$) | 0.1
MO-DDQN-envelope HER transition pool size ($N_{\mathcal{T}}$) | $2 \times 10^{6}$
Number of weight vectors to sample for the envelope target ($N_{\omega}$) | 4
Frequency for cloning evaluation to target network ($N^{-}$) | 200
Episode horizon limit ($T_{hl}$) | 30

V-B Baselines and Evaluation Metrics

To comprehensively evaluate the performance of MO-DDQN-envelope, we implement four baseline algorithms for comparison: MO-DQN, MO-DDQN (as discussed in Section IV-A), and two additional MORL algorithms, namely MO-dueling-DDQN and MO-PPO. MO-dueling-DDQN and MO-PPO are variations of MO-DDQN, employing single-policy multi-objective dueling DDQN and Proximal Policy Optimization (PPO) algorithms instead of DDQN, respectively. These algorithms are developed specifically for performance evaluation. It is noteworthy that dueling-DDQN and PPO are extensions of single-policy single-objective RL algorithms proposed in [33] and [34], respectively.

To evaluate the performance, we consider the following metrics. Assume episode $e$ ends at time step $T_e$. We define (1) total transportation reward: $R^{tran}_e = \mathbb{E}_{j \in M_1}\left[\sum_{t=1}^{T_e} r^{\mathrm{j,tran}}_t\right]$; (2) total communication reward: $R^{tele}_e = \mathbb{E}_{j \in M_1}\left[\sum_{t=1}^{T_e} r^{\mathrm{j,tele}}_t\right]$; (3) collision rate: $\delta_e = 1 - \frac{T_e}{T_{hl}}$; and (4) HOs probability: $\xi_e = \mathbb{E}_{j \in M_1}\left[\xi^{j}_{T_e}\right]$, where $r^{\mathrm{j,tran}}_t$ and $r^{\mathrm{j,tele}}_t$ are the instantaneous transportation and telecommunication rewards specified in (20) and (21), respectively, $T_{hl}$ represents the horizon limit of each episode, and $\xi^{j}_{T_e}$ denotes the HOs probability by the end of each episode (step $T_e$) defined in (21).
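
The four metrics can be computed from per-episode logs as in the following Python sketch. It assumes that the expectation over $j \in M_1$ is taken as the empirical mean over target AVs and that rewards are logged as arrays of shape $(M_1, T_e)$; the function name `episode_metrics`, the dictionary keys, and the numerical values are hypothetical.

```python
import numpy as np

def episode_metrics(r_tran, r_tele, ho_prob_final, t_hl):
    """Compute the four evaluation metrics for a single episode.

    r_tran, r_tele: per-step rewards of shape (M1, T_e), one row per target AV.
    ho_prob_final:  HO probability of each target AV at step T_e, shape (M1,).
    t_hl:           episode horizon limit T_hl.
    """
    r_tran = np.asarray(r_tran, dtype=float)
    r_tele = np.asarray(r_tele, dtype=float)
    t_e = r_tran.shape[1]  # step at which the episode ended
    return {
        # Sum over time steps, then average over target AVs.
        "R_tran": float(r_tran.sum(axis=1).mean()),
        "R_tele": float(r_tele.sum(axis=1).mean()),
        # An episode cut short at T_e < T_hl counts toward the collision rate.
        "collision_rate": 1.0 - t_e / t_hl,
        "ho_prob": float(np.mean(ho_prob_final)),
    }

# Hypothetical episode: 2 target AVs, ended at T_e = 3 with T_hl = 4.
metrics = episode_metrics(
    r_tran=[[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]],
    r_tele=[[0.5, 0.5, 0.5], [1.0, 1.0, 1.0]],
    ho_prob_final=[0.2, 0.4],
    t_hl=4,
)
```

An episode that runs to the horizon limit gives $T_e = T_{hl}$ and hence a collision rate of zero.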

Figure 3: Training performance on (a) total transportation rewards, (b) total telecommunication rewards, (c) collision rate, and (d) HOs probability. $n_T = 20$; the desired minimum and maximum longitudinal velocities are $v_{\text{min}} = 20$ m/s and $v_{\text{max}} = 30$ m/s, respectively, for the top row and $v_{\text{min}} = 30$ m/s and $v_{\text{max}} = 40$ m/s, respectively, for the bottom row; the numbers of AVs are 20 and 50 in the top and bottom rows, respectively.
Figure 4: Evaluation performance on (1) total transportation rewards, (2) total telecommunication rewards, (3) collision rate, and (4) HOs probability, as a function of (a) variation in TBSs with counts ranging from 10 to 50 while maintaining an AV velocity of 25 m/s for 20 AVs, (b) variation in vehicle numbers from 10 to 50 with a fixed AV velocity of 25 m/s alongside 20 TBSs, and (c) variation in desired velocity from 5 to 30 m/s with fixed 20 AVs and 20 TBSs.

V-C Results and Discussions

V-C1 Training Performance

We examine the training performance of the proposed MO-DDQN-envelope algorithm and compare it with the baseline MORL approaches. Fig. 3 depicts the total transportation reward, total telecommunication reward, collision rate, and HOs probability as a function of the desired velocity of the AVs. MO-DDQN-envelope performs better than the benchmark algorithms (i.e., MO-DQN and MO-DDQN) when performance is evaluated every 100 episodes. The learning curves show that MO-DDQN-envelope converges more slowly, but to higher total cumulative training rewards, for both the transportation and telecommunication objectives. However, the MO-DDQN-envelope algorithm balances both objectives, and its collision rate and HOs probability decrease more than those of the other benchmarks. To better understand this improvement, recall that MO-DDQN-envelope samples experience from the replay buffer, which contains mostly recent past preferences and only rarely newly explored preferences. Past preferences are based on the weight vector $\vec{\omega} = [\omega_{\mathbf{tran}}, \omega_{\mathbf{tele}}]^{T}$ over the transportation and telecommunication objectives, which marginally improves the training for each objective individually regardless of preferences.

Regarding the convergence rate, thanks to homotopy optimization, training first focuses on accuracy on the two objectives and later gradually shifts toward faster convergence; as a result, training rewards vary less after 3000 episodes of training than in the early stage. We also find that the collision rate is significantly reduced from 0.7 to around 0.2, which illustrates that the trained model improves safety on the highway.
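
The homotopy schedule can be sketched as follows. Since the loss in (32) is not restated in this section, the sketch assumes the form used in [24]: a loss $L^A$ on the vector-valued Q estimates, an auxiliary loss $L^B$ on the scalarized values, and a combined loss $(1-\lambda) L^A + \lambda L^B$ in which $\lambda$ is increased from 0 to 1. The linear path in `lambda_path`, the function names, and the numerical values are hypothetical.

```python
import numpy as np

def homotopy_loss(q_pred, q_target, omega, lam):
    """Combined loss (1 - lam) * L_A + lam * L_B (form assumed from [24]).

    q_pred, q_target, omega: arrays of shape (batch, num_objectives).
    """
    diff = q_target - q_pred
    # L_A: squared error between vector-valued targets and estimates.
    loss_a = np.mean(np.sum(diff ** 2, axis=1))
    # L_B: absolute error between the scalarized values omega^T Q.
    loss_b = np.mean(np.abs(np.sum(omega * diff, axis=1)))
    return (1.0 - lam) * loss_a + lam * loss_b

def lambda_path(step, total_steps):
    """Hypothetical linear path p_lambda, increasing from 0 and capped at 1."""
    return min(1.0, step / total_steps)

q_pred = np.array([[1.0, 2.0]])
q_target = np.array([[2.0, 4.0]])
omega = np.array([[0.5, 0.5]])
loss_start = homotopy_loss(q_pred, q_target, omega, lambda_path(0, 100))
loss_mid = homotopy_loss(q_pred, q_target, omega, lambda_path(50, 100))
loss_end = homotopy_loss(q_pred, q_target, omega, lambda_path(100, 100))
```

Early in training the vector-valued term dominates, and the weight then moves to the scalarized term as $\lambda$ grows.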

V-C2 Impact of BSs Density

We evaluate the performance by averaging over 500 evaluation epochs on models after 4500 episodes of training. In each evaluation step, we randomly distribute different numbers of TBSs alongside the highway in the simulation environment. As depicted in Fig. 4, MO-DDQN-envelope gains an evaluation advantage compared to the other benchmarks. As the number of BSs grows, the transportation rewards do not change significantly; however, the telecommunication rewards first increase due to better connectivity and later decrease due to more HOs. A growing number of TBSs also increases the average collision rate, potentially because AVs reduce their speed to maintain connectivity.

V-C3 Impact of the Number of AVs

As shown in Fig. 4, increasing the number of AVs leads to more crowded highway scenarios. Thus, more clustered AVs connect to the same BS, resulting in network outages due to the maximum quota at each BS. Also, frequent lane changes and speeding on the crowded highway cause more congestion and collisions, which reduces transportation performance.

V-C4 Impact of Desired AV Speeds

Fig. 4 shows that slow-moving AVs perform better in terms of both transportation and telecommunication rewards. Increasing speeds lead to more collisions. Moreover, AVs at higher speeds switch BSs more frequently, incurring significant handover penalties according to (21), thus adversely affecting rewards in both domains.

V-C5 Pareto Front Analysis

We employ the CCS in (19) to assess the quality of the estimated Pareto fronts. A greater Pareto frontier value indicates that the Pareto front is closer to the optimal one in terms of the transportation and telecommunication objectives. To compute the CCS, we select the performance of single-policy MO-DQN and MO-DDQN as reference points. Recall that we need to maximize both transportation and telecommunication rewards, so the best solutions are situated in the top right corner, as depicted in Fig. 5. Specifically, Fig. 5 demonstrates that our proposed MO-DDQN-envelope algorithm outperforms the MORL baselines in approximating the Pareto fronts. However, in the high-density transportation scenario, MO-DDQN yields a Pareto front of similar quality to the other baselines.
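
For intuition, the Python sketch below extracts the non-dominated points and an approximate CCS from a set of (transportation, telecommunication) reward pairs. It is a simplified illustration and not the evaluation code used for Fig. 5: it assumes two objectives that are both maximized, approximates the CCS by sweeping a grid of linear preferences, and uses hypothetical function names and data.

```python
import numpy as np

def pareto_front(points):
    """Return the non-dominated rows of `points` (all objectives maximized)."""
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        # p is dominated if another point is >= in every objective
        # and strictly > in at least one.
        dominated = np.any(np.all(pts >= p, axis=1) & np.any(pts > p, axis=1))
        if not dominated:
            keep.append(i)
    return pts[keep]

def convex_coverage_set(points, num_prefs=101):
    """Points maximizing omega^T R for at least one sampled preference omega."""
    pts = np.asarray(points, dtype=float)
    w_tran = np.linspace(0.0, 1.0, num_prefs)
    omegas = np.stack([w_tran, 1.0 - w_tran], axis=1)  # [w_tran, w_tele]
    best = np.unique(np.argmax(omegas @ pts.T, axis=1))
    return pts[best]

# Hypothetical (R_tran, R_tele) pairs obtained from different policies.
rewards = [[1.0, 5.0], [2.5, 2.5], [5.0, 1.0], [2.0, 2.0]]
front = pareto_front(rewards)
ccs = convex_coverage_set(rewards)
```

In this example the point (2.5, 2.5) is Pareto-optimal but lies below the line joining (1, 5) and (5, 1), so no linear preference selects it and it is excluded from the CCS.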

Figure 5: Pareto frontier comparison in MOO for total transportation reward ($R^{tran}$) and total telecommunication reward ($R^{tele}$) among MO-DQN, MO-DDQN, MO-dueling-DDQN, MO-PPO, and MO-DDQN-envelope, across instances: (a) I-(20,30,20,20), (b) I-(20,30,10,20), (c) I-(20,30,20,50)

VI Conclusion

We introduce a novel MORL framework tailored for devising joint network selection and autonomous driving policies within a multi-band VNet. Our goals encompass enhancing traffic flow, minimizing collisions, maximizing data rates, and minimizing handoffs (HOs). We achieve this through controlling vehicle motion dynamics and network selection, employing a unique reward function that optimizes data rate, traffic flow, load balancing, and penalizes HOs and unsafe driving behaviors. The problem is formalized as a MOMDP, integrating telecommunication and autonomous driving utilities in its rewards. We propose single policy MORL solutions with predefined preferences, transforming the MOOP into a single-objective one, utilizing DQN and double DQN solutions to derive optimal policies dependent on relative preferences.

Addressing the challenge of learning optimized policies across varied preferences, we develop an envelope MORL solution. This approach enables effective navigation across preference spaces, generating tailored policies. Our algorithm leverages the contraction properties of the optimality operator governing a generalized Bellman equation and optimizes the convex envelope of multi-objective Q-values. Hindsight experience replay and homotopy optimization aid in manageable learning across diverse preferences. Additionally, we construct a novel simulation testbed, “RF-THz-Highway-Env,” based on “highway-env,” emulating a multi-band wireless network-enabled VNet. Numerical results demonstrate the superiority of our proposed solution over weighted sum-based MORL solutions with DQN, showcasing improvements of 12.7%, 18.9%, and 12.3% on average transportation reward, average communication reward, and average HO rate, respectively.

References

  • [1] Z. Yan and H. Tabassum, “Reinforcement learning for joint v2i network selection and autonomous driving policies,” in GLOBECOM 2022 - 2022 IEEE Global Commun. Conf., 2022, pp. 1241–1246.
  • [2] H. Ye, G. Y. Li, and B.-H. F. Juang, “Deep reinforcement learning based resource allocation for V2V communications,” IEEE Trans. on Vehicular Tech., vol. 68, no. 4, pp. 3163–3173, 2019.
  • [3] L. Liang, H. Ye, and G. Y. Li, “Spectrum sharing in vehicular networks based on multi-agent reinforcement learning,” IEEE Journal on Sel. Areas in Commun., vol. 37, no. 10, pp. 2282–2292, 2019.
  • [4] Y. Xu, K. Zhu, H. Xu, and J. Ji, “Deep reinforcement learning for multi-objective resource allocation in multi-platoon cooperative vehicular networks,” IEEE Trans. on Wireless Commun., 2023.
  • [5] X. Hu, Y. Zhang, X. Liao, Z. Liu, W. Wang, and F. M. Ghannouchi, “Dynamic beam hopping method based on multi-objective deep reinforcement learning for next generation satellite broadband systems,” IEEE Trans. on Broadcasting, vol. 66, no. 3, pp. 630–646, 2020.
  • [6] G. Yu, Y. Jiang, L. Xu, and G. Y. Li, “Multi-objective energy-efficient resource allocation for multi-rat heterogeneous networks,” IEEE Journal on Sel. Areas in Commun., vol. 33, no. 10, pp. 2118–2127, 2015.
  • [7] R. Devarajan, S. C. Jha, U. Phuyal, and V. K. Bhargava, “Energy-aware resource allocation for cooperative cellular network using multi-objective optimization approach,” IEEE Trans. on Wireless Commun., vol. 11, no. 5, pp. 1797–1807, 2012.
  • [8] D. Guo, L. Tang, X. Zhang, and Y.-C. Liang, “Joint optimization of handover control and power allocation based on multi-agent deep reinforcement learning,” IEEE Trans. on Vehicular Tech., vol. 69, no. 11, pp. 13 124–13 138, 2020.
  • [9] H. Khan, A. Elgabli, S. Samarakoon, M. Bennis, and C. S. Hong, “Reinforcement learning-based vehicle-cell association algorithm for highly mobile millimeter wave communication,” IEEE Trans. on Cognitive Commun. and Networking, vol. 5, no. 4, pp. 1073–1085, 2019.
  • [10] Z. Wu, K. Qiu, and H. Gao, “Driving policies of V2X autonomous vehicles based on reinforcement learning methods,” IET Intelligent Transport Systems, vol. 14, no. 5, pp. 331–337, 2020.
  • [11] Z. Yan, W. Jaafar, B. Selim, and H. Tabassum, “Multi-uav speed control with collision avoidance and handover-aware cell association: Drl with action branching,” in GLOBECOM 2023 - 2023 IEEE Global Commun. Conf., 2023, pp. 5067–5072.
  • [12] X. Liu, Y. Liu, Y. Chen, and L. Hanzo, “Enhancing the fuel-economy of v2i-assisted autonomous driving: A reinforcement learning approach,” IEEE Trans. on Vehicular Tech., vol. 69, no. 8, pp. 8329–8342, 2020.
  • [13] A. Alizadeh, M. Moghadam, Y. Bicer, N. K. Ure, U. Yavas, and C. Kurtulus, “Automated lane change decision making using deep reinforcement learning in dynamic and uncertain highway environment,” in 2019 IEEE intelligent transportation systems conference (ITSC).   IEEE, 2019, pp. 1399–1404.
  • [14] X. He and C. Lv, “Towards energy-efficient autonomous driving: A multi-objective reinforcement learning approach,” IEEE/CAA Journal of Automatica Sinica, vol. 10, no. 5, pp. 1329–1331, 2023.
  • [15] W. Wei, R. Yang, H. Gu, W. Zhao, C. Chen, and S. Wan, “Multi-objective optimization for resource allocation in vehicular cloud computing networks,” IEEE Trans. on Intelligent Transportation Systems, vol. 23, no. 12, pp. 25 536–25 545, 2021.
  • [16] E. Leurent, “An environment for autonomous driving decision-making,” https://github.com/eleurent/highway-env, 2018.
  • [17] J. Sayehvand and H. Tabassum, “Interference and coverage analysis in coexisting RF and dense terahertz wireless networks,” IEEE Wireless Commun. Letters, vol. 9, no. 10, pp. 1738–1742, 2020.
  • [18] C. She, C. Sun, Z. Gu, Y. Li, C. Yang, H. V. Poor, and B. Vucetic, “A tutorial on ultrareliable and low-latency communications in 6G: Integrating domain knowledge into deep learning,” Proceedings of the IEEE, vol. 109, no. 3, pp. 204–246, 2021.
  • [19] Y. Polyanskiy, H. V. Poor, and S. Verdú, “Channel coding rate in the finite blocklength regime,” IEEE Trans. on Information Theory, vol. 56, no. 5, pp. 2307–2359, 2010.
  • [20] P. Polack, F. Altché, B. d’Andréa Novel, and A. de La Fortelle, “The kinematic bicycle model: A consistent model for planning feasible trajectories for autonomous vehicles?” in 2017 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2017, pp. 812–818.
  • [21] E. Leurent, “Safe and efficient reinforcement learning for behavioural planning in autonomous driving,” Ph.D. dissertation, Université de Lille, 2020.
  • [22] M. Treiber and A. Kesting, “Traffic flow dynamics,” Traffic Flow Dynamics: Data, Models and Simulation, Springer-Verlag Berlin Heidelberg, pp. 187–201, 2013.
  • [23] A. Kesting, M. Treiber, and D. Helbing, “General lane-changing model MOBIL for car-following models,” Transportation Research Record, vol. 1999, no. 1, pp. 86–94, 2007.
  • [24] R. Yang, X. Sun, and K. Narasimhan, “A generalized algorithm for multi-objective reinforcement learning and policy adaptation,” Advances in Neural Information Processing Systems, vol. 32, 2019.
  • [25] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double Q-learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
  • [26] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [27] H.-M. Chen, S.-F. Wang, P. Wang, S. Lin, and C. Fang, “Deep Q-learning for intelligent band coordination in 5G heterogeneous network supporting V2X communication,” Wireless Commun. and Mobile Computing, 2022.
  • [28] Y. Hou, L. Liu, Q. Wei, X. Xu, and C. Chen, “A novel DDPG method with prioritized experience replay,” in IEEE Intl. Conf. on Systems, Man, and Cybernetics (SMC). IEEE, 2017, pp. 316–321.
  • [29] M. Andrychowicz, F. Wolski, A. Ray, J. Schneider, R. Fong, P. Welinder, B. McGrew, J. Tobin, P. Abbeel, and W. Zaremba, “Hindsight experience replay,” Advances in Neural Information Processing Systems, vol. 30, 2017.
  • [30] L. N. Alegre, F. Felten, E.-G. Talbi, G. Danoy, A. Nowé, A. L. C. Bazzan, and B. C. da Silva, “MO-Gym: A library of multi-objective reinforcement learning environments,” in Proceedings of the 34th Benelux Conf. on Artificial Intelligence BNAIC/Benelearn 2022, 2022.
  • [31] E. Leurent, “rl-agents: Implementations of reinforcement learning algorithms,” https://github.com/eleurent/rl-agents, 2018.
  • [32] F. Felten, L. N. Alegre, A. Nowé, A. L. C. Bazzan, E.-G. Talbi, G. Danoy, and B. C. da Silva, “A toolkit for reliable benchmarking and research in multi-objective reinforcement learning,” in Proceedings of the 37th Conf. on Neural Information Processing Systems (NeurIPS 2023), 2023.
  • [33] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Dueling network architectures for deep reinforcement learning,” in International Conference on Machine Learning. PMLR, 2016, pp. 1995–2003.
  • [34] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.